Cosine Similarity Algorithm

Arko-Sengupta · 2024-09-03T08:30:46Z

Overview

Introduces a new implementation of the Cosine Similarity Algorithm in the Cosine_Similarity class. Cosine Similarity is a widely used metric in Natural Language Processing and Information Retrieval for measuring the similarity between two texts based on their Vector Representations.

Key Features

Vector Representation: Utilizes SpaCy's pre-trained Word Embeddings to convert text into Vectors.
Tokenization: Breaks down input text into lowercased tokens, excluding punctuation.
Vectorization: Converts tokens into their corresponding vectors using SpaCy's embeddings.
Mean Vector Calculation: Computes the Mean Vector for a set of Word Vectors to represent the overall text.
Cosine Similarity Calculation: Measures the cosine of the angle between two vectors, providing a Similarity Score ranging from -1 to 1.
Cosine Similarity Percentage: Outputs the similarity score as a percentage, facilitating easier interpretation.
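
The first steps listed above (tokenization, vectorization, mean vector) can be sketched as follows. This is a minimal illustration and not the code from this PR: the `TOY_EMBEDDINGS` table is a hypothetical stand-in for SpaCy's `en_core_web_md` vectors, whitespace splitting stands in for SpaCy's tokenizer, and skipping unknown tokens is an assumption of the sketch.

```python
import string

# Hypothetical 2-dimensional embeddings; the PR uses SpaCy's pre-trained vectors instead.
TOY_EMBEDDINGS = {
    "cats": [1.0, 0.0],
    "purr": [0.0, 1.0],
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in words if word]


def vectorize(tokens: list[str]) -> list[list[float]]:
    """Look up each token; unknown tokens are skipped (an assumption of this sketch)."""
    return [TOY_EMBEDDINGS[token] for token in tokens if token in TOY_EMBEDDINGS]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the vectors component by component."""
    if not vectors:
        raise ValueError("no vectors to average")
    return [sum(components) / len(vectors) for components in zip(*vectors)]


tokens = tokenize("Cats purr!")
text_vector = mean_vector(vectorize(tokens))
```

Here `tokens` is `['cats', 'purr']` and `text_vector` is `[0.5, 0.5]`, the component-wise average of the two toy vectors.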

Mathematical Foundation

Dot Product: Measures the Degree of Alignment between two Vectors.
Magnitude (Norm): Computes the length of a Vector.
Cosine Similarity Formula:
```
 Cosine Similarity = (Dot Product) / (Magnitude_1 * Magnitude_2)
```
where the result lies between -1 and 1: 1 indicates vectors pointing in the same direction, 0 indicates orthogonal vectors, and -1 indicates vectors pointing in opposite directions.
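
The formula can be written out directly. The sketch below uses plain Python lists in place of the NumPy arrays the PR works with, and the zero-vector guard is an assumption of this example rather than something the PR is known to do.

```python
import math


def dot_product(vector1: list[float], vector2: list[float]) -> float:
    """Sum of the element-wise products of two vectors."""
    return sum(a * b for a, b in zip(vector1, vector2))


def magnitude(vector: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot_product(vector, vector))


def cosine_similarity(vector1: list[float], vector2: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    denominator = magnitude(vector1) * magnitude(vector2)
    # Assumption: return 0.0 for a zero vector rather than dividing by zero.
    if denominator == 0.0:
        return 0.0
    return dot_product(vector1, vector2) / denominator


same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])
```

The three calls give scores of 1, 0, and -1 (up to floating-point error). Note that `[3.0, 4.0]` and `[6.0, 8.0]` score 1 even though they are not identical vectors, because only direction matters.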

Usage

The Cosine_Similarity class provides methods to Tokenize, Vectorize, and calculate the Cosine Similarity between two pieces of text. It includes:

Tokenize(text): Tokenizes the input text into lowercase tokens.
Vectorize(tokens): Converts tokens into vector representations.
Mean_Vector(vectors): Computes the average vector of a list of vectors.
Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
Magnitude(vector): Computes the magnitude of a vector.
Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
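
How the score is scaled to a percentage is not shown in this thread. A minimal sketch of that last step, assuming the score is simply multiplied by 100; the name `similarity_percentage`, the clamping, and the rounding are all assumptions of this example:

```python
def similarity_percentage(similarity: float) -> float:
    """Convert a cosine similarity score to a percentage rounded to two decimals."""
    # Floating-point error can push a score slightly outside [-1, 1], so clamp first.
    clamped = max(-1.0, min(1.0, similarity))
    return round(clamped * 100.0, 2)


identical = similarity_percentage(1.0000000000000002)
half = similarity_percentage(0.5)
opposite = similarity_percentage(-1.0)
```

Because the score ranges from -1 to 1, a plain scaling like this ranges from -100 to 100, so callers should expect negative percentages for vectors pointing in opposing directions.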

Error Handling

Robust Error Handling is implemented for all operations to ensure reliability. Any issues encountered during tokenization, vectorization, or similarity calculations are logged and raised appropriately.
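
The log-and-re-raise pattern described here can be sketched with the standard `logging` module. This is an illustration rather than the PR's code: catching only `TypeError`, using `logger.exception`, and re-raising with a bare `raise` are choices made for this example.

```python
import logging

logger = logging.getLogger(__name__)


def magnitude(vector: list[float]) -> float:
    """Return the Euclidean length of a vector, logging and re-raising on failure."""
    try:
        return sum(component * component for component in vector) ** 0.5
    except TypeError:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception("An error occurred during magnitude calculation")
        # A bare raise re-raises the active exception with its original traceback.
        raise
```

Calling `magnitude([3.0, 4.0])` returns `5.0`, while a non-numeric input such as `magnitude(["a"])` is logged and the `TypeError` still reaches the caller.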

Benefits

Provides an effective method for comparing textual content.
Leverages pre-trained embeddings for accurate and efficient similarity measurement.
Can be used in various applications including document similarity, search relevance, and recommendation systems.

algorithms-keeper

Click here to look at the relevant links ⬇️

🔗 Relevant Links

Repository:

Contributing guidelines

Project Euler solution guidelines

Python:

Formatted string literals (f-strings)

Type hints

doctest

unittest

pytest

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

@algorithms-keeper review to trigger the checks for only added pull request files

@algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

algorithms-keeper · 2024-09-03T08:43:48Z

machine_learning/cosine_similarity.py

+import numpy as np
+
+
+class Cosine_Similarity:


Class names should follow the CamelCase naming convention. Please update the following name accordingly: Cosine_Similarity

algorithms-keeper · 2024-09-03T08:43:48Z

machine_learning/cosine_similarity.py

+        """
+        self.nlp = spacy.load("en_core_web_md")
+
+    def Tokenize(self, text: str) -> list:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Tokenize

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Tokenize
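
A doctest of the kind requested here lives in the function's docstring. The sketch below is hypothetical: it uses the snake_case name the bot asks for and a plain whitespace tokenizer in place of the SpaCy-based one in the PR, so the expected outputs apply to this stand-in only.

```python
import doctest
import string


def tokenize(text: str) -> list[str]:
    """
    Split text into lowercase tokens with surrounding punctuation removed.

    >>> tokenize("Hello, World!")
    ['hello', 'world']
    >>> tokenize("")
    []
    """
    words = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in words if word]


if __name__ == "__main__":
    # Runs every docstring example in this module and reports any mismatch.
    doctest.testmod()
```

The examples can also be run with `python -m doctest` followed by the file name, without the `__main__` block.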

algorithms-keeper · 2024-09-03T08:43:48Z

machine_learning/cosine_similarity.py

+            logging.error("An error occurred during Tokenization: ", exc_info=e)
+            raise e
+
+    def Vectorize(self, tokens: list) -> list:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Vectorize

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Vectorize

algorithms-keeper · 2024-09-03T08:43:49Z

machine_learning/cosine_similarity.py

+            logging.error("An error occurred during Vectorization: ", exc_info=e)
+            raise e
+
+    def Mean_Vector(self, vectors: list) -> np.ndarray:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Mean_Vector

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Mean_Vector

algorithms-keeper · 2024-09-03T08:43:49Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def Dot_Product(self, vector1: np.ndarray, vector2: np.ndarray) -> float:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Dot_Product

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Dot_Product

algorithms-keeper · 2024-09-03T08:43:49Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def Magnitude(self, vector: np.ndarray) -> float:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Magnitude

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Magnitude

algorithms-keeper · 2024-09-03T08:43:49Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def Cosine_Similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Cosine_Similarity

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Cosine_Similarity

algorithms-keeper · 2024-09-03T08:43:49Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def Cosine_Similarity_Percentage(self, text1: str, text2: str) -> float:


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Cosine_Similarity_Percentage

As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Cosine_Similarity_Percentage

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+import numpy as np
+
+
+class cosine_similarity:


Class names should follow the CamelCase naming convention. Please update the following name accordingly: cosine_similarity

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+        """
+        self.nlp = spacy.load("en_core_web_md")
+
+    def tokenize(self, text: str) -> list:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function tokenize

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+            logging.error("An error occurred during Tokenization: ", exc_info=e)
+            raise e
+
+    def vectorize(self, tokens: list) -> list:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function vectorize

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+            logging.error("An error occurred during Vectorization: ", exc_info=e)
+            raise e
+
+    def mean_vector(self, vectors: list) -> np.ndarray:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function mean_vector

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def dot_product(self, vector1: np.ndarray, vector2: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function dot_product

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def magnitude(self, vector: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function magnitude

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def cosine_similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity

algorithms-keeper · 2024-09-03T08:48:17Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def cosine_similarity_percentage(self, text1: str, text2: str) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity_percentage

algorithms-keeper · 2024-09-03T08:58:28Z

machine_learning/cosine_similarity.py

+        """
+        self.nlp = spacy.load("en_core_web_md")
+
+    def tokenize(self, text: str) -> list:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function tokenize

algorithms-keeper · 2024-09-03T08:58:28Z

machine_learning/cosine_similarity.py

+            logging.error("An error occurred during Tokenization: ", exc_info=e)
+            raise e
+
+    def vectorize(self, tokens: list) -> list:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function vectorize

algorithms-keeper · 2024-09-03T08:58:28Z

machine_learning/cosine_similarity.py

+            logging.error("An error occurred during Vectorization: ", exc_info=e)
+            raise e
+
+    def mean_vector(self, vectors: list) -> np.ndarray:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function mean_vector

algorithms-keeper · 2024-09-03T08:58:28Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def dot_product(self, vector1: np.ndarray, vector2: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function dot_product

algorithms-keeper · 2024-09-03T08:58:28Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def magnitude(self, vector: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function magnitude

algorithms-keeper · 2024-09-03T08:58:28Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def cosine_similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity

algorithms-keeper · 2024-09-03T08:58:29Z

machine_learning/cosine_similarity.py

+            )
+            raise e
+
+    def cosine_similarity_percentage(self, text1: str, text2: str) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity_percentage

Arko-Sengupta and others added 4 commits September 3, 2024 13:34

Cosine Similarity Algorithm | Machine Learning

0af293b

[pre-commit.ci] auto fixes from pre-commit.com hooks

3a62339

for more information, see https://pre-commit.ci

Input Fixes

e8ec6df

Input Fixes

1458803

algorithms-keeper bot added the "require tests" label (Tests [doctest/unittest/pytest] are required) Sep 3, 2024

algorithms-keeper bot reviewed Sep 3, 2024

algorithms-keeper bot added the "awaiting reviews" label (This PR is ready to be reviewed) Sep 3, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

030ced3

for more information, see https://pre-commit.ci

algorithms-keeper bot added the "tests are failing" label (Do not merge until tests pass) Sep 3, 2024

Lower Case Fixes

768015c

algorithms-keeper bot reviewed Sep 3, 2024

pre-commit-ci bot and others added 3 commits September 3, 2024 08:48

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8deb03

for more information, see https://pre-commit.ci

Case Fixes

d597f45

Case Fixes

2479eef

algorithms-keeper bot reviewed Sep 3, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa91225

for more information, see https://pre-commit.ci

Arko-Sengupta closed this Sep 3, 2024

Cosine Similarity Algorithm | Machine Learning #11536
