Cosine Similarity Algorithm | Machine Learning #11536
Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.
algorithms-keeper commands and options
algorithms-keeper actions can be triggered by commenting on this PR:
@algorithms-keeper review to trigger the checks for only added pull request files
@algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.
NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.
import numpy as np


class Cosine_Similarity:
Class names should follow the CamelCase naming convention. Please update the following name accordingly: Cosine_Similarity
        """
        self.nlp = spacy.load("en_core_web_md")

    def Tokenize(self, text: str) -> list:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Tokenize
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Tokenize
            logging.error("An error occurred during Tokenization: ", exc_info=e)
            raise e

    def Vectorize(self, tokens: list) -> list:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Vectorize
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Vectorize
            logging.error("An error occurred during Vectorization: ", exc_info=e)
            raise e

    def Mean_Vector(self, vectors: list) -> np.ndarray:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Mean_Vector
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Mean_Vector
            )
            raise e

    def Dot_Product(self, vector1: np.ndarray, vector2: np.ndarray) -> float:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Dot_Product
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Dot_Product
            )
            raise e

    def Magnitude(self, vector: np.ndarray) -> float:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Magnitude
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Magnitude
            )
            raise e

    def Cosine_Similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Cosine_Similarity
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Cosine_Similarity
            )
            raise e

    def Cosine_Similarity_Percentage(self, text1: str, text2: str) -> float:
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: Cosine_Similarity_Percentage
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function Cosine_Similarity_Percentage
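For reference, here is a minimal sketch of what the two requested changes (CamelCase class name, snake_case methods with doctests) could look like for two of the flagged methods. The class name CosineSimilarity is a hypothetical rename, and the sketch assumes plain Python lists in place of the PR's np.ndarray arguments so that it runs without NumPy or a SpaCy model; the PR's actual method bodies are not shown in this thread.

```python
import math


class CosineSimilarity:
    """CamelCase class name (hypothetical rename of Cosine_Similarity)."""

    def dot_product(self, vector1: list[float], vector2: list[float]) -> float:
        """
        Return the dot product of two equal-length vectors.

        >>> CosineSimilarity().dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
        32.0
        """
        # Assumption: plain lists stand in for the PR's np.ndarray inputs.
        return sum(a * b for a, b in zip(vector1, vector2))

    def magnitude(self, vector: list[float]) -> float:
        """
        Return the Euclidean length of a vector.

        >>> CosineSimilarity().magnitude([3.0, 4.0])
        5.0
        """
        return math.sqrt(sum(a * a for a in vector))


if __name__ == "__main__":
    import doctest

    # Runs every ">>>" example in the docstrings above and reports mismatches.
    doctest.testmod()
```

Running the file directly executes the doctests; methods that depend on the loaded SpaCy model would need the model available when the doctests run.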
for more information, see https://pre-commit.ci
Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.
import numpy as np


class cosine_similarity:
Class names should follow the CamelCase naming convention. Please update the following name accordingly: cosine_similarity
        """
        self.nlp = spacy.load("en_core_web_md")

    def tokenize(self, text: str) -> list:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function tokenize
            logging.error("An error occurred during Tokenization: ", exc_info=e)
            raise e

    def vectorize(self, tokens: list) -> list:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function vectorize
            logging.error("An error occurred during Vectorization: ", exc_info=e)
            raise e

    def mean_vector(self, vectors: list) -> np.ndarray:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function mean_vector
            )
            raise e

    def dot_product(self, vector1: np.ndarray, vector2: np.ndarray) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function dot_product
            )
            raise e

    def magnitude(self, vector: np.ndarray) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function magnitude
            )
            raise e

    def cosine_similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity
            )
            raise e

    def cosine_similarity_percentage(self, text1: str, text2: str) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity_percentage
for more information, see https://pre-commit.ci
Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.
""" | ||
        self.nlp = spacy.load("en_core_web_md")

    def tokenize(self, text: str) -> list:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function tokenize
            logging.error("An error occurred during Tokenization: ", exc_info=e)
            raise e

    def vectorize(self, tokens: list) -> list:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function vectorize
            logging.error("An error occurred during Vectorization: ", exc_info=e)
            raise e

    def mean_vector(self, vectors: list) -> np.ndarray:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function mean_vector
            )
            raise e

    def dot_product(self, vector1: np.ndarray, vector2: np.ndarray) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function dot_product
            )
            raise e

    def magnitude(self, vector: np.ndarray) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function magnitude
            )
            raise e

    def cosine_similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity
            )
            raise e

    def cosine_similarity_percentage(self, text1: str, text2: str) -> float:
As there is no test file in this pull request nor any test function or class in the file machine_learning/cosine_similarity.py, please provide doctest for the function cosine_similarity_percentage
for more information, see https://pre-commit.ci
Cosine Similarity Algorithm
Overview
Introduces a new implementation of the Cosine Similarity algorithm in the Cosine_Similarity class. Cosine similarity is a widely used metric in Natural Language Processing and Information Retrieval for measuring the similarity between two texts based on their vector representations.
Key Features
SpaCy's pre-trained Word Embeddings to convert text into Vectors.
SpaCy's embeddings.
Mean Vector for a set of Word Vectors to represent the overall text.
Similarity Score ranging from -1 to 1.
Mathematical Foundation
Dot Product: Measures the Degree of Alignment between two Vectors.
Magnitude (Norm): Computes the length of a Vector.
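Cosine Similarity Formula:
$$\cos\theta = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}$$
where the result is normalized to lie between -1 and 1, with 1 indicating vectors that point in the same direction, 0 indicating orthogonal vectors, and -1 indicating vectors that point in opposite directions.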
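The formula can be checked numerically on small vectors. The following is a minimal sketch, assuming plain Python lists rather than the PR's np.ndarray inputs and a standalone function rather than the class method; the zero-vector check is an addition of this sketch, not something the PR description mentions.

```python
import math


def cosine_similarity(vector1: list[float], vector2: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption of this sketch: reject zero vectors instead of dividing by zero.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Same direction (one vector is a positive multiple of the other): close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: exactly 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last few digits, so comparisons are better done with math.isclose than with ==.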
Usage
The Cosine_Similarity class provides methods to Tokenize, Vectorize, and calculate the Cosine Similarity between two pieces of text. It includes:
Tokenize(text): Tokenizes the input text into lowercase tokens.
Vectorize(tokens): Converts tokens into vector representations.
Mean_Vector(vectors): Computes the average vector of a list of vectors.
Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
Magnitude(vector): Computes the magnitude of a vector.
Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
Error Handling
Error handling is implemented for all operations: any exception raised during tokenization, vectorization, or similarity calculation is logged with logging.error and then re-raised to the caller.
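The log-and-re-raise pattern, combined with the mean-vector and percentage steps listed under Usage, can be sketched as follows. Everything here is an assumption made for illustration: TOY_EMBEDDINGS is a hypothetical stand-in for SpaCy's word vectors, tokenization is a plain lowercase split, the percentage is taken to be the cosine similarity multiplied by 100, and a bare raise replaces the PR's raise e.

```python
import logging
import math

# Hypothetical two-dimensional embeddings; the PR uses SpaCy's en_core_web_md vectors.
TOY_EMBEDDINGS: dict[str, list[float]] = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average a non-empty list of equal-length vectors component-wise."""
    try:
        if not vectors:
            raise ValueError("no vectors to average")
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    except Exception as e:
        logging.error("An error occurred while computing the mean vector: ", exc_info=e)
        raise  # re-raise so the caller still sees the original exception


def cosine_similarity_percentage(text1: str, text2: str) -> float:
    """Return the cosine similarity of two texts, scaled to a percentage."""
    try:
        means = []
        for text in (text1, text2):
            tokens = text.lower().split()  # assumption: whitespace tokenization
            means.append(mean_vector([TOY_EMBEDDINGS[t] for t in tokens]))
        dot = sum(a * b for a, b in zip(*means))
        norms = math.hypot(*means[0]) * math.hypot(*means[1])
        return dot / norms * 100
    except Exception as e:
        logging.error("An error occurred during similarity calculation: ", exc_info=e)
        raise
```

With these toy vectors, cosine_similarity_percentage("cat", "cat") gives 100.0, while a word missing from the table, such as "unicorn", logs the error and surfaces a KeyError to the caller.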
Benefits