Cosine Similarity Algorithm | Machine Learning #11539
for more information, see https://pre-commit.ci
Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.
algorithms-keeper commands and options
algorithms-keeper actions can be triggered by commenting on this PR:
- @algorithms-keeper review: to trigger the checks for only added pull request files
- @algorithms-keeper review-all: to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.
NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.
The algorithm in question is already in the repo at Python/machine_learning/similarity_search.py Line 143 in 729c1f9
Please rename to document similarity and double-check that it doesn't already exist in the repository.
@QuantumNovice Resolved. Thanks for raising!
Cosine Similarity Algorithm
Overview
Introduces a new implementation of the cosine similarity algorithm in the Cosine_Similarity class. Cosine similarity is a widely used metric in natural language processing and information retrieval to measure the similarity between two texts based on their vector representations.
Key Features
- spaCy's pre-trained word embeddings to convert text into vectors.
- […] spaCy's embeddings.
- Mean vector for a set of word vectors to represent the overall text.
- Similarity score ranging from -1 to 1.
Mathematical Foundation
Dot Product: Measures the Degree of Alignment between two Vectors.
Magnitude (Norm): Computes the length of a Vector.
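Cosine Similarity Formula:
$$\text{cosine similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$
where the result lies between -1 and 1, with 1 indicating vectors that point in the same direction, 0 indicating orthogonal vectors, and -1 indicating vectors that point in opposite directions.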
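The three pieces above (dot product, magnitude, and their ratio) can be sketched in plain Python. This is a minimal illustration, not the code from the pull request: the function names and the choice to raise `ValueError` for mismatched lengths or zero vectors are assumptions made for the example.

```python
import math
from collections.abc import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def magnitude(v: Sequence[float]) -> float:
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(dot_product(v, v))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    norm_a, norm_b = magnitude(a), magnitude(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector has no direction,
        # so the similarity is treated as undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)
```

Orthogonal vectors such as `[1, 0]` and `[0, 1]` give 0, and vectors pointing in opposite directions such as `[1, 0]` and `[-1, 0]` give -1.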
Usage
The Cosine_Similarity class provides methods to tokenize, vectorize, and calculate the cosine similarity between two pieces of text. It includes:
- Tokenize(text): Tokenizes the input text into lowercase tokens.
- Vectorize(tokens): Converts tokens into vector representations.
- Mean_Vector(vectors): Computes the average vector of a list of vectors.
- Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
- Magnitude(vector): Computes the magnitude of a vector.
- Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
- Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
Error Handling
Robust error handling is implemented for all operations to ensure reliability. Any issues encountered during tokenization, vectorization, or similarity calculations are logged and raised appropriately.
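The text-to-percentage flow and the log-then-raise behaviour described here might look roughly like the sketch below. It is not the pull request's class: the function names are lowercase stand-ins, `TOY_EMBEDDINGS` is a hypothetical lookup table used in place of spaCy's pre-trained vectors so the example runs without a model download, and the percentage is assumed to be the cosine similarity multiplied by 100.

```python
import logging
import math

logger = logging.getLogger("cosine_similarity_sketch")

# Hypothetical toy embeddings standing in for spaCy's pre-trained vectors.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.5],
    "dog": [0.9, 0.1, 0.4],
    "car": [0.0, 1.0, 0.2],
}


def tokenize(text: str) -> list[str]:
    """Split on whitespace and lowercase each token."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return text.lower().split()


def vectorize(tokens: list[str]) -> list[list[float]]:
    """Look up each known token; unknown tokens are skipped."""
    return [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the vectors component by component."""
    if not vectors:
        raise ValueError("no vectors to average")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norms == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norms


def cosine_similarity_percentage(text1: str, text2: str) -> float:
    """Similarity of two texts as a percentage (assumed: cosine * 100)."""
    try:
        v1 = mean_vector(vectorize(tokenize(text1)))
        v2 = mean_vector(vectorize(tokenize(text2)))
        return round(cosine_similarity(v1, v2) * 100, 2)
    except (TypeError, ValueError):
        # Log with traceback, then re-raise so the caller still sees the error.
        logger.exception("Could not compute similarity for the given texts")
        raise
```

With these toy vectors, `cosine_similarity_percentage("cat", "dog")` comes out higher than `cosine_similarity_percentage("cat", "car")`, and a text containing no known tokens is logged and raises `ValueError`.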
Benefits