Cosine Similarity Algorithm | Machine Learning #11537
Closed
Cosine Similarity Algorithm
Overview
Introduces a new implementation of the cosine similarity algorithm in the Cosine_Similarity class. Cosine similarity is a widely used metric in natural language processing and information retrieval for measuring the similarity between two texts based on their vector representations.
Key Features
SpaCy's pre-trained word embeddings to convert text into vectors.
SpaCy's embeddings.
Mean vector for a set of word vectors to represent the overall text.
Similarity score ranging from -1 to 1.
Mathematical Foundation
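Dot Product: Measures the degree of alignment between two vectors.
Magnitude (Norm): Computes the length of a vector.
Cosine Similarity Formula:
$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
where $A \cdot B$ is the dot product and $\|A\|$, $\|B\|$ are the magnitudes. The result lies between -1 and 1, with 1 indicating vectors that point in the same direction, 0 indicating orthogonal vectors, and -1 indicating vectors that point in opposite directions.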
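The three quantities above can be sketched in plain Python. This is a minimal illustration, not the code from the pull request: the lowercase function names and the explicit zero-vector check are assumptions made for the example.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def magnitude(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(dot_product(v, v))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    norm_product = magnitude(a) * magnitude(b)
    if norm_product == 0:
        # Assumption for this sketch: treat a zero vector as an error,
        # because the formula would divide by zero.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / norm_product


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))    # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0.0
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # opposite direction: -1.0
```

Note that [3, 4] and [6, 8] are not identical vectors, yet they score 1 because cosine similarity depends only on direction, not on length.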
Usage
The Cosine_Similarity class provides methods to tokenize, vectorize, and calculate the cosine similarity between two pieces of text. It includes:
Tokenize(text): Tokenizes the input text into lowercase tokens.
Vectorize(tokens): Converts tokens into vector representations.
Mean_Vector(vectors): Computes the average vector of a list of vectors.
Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
Magnitude(vector): Computes the magnitude of a vector.
Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
Error Handling
Error handling is implemented for all operations: any issue encountered during tokenization, vectorization, or similarity calculation is logged and then raised to the caller.
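A rough sketch of how the text-to-percentage pipeline and the log-then-raise pattern might fit together is shown below. Several things here are assumptions for illustration only: TOY_EMBEDDINGS is a hypothetical three-dimensional lookup table standing in for SpaCy's word vectors, whitespace splitting stands in for SpaCy's tokenizer, unknown tokens are simply skipped, and the percentage is taken to be the similarity multiplied by 100. The pull request's actual behavior may differ on each point.

```python
import logging
import math

logger = logging.getLogger("cosine_similarity_sketch")

# Hypothetical toy embeddings; the pull request uses SpaCy's vectors instead.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def tokenize(text):
    """Lowercase the text and split on whitespace (stand-in tokenizer)."""
    return text.lower().split()


def vectorize(tokens):
    """Look up each known token; unknown tokens are skipped (an assumption)."""
    return [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("no vectors to average")
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norms == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norms


def cosine_similarity_percentage(text1, text2):
    """Similarity of two texts as a percentage; logs and re-raises failures."""
    try:
        v1 = mean_vector(vectorize(tokenize(text1)))
        v2 = mean_vector(vectorize(tokenize(text2)))
        return cosine_similarity(v1, v2) * 100
    except Exception:
        # Record the traceback, then let the caller decide what to do.
        logger.exception("similarity calculation failed")
        raise


print(f"{cosine_similarity_percentage('cat', 'dog'):.1f}%")
```

Under the times-100 assumption, texts whose mean vectors point in opposing directions would yield a negative percentage, so a real implementation may want to document or rescale that range.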
Benefits