Cosine Similarity Algorithm

Arko-Sengupta · 2024-09-03T09:48:02Z

Overview

This pull request introduces a new implementation of the Cosine Similarity algorithm in the Cosine_Similarity class. Cosine Similarity is a widely used metric in Natural Language Processing and Information Retrieval that measures the similarity between two texts based on their vector representations.

Key Features

Vector Representation: Utilizes SpaCy's pre-trained Word Embeddings to convert text into Vectors.
Tokenization: Breaks down input text into lowercased tokens, excluding punctuation.
Vectorization: Converts tokens into their corresponding vectors using SpaCy's embeddings.
Mean Vector Calculation: Computes the Mean Vector for a set of Word Vectors to represent the overall text.
Cosine Similarity Calculation: Measures the cosine of the angle between two vectors, providing a Similarity Score ranging from -1 to 1.
Cosine Similarity Percentage: Outputs the similarity score as a percentage, facilitating easier interpretation.
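
The tokenization, vectorization, and mean-vector steps listed above can be sketched without SpaCy. This is a minimal illustration, not the code from this pull request: the regex tokenizer, the `TOY_EMBEDDINGS` table, and the lowercase function names are assumptions standing in for SpaCy's tokenizer and pre-trained embeddings.

```python
import re

# Hypothetical 3-dimensional embeddings standing in for SpaCy's word vectors.
TOY_EMBEDDINGS = {
    "cats": [1.0, 0.0, 2.0],
    "chase": [0.0, 1.0, 1.0],
    "mice": [2.0, 2.0, 0.0],
}


def tokenize(text):
    """Lowercase the text and keep runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def vectorize(tokens, embeddings=TOY_EMBEDDINGS):
    """Look up each token, skipping tokens that have no embedding."""
    return [embeddings[token] for token in tokens if token in embeddings]


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    return [sum(components) / len(vectors) for components in zip(*vectors)]


tokens = tokenize("Cats chase mice!")
print(tokens)                          # ['cats', 'chase', 'mice']
print(mean_vector(vectorize(tokens)))  # [1.0, 1.0, 1.0]
```

Averaging discards word order, so two texts with the same words in a different order map to the same mean vector.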

Mathematical Foundation

Dot Product: Measures the Degree of Alignment between two Vectors.
Magnitude (Norm): Computes the length of a Vector.
Cosine Similarity Formula:
```
 Cosine Similarity = (Dot Product) / (Magnitude_1 * Magnitude_2)
```
where the result lies between -1 and 1: 1 indicates vectors pointing in the same direction, 0 indicates orthogonal vectors, and -1 indicates vectors pointing in exactly opposite directions.
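
As a rough sketch of the formula above in plain Python (the function names are illustrative and are not taken from the class in this pull request), the three quantities could be computed as follows. The zero-vector check is an added assumption, since the formula is undefined when either magnitude is 0.

```python
import math


def dot_product(vector1, vector2):
    """Sum of the pairwise products of two equal-length vectors."""
    if len(vector1) != len(vector2):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(vector1, vector2))


def magnitude(vector):
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(sum(a * a for a in vector))


def cosine_similarity(vector1, vector2):
    """Dot product divided by the product of the two magnitudes."""
    norm1, norm2 = magnitude(vector1), magnitude(vector2)
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(vector1, vector2) / (norm1 * norm2)


print(cosine_similarity([3, 4], [6, 8]))    # same direction: 1.0
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal: 0.0
print(cosine_similarity([3, 4], [-3, -4]))  # opposite direction: -1.0
```

Note that `[3, 4]` and `[6, 8]` score 1.0 even though they are not identical: cosine similarity depends only on direction, not on length.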

Usage

The Cosine_Similarity class provides methods to Tokenize, Vectorize, and calculate the Cosine Similarity between two pieces of text. It includes:

Tokenize(text): Tokenizes the input text into lowercase tokens.
Vectorize(tokens): Converts tokens into vector representations.
Mean_Vector(vectors): Computes the average vector of a list of vectors.
Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
Magnitude(vector): Computes the magnitude of a vector.
Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
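
To show how these methods might compose into a text-to-text percentage, here is a compact, hedged sketch. It assumes the percentage is simply the similarity score multiplied by 100, which the description does not state, and it takes a hypothetical `embed` callable (token in, vector or `None` out) in place of SpaCy's embeddings.

```python
import math
import re


def text_vector(text, embed):
    """Tokenize, look up each token with `embed`, and average the results."""
    tokens = re.findall(r"\w+", text.lower())
    vectors = [v for v in (embed(token) for token in tokens) if v is not None]
    if not vectors:
        raise ValueError("no token in the text has an embedding")
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def cosine_similarity_percentage(text1, text2, embed):
    """Cosine similarity of the two mean vectors, scaled by 100 (assumed mapping)."""
    v1, v2 = text_vector(text1, embed), text_vector(text2, embed)
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    if norms == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 100.0 * dot / norms


# Hypothetical 2-dimensional embeddings, for illustration only.
toy = {"good": [1.0, 0.0], "great": [1.0, 0.0], "bad": [-1.0, 0.0]}
print(cosine_similarity_percentage("Good!", "great", toy.get))  # 100.0
print(cosine_similarity_percentage("good", "bad", toy.get))     # -100.0
```

Under this mapping the percentage ranges from -100 to 100, mirroring the -1 to 1 range of the underlying score.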

Error Handling

Every operation is wrapped in error handling. Any issue encountered during tokenization, vectorization, or similarity calculation is logged and then raised to the caller.
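
One way such log-and-raise handling could look is sketched below. The logger name, the function name, and the specific checks are assumptions made for illustration; the description does not say which exceptions the class catches.

```python
import logging
import math

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("cosine_similarity_sketch")


def safe_cosine_similarity(vector1, vector2):
    """Compute cosine similarity, logging any failure before re-raising it."""
    try:
        if len(vector1) != len(vector2):
            raise ValueError("vectors must have the same length")
        dot = sum(a * b for a, b in zip(vector1, vector2))
        norm = math.sqrt(sum(a * a for a in vector1)) * math.sqrt(
            sum(b * b for b in vector2)
        )
        if norm == 0:
            raise ValueError("cosine similarity is undefined for a zero vector")
        return dot / norm
    except (TypeError, ValueError):
        # logger.exception records the message together with the traceback.
        logger.exception("cosine similarity calculation failed")
        raise
```

Re-raising after logging keeps the failure visible to the caller instead of silently returning a placeholder score.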

Benefits

Provides an effective method for comparing textual content.
Leverages pre-trained embeddings for accurate and efficient similarity measurement.
Can be used in various applications including document similarity, search relevance, and recommendation systems.


algorithms-keeper · 2024-09-03T09:48:06Z

Closing this pull request as invalid

@Arko-Sengupta, this pull request is being closed as none of the checkboxes have been marked. It is important that you go through the checklist and mark the ones relevant to this pull request. Please read the Contributing guidelines.

If you're facing any problem on how to mark a checkbox, please read the following instructions:

Read the points one at a time and decide whether each is relevant to the pull request.
If it is, mark it by putting an x between the square brackets, like so: [x]

NOTE: Only [x] is supported so if you have put any other letter or symbol between the brackets, that will be marked as invalid. If that is the case then please open a new pull request with the appropriate changes.

Arko-Sengupta and others added 10 commits September 3, 2024 13:34

Cosine Similarity Algorithm | Machine Learning

0af293b

[pre-commit.ci] auto fixes from pre-commit.com hooks

3a62339

for more information, see https://pre-commit.ci

Input Fixes

e8ec6df

Input Fixes

1458803

[pre-commit.ci] auto fixes from pre-commit.com hooks

030ced3

for more information, see https://pre-commit.ci

Lower Case Fixes

768015c

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8deb03

for more information, see https://pre-commit.ci

Case Fixes

d597f45

Case Fixes

2479eef

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa91225

for more information, see https://pre-commit.ci

algorithms-keeper bot added the invalid label Sep 3, 2024

algorithms-keeper bot closed this Sep 3, 2024

algorithms-keeper bot added the awaiting reviews label (This PR is ready to be reviewed) Sep 3, 2024

Cosine Similarity Algorithm | Machine Learning #11537
