Cosine Similarity Algorithm

Arko-Sengupta · 2024-09-03T10:19:50Z

Overview

Introduces a new implementation of the cosine similarity algorithm in the Cosine_Similarity class. Cosine similarity is a widely used metric in Natural Language Processing and Information Retrieval that measures how similar two texts are by comparing their vector representations.

Key Features

Vector Representation: Uses spaCy's pre-trained word embeddings to convert text into vectors.
Tokenization: Breaks input text into lowercased tokens, excluding punctuation.
Vectorization: Converts each token into its corresponding vector using spaCy's embeddings.
Mean Vector Calculation: Averages a set of word vectors into a single vector that represents the whole text.
Cosine Similarity Calculation: Measures the cosine of the angle between two vectors, giving a similarity score from -1 to 1.
Cosine Similarity Percentage: Outputs the similarity score as a percentage for easier interpretation.
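
The tokenization, vectorization, and mean vector steps listed above can be sketched as follows. This is a minimal illustration and not the PR's code: the `TOY_EMBEDDINGS` table is a hypothetical stand-in for spaCy's pre-trained vectors, the lowercase function names are chosen for the sketch, and skipping tokens that have no embedding is an assumption.

```python
import string

# Hypothetical 3-dimensional embeddings standing in for spaCy's word vectors.
TOY_EMBEDDINGS: dict[str, list[float]] = {
    "cats": [1.0, 0.0, 0.0],
    "purr": [0.0, 1.0, 0.0],
    "dogs": [0.0, 0.0, 1.0],
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def vectorize(tokens: list[str]) -> list[list[float]]:
    """Look up each token; tokens without an embedding are skipped (assumption)."""
    return [TOY_EMBEDDINGS[token] for token in tokens if token in TOY_EMBEDDINGS]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the vectors component by component."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


tokens = tokenize("Cats purr!")
print(tokens)                          # ['cats', 'purr']
print(mean_vector(vectorize(tokens)))  # [0.5, 0.5, 0.0]
```

Averaging discards word order, so two texts containing the same words in a different order map to the same mean vector.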

Mathematical Foundation

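Dot Product: Measures how strongly two vectors align; it is the sum of the products of their corresponding components.
Magnitude (Norm): Computes the length of a vector, the square root of the vector's dot product with itself.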
Cosine Similarity Formula:
```
 Cosine Similarity = (Dot Product) / (Magnitude_1 * Magnitude_2)
```
where the result lies between -1 and 1: 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The score depends only on direction, so vectors of different lengths can still score 1.
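
The formula above can be checked with a few hand-picked vectors. The sketch below uses only the standard library; the function names and the choice to raise `ValueError` for mismatched lengths or zero vectors are assumptions made for illustration, not behavior confirmed by the PR.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of the products of corresponding components."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def magnitude(vector: list[float]) -> float:
    """Euclidean length: square root of the vector's dot product with itself."""
    return math.sqrt(dot_product(vector, vector))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    norm_product = magnitude(a) * magnitude(b)
    if norm_product == 0.0:
        # The angle to a zero vector is undefined, so no score exists.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / norm_product


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-4.0, 0.0]))  # -1.0 (opposite direction)
```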

Usage

The Cosine_Similarity class provides methods to tokenize, vectorize, and calculate the cosine similarity between two pieces of text. It includes:

Tokenize(text): Tokenizes the input text into lowercase tokens.
Vectorize(tokens): Converts tokens into vector representations.
Mean_Vector(vectors): Computes the average vector of a list of vectors.
Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
Magnitude(vector): Computes the magnitude of a vector.
Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
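
To show how these methods might chain together when two texts are compared, here is a compact end-to-end sketch. `CosineSimilaritySketch`, `text_vector`, `similarity_percentage`, and `toy_embed` are hypothetical names, not the PR's API; the embedding function is injected so the example runs without spaCy, and scaling the raw score by 100 to get the percentage is an assumption.

```python
import math
from collections.abc import Callable

Vector = list[float]


class CosineSimilaritySketch:
    """Hypothetical stand-in for the PR's Cosine_Similarity class."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        # `embed` maps one token to its vector; the PR uses spaCy for this step.
        self.embed = embed

    def text_vector(self, text: str) -> Vector:
        # Whitespace tokenization keeps the sketch short; punctuation is not removed.
        vectors = [self.embed(token) for token in text.lower().split()]
        if not vectors:
            raise ValueError("text has no tokens")
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def similarity_percentage(self, text1: str, text2: str) -> float:
        v1, v2 = self.text_vector(text1), self.text_vector(text2)
        dot = sum(x * y for x, y in zip(v1, v2))
        norms = math.hypot(*v1) * math.hypot(*v2)
        if norms == 0.0:
            raise ValueError("cosine similarity is undefined for a zero vector")
        # Assumption: the percentage is the raw score scaled by 100.
        return dot / norms * 100


def toy_embed(token: str) -> Vector:
    # Hypothetical 2-dimensional embedding: [token length, number of vowels].
    return [float(len(token)), float(sum(ch in "aeiou" for ch in token))]


sketch = CosineSimilaritySketch(toy_embed)
print(sketch.similarity_percentage("hello world", "hello world"))
```

Under this scaling a negative cosine score gives a negative percentage, so a caller that needs a 0 to 100 range would have to clamp or rescale the result.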

Error Handling

Every operation handles its own errors: any issue encountered during tokenization, vectorization, or similarity calculation is logged and then raised to the caller.
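
One common way to get this log-then-raise behavior in Python is sketched below. It assumes the standard `logging` module and a module-level logger; the PR's actual logging setup and exception types are not shown in this description, so treat the details as illustrative.

```python
import logging
import math

logger = logging.getLogger(__name__)


def magnitude(vector: list[float]) -> float:
    """Return the Euclidean length of a vector, logging any failure."""
    try:
        return math.sqrt(sum(x * x for x in vector))
    except TypeError:
        # logger.exception records the message together with the traceback;
        # the bare `raise` then re-raises the original exception unchanged.
        logger.exception("magnitude failed for input %r", vector)
        raise


print(magnitude([3.0, 4.0]))  # 5.0
```

Re-raising with a bare `raise` keeps the original traceback, so the caller sees where the failure started while the log keeps a record of the offending input.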

Benefits

Provides an effective method for comparing textual content.
Leverages pre-trained embeddings for accurate and efficient similarity measurement.
Can be used in various applications including document similarity, search relevance, and recommendation systems.

for more information, see https://pre-commit.ci

requirements.txt

algorithms-keeper

🔗 Relevant Links

Repository:

Contributing guidelines

Project Euler solution guidelines

Python:

Formatted string literals (f-strings)

Type hints

doctest

unittest

pytest

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

@algorithms-keeper review to trigger the checks for only added pull request files

@algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

machine_learning/cosine_similarity.py

QuantumNovice · 2024-09-13T05:41:33Z

The algorithm in question is already in the repo at

Python/machine_learning/similarity_search.py

Line 143 in 729c1f9

def cosine_similarity(input_a: np.ndarray, input_b: np.ndarray) -> float:

Please rename this to document similarity, and double-check that it doesn't already exist in the repository.

Arko-Sengupta · 2024-09-13T06:47:31Z

@QuantumNovice Resolved. Thanks for raising this!

Arko-Sengupta and others added 11 commits September 3, 2024 13:34

Cosine Similarity Algorithm | Machine Learning

0af293b

[pre-commit.ci] auto fixes from pre-commit.com hooks

3a62339

for more information, see https://pre-commit.ci

Input Fixes

e8ec6df

Input Fixes

1458803

[pre-commit.ci] auto fixes from pre-commit.com hooks

030ced3

Lower Case Fixes

768015c

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8deb03

Case Fixes

d597f45

Case Fixes

2479eef

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa91225

spaCy Fixes

1b87ff9

pahalsrivastava reviewed Sep 3, 2024

requirements.txt (outdated, resolved)

Arko-Sengupta and others added 4 commits September 4, 2024 15:57

Fixed Model Dependency

2fe680f

[pre-commit.ci] auto fixes from pre-commit.com hooks

0336893

Fixed Model Dependency

522edab

Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit…

135d9ea

…hms-Python-Open-Source

algorithms-keeper bot added the "require tests" (Tests [doctest/unittest/pytest] are required) and "require type hints" (https://docs.python.org/3/library/typing.html) labels Sep 4, 2024

algorithms-keeper bot reviewed Sep 4, 2024

algorithms-keeper bot added the "awaiting reviews" (This PR is ready to be reviewed) label Sep 4, 2024

Arko-Sengupta requested a review from pahalsrivastava September 4, 2024 10:40

pahalsrivastava approved these changes Sep 5, 2024

Arko-Sengupta requested a review from pahalsrivastava September 11, 2024 07:02

Arko-Sengupta and others added 2 commits September 11, 2024 13:07

Resolved All Doctests

4cbeb62

[pre-commit.ci] auto fixes from pre-commit.com hooks

70a6de4

algorithms-keeper bot added the "tests are failing" (Do not merge until tests pass) label Sep 11, 2024

Arko-Sengupta and others added 4 commits September 11, 2024 13:29

Resolved all DocTests

c892be4

Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit…

89aef3c

…hms-Python-Open-Source

[pre-commit.ci] auto fixes from pre-commit.com hooks

26c7117

Resolved All Dependencies

547e538

Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit…

7158e47

…hms-Python-Open-Source

algorithms-keeper bot removed the "require tests" (Tests [doctest/unittest/pytest] are required) and "require type hints" (https://docs.python.org/3/library/typing.html) labels Sep 11, 2024

Arko-Sengupta and others added 8 commits September 11, 2024 13:45

Resolved Dependency in DocTest

b1738d9

Resolved Dependency from All Methods

4d94aaf

Loaded Package at a Time

e0f24f2

Cleared All Dependencies

3a3f30c

[pre-commit.ci] auto fixes from pre-commit.com hooks

147bcb2

Cleared All Dependencies

d320b99

Resolved Package OS Error

2aa3608

[pre-commit.ci] auto fixes from pre-commit.com hooks

8c15055

algorithms-keeper bot removed the "tests are failing" (Do not merge until tests pass) label Sep 11, 2024

Arko-Sengupta and others added 3 commits September 11, 2024 14:44

Merge branch 'TheAlgorithms:master' into master

90c4446

Jaccard Similarity | Machine Learning

cc4258d

[pre-commit.ci] auto fixes from pre-commit.com hooks

6ebe310

algorithms-keeper bot added the "tests are failing" (Do not merge until tests pass) label Sep 11, 2024

Correct Seperate Algo Conflict

3851df0

algorithms-keeper bot removed the "tests are failing" (Do not merge until tests pass) label Sep 11, 2024

Resolved Rename Issue

3435d80

Arko-Sengupta closed this Sep 13, 2024

Arko-Sengupta reopened this Sep 13, 2024

algorithms-keeper bot added the "enhancement" (This PR modified some existing files) label Sep 13, 2024

Arko-Sengupta closed this by deleting the head repository Sep 25, 2024

pahalsrivastava approved these changes Nov 25, 2024

Cosine Similarity Algorithm | Machine Learning #11539
