Cosine Similarity Algorithm | Machine Learning #11539

Closed
wants to merge 35 commits into from
Arko-Sengupta

Cosine Similarity Algorithm

Overview

Introduces a new implementation of the cosine similarity algorithm in the Cosine_Similarity class. Cosine similarity is a widely used metric in natural language processing and information retrieval that measures how similar two texts are, based on their vector representations.

Key Features

  • Vector representation: uses spaCy's pre-trained word embeddings to convert text into vectors.
  • Tokenization: breaks input text into lowercased tokens, excluding punctuation.
  • Vectorization: converts tokens into their corresponding vectors using spaCy's embeddings.
  • Mean vector calculation: computes the mean of a set of word vectors to represent the overall text.
  • Cosine similarity calculation: measures the cosine of the angle between two vectors, giving a similarity score between -1 and 1.
  • Cosine similarity percentage: outputs the similarity score as a percentage for easier interpretation.
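
The text-to-vector steps listed above can be sketched without spaCy. This is a minimal illustration, not the code from this PR: the `TOY_EMBEDDINGS` table and the lowercase function names are hypothetical stand-ins for spaCy's pre-trained vectors, and the regex tokenizer only approximates what spaCy's tokenizer does.

```python
import re

# Hypothetical 3-dimensional embeddings standing in for spaCy's pre-trained vectors.
TOY_EMBEDDINGS = {
    "cats": [1.0, 0.0, 0.0],
    "purr": [0.0, 1.0, 0.0],
    "dogs": [0.0, 0.0, 1.0],
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only word-like tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vectorize(tokens: list[str]) -> list[list[float]]:
    """Look up each token's vector, skipping tokens that have no embedding."""
    return [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the vectors component-wise to get one vector for the whole text."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


tokens = tokenize("Cats purr!")
text_vector = mean_vector(vectorize(tokens))
```

Here `tokens` is `["cats", "purr"]` and `text_vector` is the average of the two toy vectors. Skipping out-of-vocabulary tokens is a design choice made for this sketch; the PR description does not say how unknown tokens are handled.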

Mathematical Foundation

  • Dot product: measures the degree of alignment between two vectors.

  • Magnitude (norm): the length of a vector.

  • Cosine similarity formula:

     Cosine Similarity = (Dot Product) / (Magnitude_1 * Magnitude_2)
    

    The result always lies between -1 and 1: 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The formula is undefined when either vector has zero magnitude.
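
The formula can be checked on small vectors. The sketch below uses only the standard library; the function names are illustrative rather than the PR's method names, and raising `ValueError` for a zero-magnitude vector is an assumption about how that edge case might be handled.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(v: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot_product(v, v))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    denominator = magnitude(a) * magnitude(b)
    if denominator == 0:
        # The cosine is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / denominator


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` are not identical, yet their similarity is (up to floating-point rounding) 1: the measure depends only on direction, not on length.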

Usage

The Cosine_Similarity class provides methods to tokenize and vectorize text and to calculate the cosine similarity between two pieces of text. It includes:

  • Tokenize(text): Tokenizes the input text into lowercase tokens.
  • Vectorize(tokens): Converts tokens into vector representations.
  • Mean_Vector(vectors): Computes the average vector of a list of vectors.
  • Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
  • Magnitude(vector): Computes the magnitude of a vector.
  • Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
  • Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.

Error Handling

Every operation has error handling: any issue encountered during tokenization, vectorization, or similarity calculation is logged and then raised to the caller.
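
The PR description does not show how the logging and raising is done. One common pattern, given here only as an assumed sketch with a hypothetical `magnitude` function, is to log the exception with its traceback and then re-raise it unchanged:

```python
import logging
import math

logger = logging.getLogger(__name__)


def magnitude(vector: list[float]) -> float:
    """Return the Euclidean length of a vector, logging and re-raising any failure."""
    try:
        return math.sqrt(sum(x * x for x in vector))
    except Exception:
        # Record the failure with its traceback, then let the caller see the original error.
        logger.exception("magnitude calculation failed")
        raise
```

A bare `raise` keeps the original exception type and traceback, so callers can still catch a specific error such as `TypeError` for non-numeric input.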

Benefits

  • Provides an effective method for comparing textual content.
  • Leverages pre-trained embeddings for accurate and efficient similarity measurement.
  • Can be used in various applications including document similarity, search relevance, and recommendation systems.

@algorithms-keeper algorithms-keeper bot added require tests Tests [doctest/unittest/pytest] are required require type hints https://docs.python.org/3/library/typing.html labels Sep 4, 2024

@algorithms-keeper algorithms-keeper bot left a comment

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

@algorithms-keeper algorithms-keeper bot added the awaiting reviews This PR is ready to be reviewed label Sep 4, 2024
@algorithms-keeper algorithms-keeper bot added the tests are failing Do not merge until tests pass label Sep 11, 2024
@algorithms-keeper algorithms-keeper bot removed require tests Tests [doctest/unittest/pytest] are required require type hints https://docs.python.org/3/library/typing.html labels Sep 11, 2024
@algorithms-keeper algorithms-keeper bot removed the tests are failing Do not merge until tests pass label Sep 11, 2024
@algorithms-keeper algorithms-keeper bot added the tests are failing Do not merge until tests pass label Sep 11, 2024
@algorithms-keeper algorithms-keeper bot removed the tests are failing Do not merge until tests pass label Sep 11, 2024
@QuantumNovice
Contributor

The algorithm in question is already in the repo at

def cosine_similarity(input_a: np.ndarray, input_b: np.ndarray) -> float:

Please rename this to document similarity, and double-check that it doesn't already exist in the repository.

@Arko-Sengupta
Author

Arko-Sengupta commented Sep 13, 2024

@QuantumNovice Resolved. Thanks for Raising!

@Arko-Sengupta Arko-Sengupta reopened this Sep 13, 2024
@algorithms-keeper algorithms-keeper bot added the enhancement This PR modified some existing files label Sep 13, 2024
@Arko-Sengupta Arko-Sengupta closed this by deleting the head repository Sep 25, 2024
Labels
awaiting reviews This PR is ready to be reviewed enhancement This PR modified some existing files
3 participants