Jaccard Similarity Algorithm | Machine Learning #11559

Closed
wants to merge 35 commits into from
Arko-Sengupta

Jaccard Similarity Algorithm

Overview

This pull request introduces a new implementation of the Jaccard similarity algorithm in the JaccardSimilarity class. Jaccard similarity is a classical metric used in natural language processing and information retrieval. It measures how similar two sets are as the ratio of the size of their intersection to the size of their union.

Key Features

  • Set Representation: Converts input text into sets of words for comparison.
  • Tokenization: Splits the input strings into words based on whitespace.
  • Intersection Calculation: Determines the common elements between two sets of words.
  • Union Calculation: Computes the total unique elements in both sets combined.
  • Jaccard Similarity Calculation: Computes the ratio of the intersection size to the union size, giving a similarity score from 0% to 100%.
  • Percentage Output: Provides the similarity score as a percentage for easier interpretation.
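The features above can be sketched as a short pipeline. This is a minimal illustration, not the code from this pull request: the function names `tokenize` and `jaccard_percentage` are hypothetical, and returning 0.0 when neither text contains a word is an assumption made here to avoid dividing by zero.

```python
def tokenize(text: str) -> set[str]:
    """Split on whitespace and keep the unique words."""
    return set(text.split())


def jaccard_percentage(text_a: str, text_b: str) -> float:
    """Return the Jaccard similarity of two texts as a percentage."""
    words_a = tokenize(text_a)
    words_b = tokenize(text_b)

    intersection = words_a & words_b  # words present in both texts
    union = words_a | words_b  # unique words across both texts

    if not union:
        # Neither text contains a word, so the ratio is undefined.
        # Assumption for this sketch: report 0.0 instead of dividing by zero.
        return 0.0

    return 100 * len(intersection) / len(union)


print(jaccard_percentage("the cat sat", "the cat ran"))  # 50.0
```

Note that whitespace tokenization is case-sensitive and keeps punctuation attached to words, so "Cat" and "cat." would count as different elements in this sketch.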

Mathematical Foundation

  • Intersection: The set of elements common to both sets; the formula uses its size.

  • Union: The set of all unique elements across both sets; the formula uses its size.

  • Jaccard Similarity Formula:

    Jaccard Similarity = (Size of Intersection) / (Size of Union)

The ratio lies between 0 and 1. Multiplying it by 100 expresses the result as a percentage, with 100% indicating identical sets and 0% indicating no overlap.
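As a worked example of the formula, take two illustrative word sets (chosen here for demonstration, not taken from the pull request):

```latex
\begin{aligned}
A &= \{\text{data}, \text{science}, \text{is}, \text{fun}\}, \qquad
B = \{\text{data}, \text{science}, \text{is}, \text{hard}\} \\
|A \cap B| &= 3, \qquad |A \cup B| = 5 \\
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{3}{5} = 0.6 = 60\%
\end{aligned}
```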

Usage

The JaccardSimilarity class provides one method for calculating the similarity between two strings:

  • jaccard_similarity(str1, str2): Computes the Jaccard similarity between two input strings as a percentage.
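A call might look like the following. The class body is a hypothetical stand-in written for this example, since the actual implementation is not shown in this conversation; only the class name and the `jaccard_similarity(str1, str2)` signature come from the description above. Whether the real method is an instance method, and which exception it raises, are assumptions.

```python
class JaccardSimilarity:
    """Hypothetical stand-in for the class described in this pull request."""

    def jaccard_similarity(self, str1: str, str2: str) -> float:
        set1 = set(str1.split())
        set2 = set(str2.split())
        union = set1 | set2
        if not union:
            # Assumption: inputs without any words are rejected.
            raise ValueError("Both inputs are empty")
        return 100 * len(set1 & set2) / len(union)


similarity = JaccardSimilarity()
score = similarity.jaccard_similarity("data science is fun", "data science is hard")
print(f"{score:.1f}%")  # 60.0%
```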

Error Handling

Error handling covers invalid input so that calculations stay reliable. Problems such as empty input strings raise exceptions with descriptive error messages and are logged.
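One way this kind of validation could look is sketched below. The helper name `validate_inputs`, the choice of `ValueError` and `TypeError`, and the message wording are all assumptions for illustration; the pull request description does not specify them.

```python
import logging

logger = logging.getLogger(__name__)


def validate_inputs(str1: str, str2: str) -> None:
    """Log and raise if either input is not a string or contains no words."""
    for name, value in (("str1", str1), ("str2", str2)):
        if not isinstance(value, str):
            message = f"{name} must be a string, got {type(value).__name__}"
            logger.error(message)
            raise TypeError(message)
        if not value.strip():
            message = f"{name} must contain at least one word"
            logger.error(message)
            raise ValueError(message)


try:
    validate_inputs("some text", "   ")
except ValueError as error:
    print(f"Rejected: {error}")  # Rejected: str2 must contain at least one word
```

Logging before raising keeps a record of the failure even when a caller catches the exception and continues.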

Benefits

  • Provides a straightforward method for comparing text based on set operations.
  • Useful for applications including document similarity, plagiarism detection, and information retrieval.
  • Easy to understand and implement with minimal dependencies.

Arko-Sengupta and others added 30 commits September 3, 2024 13:34
@algorithms-keeper

Closing this pull request as invalid

@Arko-Sengupta, this pull request is being closed as none of the checkboxes have been marked. It is important that you go through the checklist and mark the ones relevant to this pull request. Please read the Contributing guidelines.

If you're facing any problem on how to mark a checkbox, please read the following instructions:

  • Read the points one at a time and consider whether each is relevant to the pull request.
  • If it is, mark it by putting an x between the square brackets, like so: [x]

NOTE: Only [x] is supported, so any other letter or symbol between the brackets will be marked as invalid. If that is the case, please open a new pull request with the appropriate changes.

algorithms-keeper bot added the "awaiting reviews" label (This PR is ready to be reviewed) Sep 11, 2024
Labels: awaiting reviews (This PR is ready to be reviewed), invalid