Jaccard Similarity Algorithm

Arko-Sengupta · 2024-09-11T15:14:01Z

Overview

This pull request introduces a new implementation of the Jaccard similarity algorithm in the JaccardSimilarity class. Jaccard similarity is a classical metric used in natural language processing and information retrieval to measure how similar two sets are, computed as the size of their intersection divided by the size of their union.

Key Features

Set Representation: Converts input text into sets of words for comparison.
Tokenization: Splits the input strings into words based on whitespace.
Intersection Calculation: Determines the common elements between two sets of words.
Union Calculation: Computes the total unique elements in both sets combined.
Jaccard Similarity Calculation: Measures the ratio of the intersection size to the union size, giving a similarity score between 0% and 100%.
Percentage Output: Provides the similarity score as a percentage for easier interpretation.
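
The features above can be sketched as a single function. This is a minimal illustration, not the code from this pull request: the function name jaccard_percentage is hypothetical, comparison is assumed to be case-sensitive, and returning 0.0 when both inputs are empty is an assumption made to avoid dividing by zero.

```python
def jaccard_percentage(text_a: str, text_b: str) -> float:
    """Return the Jaccard similarity of two strings as a percentage."""
    # Tokenization and set representation: split on whitespace,
    # then let set() drop duplicate words.
    words_a = set(text_a.split())
    words_b = set(text_b.split())

    # Intersection: words present in both texts.
    intersection = words_a & words_b
    # Union: every unique word from either text.
    union = words_a | words_b

    # Assumption: two empty inputs score 0.0 instead of dividing by zero.
    if not union:
        return 0.0

    # Ratio of intersection size to union size, scaled to a percentage.
    return len(intersection) / len(union) * 100


print(jaccard_percentage("the cat sat", "the cat ran"))  # 50.0
```

Here the two texts share 2 words out of 4 unique words overall, so the score is 50%.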

Mathematical Foundation

Intersection: The number of elements common to both sets.
Union: The total number of unique elements in both sets combined.

Jaccard Similarity Formula:

Jaccard Similarity = (Size of Intersection) / (Size of Union) × 100%

where 100% indicates identical sets and 0% indicates no overlap.
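
As a worked example, take two small word sets chosen purely for illustration (they do not come from the pull request). The symbols A, B, and J are notation introduced here for the two sets and the similarity score.

```latex
\[
A = \{\text{the}, \text{cat}, \text{sat}\}, \qquad
B = \{\text{the}, \text{cat}, \text{ran}\}
\]
\[
A \cap B = \{\text{the}, \text{cat}\}, \qquad
A \cup B = \{\text{the}, \text{cat}, \text{sat}, \text{ran}\}
\]
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} \times 100\%
        = \frac{2}{4} \times 100\% = 50\%
\]
```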

Usage

The JaccardSimilarity class exposes one method for comparing two strings:
jaccard_similarity(str1, str2): Computes the Jaccard similarity between the two input strings as a percentage.
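
A call might look like the sketch below. The class and method names come from the description above, but the body is a stand-in written for illustration: whether the real method is an instance method, and how it treats empty input, are assumptions.

```python
class JaccardSimilarity:
    """Illustrative stand-in for the class described in this pull request."""

    def jaccard_similarity(self, str1: str, str2: str) -> float:
        # Build word sets by splitting each string on whitespace.
        set1, set2 = set(str1.split()), set(str2.split())
        union = set1 | set2
        # Assumption: score 0.0 when there are no words at all.
        if not union:
            return 0.0
        return len(set1 & set2) / len(union) * 100


similarity = JaccardSimilarity()
score = similarity.jaccard_similarity("data science is fun", "data science is hard")
print(f"{score:.1f}%")  # 60.0%
```

The two sentences share 3 of 5 unique words, which gives 60%.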

Error Handling

Error handling is implemented to keep calculations reliable. Invalid input, such as an empty input string, raises an exception with a descriptive error message, and the error is logged.
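
One way this behaviour could look is sketched below. The pull request text does not say which exception type or logger is used, so ValueError, the standard logging module, and the function name validated_jaccard are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def validated_jaccard(str1: str, str2: str) -> float:
    """Compute the Jaccard percentage, rejecting empty input."""
    try:
        # Assumption: a string that is empty or only whitespace is invalid.
        if not str1.strip() or not str2.strip():
            raise ValueError("Input strings must not be empty.")
        set1, set2 = set(str1.split()), set(str2.split())
        # The union is non-empty here because both inputs contain a word.
        return len(set1 & set2) / len(set1 | set2) * 100
    except ValueError:
        # Log the failure with its traceback, then re-raise for the caller.
        logger.exception("Jaccard similarity calculation failed")
        raise
```

Re-raising after logging keeps the failure visible to the caller instead of silently returning a default score.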

Benefits

Provides a straightforward method for comparing text based on set operations.
Useful for applications including document similarity, plagiarism detection, and information retrieval.
Easy to understand and implement with minimal dependencies.


algorithms-keeper · 2024-09-11T15:14:05Z

Closing this pull request as invalid

@Arko-Sengupta, this pull request is being closed as none of the checkboxes have been marked. It is important that you go through the checklist and mark the ones relevant to this pull request. Please read the Contributing guidelines.

If you're facing any problem on how to mark a checkbox, please read the following instructions:

Read a point one at a time and think if it is relevant to the pull request or not.
If it is, then mark it by putting an x between the square brackets, like so: [x]

NOTE: Only [x] is supported so if you have put any other letter or symbol between the brackets, that will be marked as invalid. If that is the case then please open a new pull request with the appropriate changes.

Arko-Sengupta and others added 30 commits September 3, 2024 13:34

Cosine Similarity Algorithm | Machine Learning

0af293b

[pre-commit.ci] auto fixes from pre-commit.com hooks

3a62339

for more information, see https://pre-commit.ci

Input Fixes

e8ec6df

Input Fixes

1458803

[pre-commit.ci] auto fixes from pre-commit.com hooks

030ced3

for more information, see https://pre-commit.ci

Lower Case Fixes

768015c

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8deb03

for more information, see https://pre-commit.ci

Case Fixes

d597f45

Case Fixes

2479eef

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa91225

for more information, see https://pre-commit.ci

spaCy Fixes

1b87ff9

Fixed Model Dependency

2fe680f

[pre-commit.ci] auto fixes from pre-commit.com hooks

0336893

for more information, see https://pre-commit.ci

Fixed Model Dependency

522edab

Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit…

135d9ea

…hms-Python-Open-Source

Resolved All Doctests

4cbeb62

[pre-commit.ci] auto fixes from pre-commit.com hooks

70a6de4

for more information, see https://pre-commit.ci

Resolved all DocTests

c892be4

Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit…

89aef3c

…hms-Python-Open-Source

[pre-commit.ci] auto fixes from pre-commit.com hooks

26c7117

for more information, see https://pre-commit.ci

Resolved All Dependencies

547e538

Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit…

7158e47

…hms-Python-Open-Source

Resolved Dependency in DocTest

b1738d9

Resolved Dependency from All Methods

4d94aaf

Loaded Package at a Time

e0f24f2

Cleared All Dependencies

3a3f30c

[pre-commit.ci] auto fixes from pre-commit.com hooks

147bcb2

for more information, see https://pre-commit.ci

Cleared All Dependencies

d320b99

Resolved Package OS Error

2aa3608

[pre-commit.ci] auto fixes from pre-commit.com hooks

8c15055

for more information, see https://pre-commit.ci

Arko-Sengupta and others added 5 commits September 11, 2024 14:44

Merge branch 'TheAlgorithms:master' into master

90c4446

Jaccard Similarity | Machine Learning

cc4258d

[pre-commit.ci] auto fixes from pre-commit.com hooks

6ebe310

for more information, see https://pre-commit.ci

Correct Seperate Algo Conflict

3851df0

Jaccard Similarity | Machine Learning

3303efd

algorithms-keeper bot added the invalid label Sep 11, 2024

algorithms-keeper bot closed this Sep 11, 2024

algorithms-keeper bot added the awaiting reviews label (This PR is ready to be reviewed) Sep 11, 2024

Jaccard Similarity Algorithm | Machine Learning #11559