Skip to content

adding jaccard similarity #1270

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 4, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions maths/jaccard_similarity.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
"""
The Jaccard similarity coefficient is a commonly used indicator of the
similarity between two sets. Let U be a set and A and B be subsets of U,
then the Jaccard index/similarity is defined to be the ratio of the number
of elements of their intersection and the number of elements of their union.
Inspired from Wikipedia and
the book Mining of Massive Datasets [MMDS 2nd Edition, Chapter 3]
https://en.wikipedia.org/wiki/Jaccard_index
https://mmds.org
Jaccard similarity is widely used with MinHashing.
"""


def jaccard_similariy(setA, setB, alternativeUnion=False):
"""
Finds the jaccard similarity between two sets.
Essentially, its intersection over union.
The alternative way to calculate this is to take union as sum of the
number of items in the two sets. This will lead to jaccard similarity
of a set with itself be 1/2 instead of 1. [MMDS 2nd Edition, Page 77]
Parameters:
:setA (set,list,tuple): A non-empty set/list
:setB (set,list,tuple): A non-empty set/list
:alternativeUnion (boolean): If True, use sum of number of
items as union
Output:
(float) The jaccard similarity between the two sets.
Examples:
>>> setA = {'a', 'b', 'c', 'd', 'e'}
>>> setB = {'c', 'd', 'e', 'f', 'h', 'i'}
>>> jaccard_similariy(setA,setB)
0.375
>>> jaccard_similariy(setA,setA)
1.0
>>> jaccard_similariy(setA,setA,True)
0.5
>>> setA = ['a', 'b', 'c', 'd', 'e']
>>> setB = ('c', 'd', 'e', 'f', 'h', 'i')
>>> jaccard_similariy(setA,setB)
0.375
"""

if isinstance(setA, set) and isinstance(setB, set):

intersection = len(setA.intersection(setB))

if alternativeUnion:
union = len(setA) + len(setB)
else:
union = len(setA.union(setB))

return intersection / union

if isinstance(setA, (list, tuple)) and isinstance(setB, (list, tuple)):

intersection = [element for element in setA if element in setB]

if alternativeUnion:
union = len(setA) + len(setB)
else:
union = setA + [element for element in setB if element not in setA]

return len(intersection) / len(union)


if __name__ == "__main__":

setA = {"a", "b", "c", "d", "e"}
setB = {"c", "d", "e", "f", "h", "i"}
print(jaccard_similariy(setA, setB))
File renamed without changes.