Bloom Filter #8615

Merged
merged 32 commits into from
Apr 8, 2023
Changes from 20 commits
Commits
173ab0e
Bloom filter with tests
isidroas Apr 6, 2023
08bc970
has functions constant
isidroas Apr 6, 2023
0448109
fix type
isidroas Apr 6, 2023
486dcbc
isort
isidroas Apr 6, 2023
4111807
passing ruff
isidroas Apr 6, 2023
e6ce098
type hints
isidroas Apr 6, 2023
e4d39db
type hints
isidroas Apr 6, 2023
7629686
from fail to erro
isidroas Apr 6, 2023
3926167
captital leter
isidroas Apr 6, 2023
280ffa0
type hints requested by boot
isidroas Apr 6, 2023
5d460aa
descriptive name for m
isidroas Apr 6, 2023
cc54095
more descriptibe arguments II
isidroas Apr 6, 2023
78d19fd
moved movies_test to doctest
isidroas Apr 7, 2023
8b1bec0
commented doctest
isidroas Apr 7, 2023
28e6691
removed test_probability
isidroas Apr 7, 2023
2fd7196
estimated error
isidroas Apr 7, 2023
314237d
added types
isidroas Apr 7, 2023
9b01472
again hash_
isidroas Apr 7, 2023
c132d50
Update data_structures/hashing/bloom_filter.py
isidroas Apr 8, 2023
313c80c
from b to bloom
isidroas Apr 8, 2023
18e0dde
Update data_structures/hashing/bloom_filter.py
isidroas Apr 8, 2023
54041ff
Update data_structures/hashing/bloom_filter.py
isidroas Apr 8, 2023
483a2a0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2023
174ce08
syntax error in dict comprehension
isidroas Apr 8, 2023
00cc60e
from goodfather to godfather
isidroas Apr 8, 2023
35fa5f5
removed Interestellar
isidroas Apr 8, 2023
5cd20ea
forgot the last Godfather
isidroas Apr 8, 2023
7617143
Revert "removed Interestellar"
isidroas Apr 8, 2023
799171a
pretty dict
isidroas Apr 8, 2023
1a71f4c
Apply suggestions from code review
cclauss Apr 8, 2023
4e0263f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2023
e746746
Update bloom_filter.py
cclauss Apr 8, 2023
107 changes: 107 additions & 0 deletions data_structures/hashing/bloom_filter.py
@@ -0,0 +1,107 @@
"""
See https://en.wikipedia.org/wiki/Bloom_filter

This data structure tests membership in a set.
Compared to Python's built-in set() it is more space-efficient,
at the cost of occasional false positives.
In the following example, the filter is only 8 bits long:
>>> bloom = Bloom(size=8)
>>> "Titanic" in bloom
False

Initially the filter contains all zeros:
>>> bloom.bitstring
'00000000'

When an element is added, two bits are set to 1
since there are 2 hash functions in this implementation:
>>> bloom.add("Titanic")
>>> bloom.bitstring
'01100000'
>>> "Titanic" in bloom
True

However, sometimes only one bit is set
because both hash functions return the same position:
>>> bloom.add("Avatar")
>>> bloom.format_hash("Avatar")
'00000100'
>>> bloom.bitstring
'01100100'

Not added elements should return False ...
>>> "The Goodfather" in bloom
False
>>> bloom.format_hash("The Goodfather")
'00011000'
>>> "Interstellar" in bloom
False
>>> bloom.format_hash("Interstellar")
'00000011'
>>> "Parasite" in bloom
False
>>> bloom.format_hash("Parasite")
'00010010'
>>> "Pulp Fiction" in bloom
False
>>> bloom.format_hash("Pulp Fiction")
'10000100'
Member

Should this be The Goodfather or The Godfather? https://www.imdb.com/title/tt0068646

Suggested change
Not added elements should return False ...
>>> "The Goodfather" in bloom
False
>>> bloom.format_hash("The Goodfather")
'00011000'
>>> "Interstellar" in bloom
False
>>> bloom.format_hash("Interstellar")
'00000011'
>>> "Parasite" in bloom
False
>>> bloom.format_hash("Parasite")
'00010010'
>>> "Pulp Fiction" in bloom
False
>>> bloom.format_hash("Pulp Fiction")
'10000100'
Not added elements should return False ...
>>> not_present_films = ("The Goodfather", "Interstellar", "Parasite", "Pulp Fiction")
>>> {film: bloom.format_hash(film) for film in not_present_films}
{'The Goodfather': '00011000', 'Interstellar': '00000011', 'Parasite': '00010010', 'Pulp Fiction': '10000100'}
>>> any(film in bloom for film in not_present_films)
False

Contributor Author

I had to reduce the tuple in order to avoid the following ruff error:
[screenshot: 2023-04-08-171811_]

I also tried pretty-printing the dict, but the output doesn't match.

Member

Contributor Author

Thanks for teaching me that.

I also had to split the previous line across multiple lines:
[screenshot: 2023-04-08-184822_]

[screenshot: 2023-04-08-185453_]
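
The reviewer's reply in this thread did not load, so the technique it pointed to is unknown. As a hedged sketch of one way to keep a long doctest within the line-length limit, the example below splits the input with `...` continuation lines and lets the expected dict wrap by enabling doctest's `NORMALIZE_WHITESPACE` option. The names `film_lengths` and `run_examples` are hypothetical and not part of this PR, and `len(film)` stands in for `bloom.format_hash(film)` so the snippet stays self-contained.

```python
import doctest


def film_lengths() -> None:
    """
    Long input is split with "..." continuation lines:
    >>> films = (
    ...     "Interstellar",
    ...     "Parasite",
    ...     "Pulp Fiction",
    ... )

    With NORMALIZE_WHITESPACE the expected output may wrap freely:
    >>> {film: len(film) for film in films}  # doctest: +NORMALIZE_WHITESPACE
    {'Interstellar': 12,
     'Parasite': 8,
     'Pulp Fiction': 12}
    """


def run_examples() -> doctest.TestResults:
    # Parse the docstring above and run its examples in an empty namespace.
    parser = doctest.DocTestParser()
    test = parser.get_doctest(film_lengths.__doc__, {}, "film_lengths", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


if __name__ == "__main__":
    print(run_examples())
```

The option only relaxes whitespace comparison, so keys, values, and punctuation must still match the real output exactly.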


but sometimes there are false positives:
>>> "Ratatouille" in bloom
True
>>> bloom.format_hash("Ratatouille")
'01100000'

The false-positive probability increases with the number of added elements:
>>> bloom.estimated_error_rate()
0.140625
>>> bloom.add("The Goodfather")
>>> bloom.estimated_error_rate()
0.390625
>>> bloom.bitstring
'01111100'
"""
from hashlib import md5, sha256

HASH_FUNCTIONS = (sha256, md5)


class Bloom:
    def __init__(self, size: int = 8) -> None:
        # The filter is a Python int used as a bit field of `size` bits.
        self.bitarray = 0b0
        self.size = size

    def add(self, value: str) -> None:
        h = self.hash_(value)
        self.bitarray |= h

    def exists(self, value: str) -> bool:
        # Reported present only if every bit of the value's mask is already set.
        h = self.hash_(value)
        return (h & self.bitarray) == h

    def __contains__(self, other: str) -> bool:
        return self.exists(other)

    def format_bin(self, bitarray: int) -> str:
        # Left-pad with zeros so the string is always `size` characters long.
        res = bin(bitarray)[2:]
        return res.zfill(self.size)

    @property
    def bitstring(self) -> str:
        return self.format_bin(self.bitarray)

    def hash_(self, value: str) -> int:
        # Build a mask with one bit set per hash function.
        res = 0b0
        for func in HASH_FUNCTIONS:
            b = func(value.encode()).digest()
            position = int.from_bytes(b, "little") % self.size
            res |= 2**position
        return res

    def format_hash(self, value: str) -> str:
        return self.format_bin(self.hash_(value))

    def estimated_error_rate(self) -> float:
        # (fraction of bits set) ** (number of hash functions)
        n_ones = bin(self.bitarray).count("1")
        k = len(HASH_FUNCTIONS)
        return (n_ones / self.size) ** k
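
For context on `estimated_error_rate`, here is a minimal sketch comparing the fill-ratio estimate used in this file with the standard textbook approximation `(1 - e^(-k*n/m))^k`, which works from the number of inserted items `n` rather than the observed bits. Both function names are hypothetical and not part of this PR; the numbers reuse the docstring's 8-bit filter with 2 hash functions.

```python
from math import exp


def fill_ratio_estimate(n_ones: int, size: int, k: int) -> float:
    # Same formula as Bloom.estimated_error_rate: chance that k
    # independently chosen positions all land on bits already set.
    return (n_ones / size) ** k


def classical_estimate(n_items: int, size: int, k: int) -> float:
    # Textbook approximation that predicts the fill from the item count.
    return (1 - exp(-k * n_items / size)) ** k


if __name__ == "__main__":
    # After "Titanic" and "Avatar": 3 of 8 bits set, 2 items, k = 2.
    print(fill_ratio_estimate(n_ones=3, size=8, k=2))  # 0.140625, as in the docstring
    print(classical_estimate(n_items=2, size=8, k=2))
```

The fill-ratio form needs no item counter because it reads the state of the filter directly, which fits this implementation since `Bloom` does not track how many values were added.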