Add bitap_string_match algo #11060

Merged
merged 5 commits into from
Oct 28, 2023
Changes from 2 commits
79 changes: 79 additions & 0 deletions strings/bitap_string_match.py
@@ -0,0 +1,79 @@
"""
Bitap exact string matching
https://en.wikipedia.org/wiki/Bitap_algorithm

Searches for a pattern inside text, and returns the index of the first occurrence
of the pattern. Both text and pattern consist of lowercase alphabetical characters only.

Complexity: O(m*n)
n = length of text
m = length of pattern

Python doctests can be run using this command:
python3 -m doctest -v bitap_string_match.py
"""


def bitap_string_match(text: str, pattern: str) -> int | None:
    """
    Retrieves the index of the first occurrence of pattern in text.

    Args:
        text: A string consisting only of lowercase alphabetical characters.
        pattern: A string consisting only of lowercase alphabetical characters.

    Returns:
        int: The index where pattern first occurs.

    >>> bitap_string_match('abdabababc', 'ababc')
    5
    >>> bitap_string_match('aaaaaaaaaaaaaaaaaa', 'a')
    0
    >>> bitap_string_match('zxywsijdfosdfnso', 'zxywsijdfosdfnso')
    0
    >>> bitap_string_match('abdabababc', '')
    0
    >>> bitap_string_match('abdabababc', 'c')
    9
    >>> bitap_string_match('abdabababc', 'fofosdfo') is None
    True
    >>> bitap_string_match('abdab', 'fofosdfo') is None
    True
    """
    m: int = len(pattern)
    if m == 0:
        return 0
Member

Any falsy value for pattern is just as dangerous as "".

Python and mypy are both smart enough to figure out, without the type hint, that if len(pattern) is assigned to m then m is an int.

Suggested change
-    m: int = len(pattern)
-    if m == 0:
-        return 0
+    if not pattern:
+        return 0
+    m = len(pattern)
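
To illustrate the point of the suggestion, here is a minimal sketch of how a truthiness check behaves compared with calling `len()` first. The helper name `is_empty_pattern` is hypothetical and not part of the PR; the `None` case is an assumption about what "falsy values" might cover, since the function's signature only promises `str`.

```python
from __future__ import annotations


def is_empty_pattern(pattern: str | None) -> bool:
    """Hypothetical helper mirroring the suggested `if not pattern` guard."""
    return not pattern


# "" and None are both falsy, so the guard catches them before len() is called.
assert is_empty_pattern("")
assert is_empty_pattern(None)
assert not is_empty_pattern("ababc")

# By contrast, calling len() first raises on None instead of returning early.
try:
    len(None)  # type: ignore[arg-type]
except TypeError:
    print("len(None) raises TypeError")

# len() is declared to return int, so this assignment needs no annotation.
m = len("ababc")
assert isinstance(m, int)
```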

    if m > len(text):
        return None

    # Initial state of bit string 1110
    state: int = ~1
Member

Suggested change
-    state: int = ~1
+    state = ~1

    # Bit = 0 if character appears at index, and 1 otherwise
    pattern_mask: list[int] = [~0] * 27  # 1111

    for i in range(m):
        # For the pattern mask for this character, set the bit to 0 for each i
        # the character appears.
        pattern_index: int = ord(pattern[i]) - ord("a")
        pattern_mask[pattern_index] &= ~(1 << i)

    for i in range(len(text)):
        text_index: int = ord(text[i]) - ord("a")
        # If this character does not appear in pattern, its pattern mask is 1111.
        # Performing a bitwise OR between state and 1111 will reset the state to 1111
        # and start searching the start of pattern again.
        state |= pattern_mask[text_index]
        state <<= 1

        # If the mth bit (counting right to left) of the state is 0, then we have
        # found pattern in text
        if (state & (1 << m)) == 0:
            return i - m + 1

    return None


if __name__ == "__main__":
    import doctest

    doctest.testmod()
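
The comments in the diff describe how the state and the per-character masks interact. As a rough illustration, the sketch below records the low `m + 1` bits of the state after each text character. `trace_bitap` is a hypothetical helper written for this illustration, not part of the PR, and like the PR it assumes lowercase alphabetical input only.

```python
def trace_bitap(text: str, pattern: str) -> list[str]:
    """Hypothetical helper: low m+1 bits of the state after each text character."""
    m = len(pattern)
    width = m + 1
    low_bits = (1 << width) - 1  # keep only the bits the match test looks at

    # One mask per lowercase letter; bit i is 0 where pattern[i] is that letter.
    pattern_mask = [~0] * 26
    for i, char in enumerate(pattern):
        pattern_mask[ord(char) - ord("a")] &= ~(1 << i)

    state = ~1  # same initial state as in the diff
    rows = []
    for char in text:
        state |= pattern_mask[ord(char) - ord("a")]
        state <<= 1
        rows.append(format(state & low_bits, f"0{width}b"))
    return rows


for char, bits in zip("abab", trace_bitap("abab", "ab")):
    print(char, bits)
# a 100
# b 010
# a 100
# b 010
```

A row whose leftmost bit (bit `m`) is 0 marks a match ending at that text index, which is the condition the PR tests with `(state & (1 << m)) == 0`.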