Add bitap_string_match algo #11060
Changes from 4 commits
Original file line number | Diff line number | Diff line change | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,79 @@ | ||||||||||||||
""" | ||||||||||||||
Bitap exact string matching | ||||||||||||||
https://en.wikipedia.org/wiki/Bitap_algorithm | ||||||||||||||
|
||||||||||||||
Searches for a pattern inside text, and returns the index of the first occurrence | ||||||||||||||
of the pattern. Both text and pattern consist of lowercase alphabetical characters only. | ||||||||||||||
|
||||||||||||||
Complexity: O(m*n) | ||||||||||||||
n = length of text | ||||||||||||||
m = length of pattern | ||||||||||||||
|
||||||||||||||
Python doctests can be run using this command: | ||||||||||||||
python3 -m doctest -v bitap_string_match.py | ||||||||||||||
""" | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
def bitap_string_match(text: str, pattern: str) -> int: | ||||||||||||||
""" | ||||||||||||||
Retrieves the index of the first occurrence of pattern in text. | ||||||||||||||
|
||||||||||||||
Args: | ||||||||||||||
text: A string consisting only of lowercase alphabetical characters. | ||||||||||||||
pattern: A string consisting only of lowercase alphabetical characters. | ||||||||||||||
|
||||||||||||||
Returns: | ||||||||||||||
int: The index where pattern first occurs. Return -1 if not found. | ||||||||||||||
|
||||||||||||||
>>> bitap_string_match('abdabababc', 'ababc') | ||||||||||||||
5 | ||||||||||||||
>>> bitap_string_match('aaaaaaaaaaaaaaaaaa', 'a') | ||||||||||||||
0 | ||||||||||||||
>>> bitap_string_match('zxywsijdfosdfnso', 'zxywsijdfosdfnso') | ||||||||||||||
0 | ||||||||||||||
>>> bitap_string_match('abdabababc', '') | ||||||||||||||
0 | ||||||||||||||
>>> bitap_string_match('abdabababc', 'c') | ||||||||||||||
9 | ||||||||||||||
>>> bitap_string_match('abdabababc', 'fofosdfo') | ||||||||||||||
-1 | ||||||||||||||
>>> bitap_string_match('abdab', 'fofosdfo') | ||||||||||||||
-1 | ||||||||||||||
""" | ||||||||||||||
m: int = len(pattern) | ||||||||||||||
if m == 0: | ||||||||||||||
return 0 | ||||||||||||||
All falsey values for Python and mypy are both smart enough to figure out without the type hint that if
Suggested change
|
||||||||||||||
if m > len(text): | ||||||||||||||
return -1 | ||||||||||||||
|
||||||||||||||
# Initial state of bit string 1110 | ||||||||||||||
state: int = ~1 | ||||||||||||||
|
||||||||||||||
# Bit = 0 if character appears at index, and 1 otherwise | ||||||||||||||
pattern_mask: list[int] = [~0] * 27 # 1111 | ||||||||||||||
|
||||||||||||||
for i, char in enumerate(pattern): | ||||||||||||||
# For the pattern mask for this character, set bit i to 0 for each index i | ||||||||||||||
# at which the character appears. | ||||||||||||||
pattern_index: int = ord(char) - ord("a") | ||||||||||||||
pattern_mask[pattern_index] &= ~(1 << i) | ||||||||||||||
|
||||||||||||||
for i, char in enumerate(text): | ||||||||||||||
text_index = ord(char) - ord("a") | ||||||||||||||
# If this character does not appear in pattern, its pattern mask is 1111. | ||||||||||||||
# Performing a bitwise OR between state and 1111 will reset the state to 1111 | ||||||||||||||
# and start searching the start of pattern again. | ||||||||||||||
state |= pattern_mask[text_index] | ||||||||||||||
state <<= 1 | ||||||||||||||
|
||||||||||||||
# If the mth bit (counting right to left) of the state is 0, then we have | ||||||||||||||
# found pattern in text | ||||||||||||||
if (state & (1 << m)) == 0: | ||||||||||||||
return i - m + 1 | ||||||||||||||
|
||||||||||||||
return -1 | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
if __name__ == "__main__": | ||||||||||||||
import doctest | ||||||||||||||
|
||||||||||||||
doctest.testmod() |
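
Because the diff view above drops indentation, here is a minimal runnable sketch of the same shift-and-OR logic. It is not the PR's exact code: the name `bitap_first_match` is hypothetical, and a `dict` stands in for the fixed-size `pattern_mask` list (an assumption made so that characters outside `a`-`z` fall back to the all-ones mask instead of raising an error).

```python
def bitap_first_match(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    if m > len(text):
        return -1

    all_ones = ~0  # ...1111

    # Bit i of a character's mask is 0 when pattern[i] is that character and
    # 1 otherwise. Assumption: a dict replaces the diff's 27-element list.
    masks: dict[str, int] = {}
    for i, char in enumerate(pattern):
        masks[char] = masks.get(char, all_ones) & ~(1 << i)

    # ...1110: bit 0 is clear because the empty prefix always matches.
    state = ~1

    for i, char in enumerate(text):
        # A 0 in bit j survives the OR only if pattern[j] == char, so a
        # partial match either extends by one character or is dropped.
        state |= masks.get(char, all_ones)
        state <<= 1

        # Bit m clear means all m pattern characters matched, ending at i.
        if (state & (1 << m)) == 0:
            return i - m + 1

    return -1


if __name__ == "__main__":
    # Cross-check against str.find on the inputs used in the doctests above.
    for text, pattern in [
        ("abdabababc", "ababc"),
        ("abdabababc", ""),
        ("abdabababc", "c"),
        ("abdab", "fofosdfo"),
    ]:
        assert bitap_first_match(text, pattern) == text.find(pattern)
```

The cross-check against `str.find` is a quick way to confirm that the bit manipulation agrees with a reference implementation on the doctest inputs.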
This list is in alphabetical order to make it easy to spot missing values and almost impossible to add duplicates.
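
The point about ordering can be illustrated with a small sketch: in a sorted list any duplicate sits directly next to its twin, so a single pass over adjacent pairs catches both duplicates and misplaced entries. The helper name `check_sorted_unique` and the sample values are hypothetical and are not taken from the PR.

```python
def check_sorted_unique(values: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means the input is fine."""
    problems = []
    # Compare each entry with the one before it.
    for previous, current in zip(values, values[1:]):
        if current == previous:
            problems.append(f"duplicate: {current!r}")
        elif current < previous:
            problems.append(f"out of order: {current!r} after {previous!r}")
    return problems


if __name__ == "__main__":
    print(check_sorted_unique(["alpha", "beta", "gamma"]))
    print(check_sorted_unique(["alpha", "alpha", "beta"]))
    print(check_sorted_unique(["beta", "alpha"]))
```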