Add bitap_string_match algo #11060
Merged
@@ -0,0 +1,79 @@
"""
Bitap exact string matching
https://en.wikipedia.org/wiki/Bitap_algorithm

Searches for a pattern inside text, and returns the index of the first occurrence
of the pattern. Both text and pattern consist of lowercase alphabetical characters only.

Complexity: O(m*n)
    n = length of text
    m = length of pattern

Python doctests can be run using this command:
python3 -m doctest -v bitap_string_match.py
"""


def bitap_string_match(text: str, pattern: str) -> int | None:
    """
    Retrieves the index of the first occurrence of pattern in text.

    Args:
        text: A string consisting only of lowercase alphabetical characters.
        pattern: A string consisting only of lowercase alphabetical characters.

    Returns:
        int | None: The index where pattern first occurs, or None if pattern
        does not occur in text.

    >>> bitap_string_match('abdabababc', 'ababc')
    5
    >>> bitap_string_match('aaaaaaaaaaaaaaaaaa', 'a')
    0
    >>> bitap_string_match('zxywsijdfosdfnso', 'zxywsijdfosdfnso')
    0
    >>> bitap_string_match('abdabababc', '')
    0
    >>> bitap_string_match('abdabababc', 'c')
    9
    >>> bitap_string_match('abdabababc', 'fofosdfo') is None
    True
    >>> bitap_string_match('abdab', 'fofosdfo') is None
    True
    """
    m: int = len(pattern)
    if m == 0:
        return 0
    if m > len(text):
        return None

    # Initial state of bit string 1110
    state: int = ~1
    # Bit = 0 if character appears at index, and 1 otherwise
    pattern_mask: list[int] = [~0] * 27  # 1111

    for i in range(m):
        # In the pattern mask for this character, set bit i to 0 for each
        # index i at which the character appears.
        pattern_index: int = ord(pattern[i]) - ord("a")
        pattern_mask[pattern_index] &= ~(1 << i)

    for i in range(len(text)):
        text_index: int = ord(text[i]) - ord("a")
        # If this character does not appear in pattern, its pattern mask is 1111.
        # Performing a bitwise OR between state and 1111 will reset the state to 1111
        # and restart the search from the start of pattern.
        state |= pattern_mask[text_index]
        state <<= 1

        # If the mth bit (counting right to left) of the state is 0, then we have
        # found pattern in text
        if (state & (1 << m)) == 0:
            return i - m + 1

    return None


if __name__ == "__main__":
    import doctest

    doctest.testmod()
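To make the state updates above easier to follow, here is a minimal sketch that replays the same OR-then-shift loop and records the low m + 1 bits of the state after each text character. The helper name `bitap_trace` and the dictionary-based masks are assumptions made for illustration; they are not part of this PR.

```python
def bitap_trace(text: str, pattern: str) -> list[str]:
    """Hypothetical helper: low m + 1 bits of the state after each text character."""
    m = len(pattern)
    width = m + 1
    window = (1 << width) - 1  # keeps only the bits we want to display

    # Same mask convention as the PR: bit i is 0 where pattern[i] is the character.
    # A dict replaces the fixed-size list (an assumption of this sketch).
    masks: dict[str, int] = {}
    for i, char in enumerate(pattern):
        masks[char] = masks.get(char, ~0) & ~(1 << i)

    state = ~1  # ...1110, as in the PR
    rows = []
    for char in text:
        state |= masks.get(char, ~0)  # characters absent from pattern reset the state
        state <<= 1
        rows.append(format(state & window, f"0{width}b"))
    return rows


for char, bits in zip("abab", bitap_trace("abab", "ab")):
    print(char, bits)
```

For text `abab` and pattern `ab` this prints `100`, `010`, `100`, `010` next to the four characters. A 0 in the leftmost position (bit m) marks a match ending at that character, which is the condition the function checks before returning `i - m + 1`.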
All falsey values for `pattern` are equally dangerous as `""`.
Python and mypy are both smart enough to figure out without the type hint that if `len(pattern)` is assigned to `m` then `m` is an int.
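One possible reading of this review comment, sketched as a small standalone function. The name `pattern_length_or_zero` is hypothetical, and the truthiness guard is an assumption about what the reviewer had in mind rather than the code in this diff.

```python
def pattern_length_or_zero(pattern: str) -> int:
    """Hypothetical helper illustrating the two points of the review comment."""
    # `not pattern` is true for every falsy value: "" here, but also None
    # if a caller ignores the annotation. `len(None)` would raise TypeError.
    if not pattern:
        return 0
    # len() is typed as returning int, so m needs no explicit `: int` hint.
    m = len(pattern)
    return m
```

With this guard, an empty pattern and a `None` passed by mistake both take the early return instead of reaching `len()`.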