Add AhoCorasick
#4465
Merged
Changes from 37 commits
Commits
45 commits
8bacb96
Added code to find Articulation Points and Bridges
Prabhat-Kumar-42 478bf04
tried to solve clang-formant test
Prabhat-Kumar-42 8e88928
removed new line at EOF to get lint to pass
Prabhat-Kumar-42 db5d447
feature: Added Ahocorasick Algorithm
Prabhat-Kumar-42 2cc9e3f
fixed lint using clang-format
Prabhat-Kumar-42 d2acaf6
removed datastructures/graphs/ArticulationPointsAndBridge.java from t…
Prabhat-Kumar-42 37c92ad
removed main, since test-file is added. Also modified and renamed few…
Prabhat-Kumar-42 efd912d
Added test-file for AhoCorasick Algorithm
Prabhat-Kumar-42 194a37d
Modified some comments in test-file
Prabhat-Kumar-42 288168c
Modified some comments in AhoCorasick.java
Prabhat-Kumar-42 6c4e2c2
lint fix
Prabhat-Kumar-42 ab22511
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 af71647
added few more test cases
Prabhat-Kumar-42 96f6231
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 59dfa0e
Modified some comments
Prabhat-Kumar-42 b135163
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 6a75149
Change all class fields to private, added initializeSuffixLinksForChi…
Prabhat-Kumar-42 dd50c78
Added Missing Test-Cases and more
Prabhat-Kumar-42 f464bf8
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 23b63cb
minor text changes
Prabhat-Kumar-42 42b21da
added direct test check i.e. defining a variable expected and just ch…
Prabhat-Kumar-42 106dfac
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 704b8e1
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 a28df8c
Created New Class Trie, merged 'buildTrie and buildSuffixAndOutputLin…
Prabhat-Kumar-42 b7cc61a
Updated TestFile according to the updated AhoCorasick Class. Added Fe…
Prabhat-Kumar-42 a164aab
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 f5defac
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 a1cbda6
updated - broken down constructor to relavent parts, made string fina…
Prabhat-Kumar-42 f60392b
lint fix clang
Prabhat-Kumar-42 62dfb86
Updated Tests Files
Prabhat-Kumar-42 1528cbf
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 8b9a831
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 69f3aa8
Added final field to Node class setters and Trie Constructor argument…
Prabhat-Kumar-42 fc9ca11
updated test file
Prabhat-Kumar-42 846bbda
lint fix clang
Prabhat-Kumar-42 82586b4
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 6f81843
minor chage - 'removed a comment'
Prabhat-Kumar-42 4fa8df1
added final fields to some arguments, class and variables, added a me…
Prabhat-Kumar-42 4a53103
updated to remove * inclusion and added the required modules only
Prabhat-Kumar-42 653db0b
Implemented a new class PatternPositionRecorder to wrap up the positi…
Prabhat-Kumar-42 a2ec697
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 b409f7a
Added final fields to PatternPositionRecorder Class
Prabhat-Kumar-42 eb1c369
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 b38f067
style: mark default constructor of `AhoCorasick` as `private`
vil02 c7c743b
style: remoce redundant `public`
vil02
227 changes: 227 additions & 0 deletions
src/main/java/com/thealgorithms/strings/AhoCorasick.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,227 @@ | ||
/* | ||
* Aho-Corasick String Matching Algorithm Implementation | ||
* | ||
* This code implements the Aho-Corasick algorithm, which is used for efficient | ||
* string matching in a given text. It can find multiple patterns simultaneously | ||
* and records their positions in the text. | ||
* | ||
* Author: Prabhat-Kumar-42 | ||
* GitHub: https://github.com/Prabhat-Kumar-42 | ||
*/ | ||
|
||
package com.thealgorithms.strings; | ||
|
||
import java.util.ArrayList; | ||
import java.util.HashMap; | ||
import java.util.LinkedList; | ||
import java.util.Map; | ||
import java.util.Queue; | ||
|
||
public class AhoCorasick { | ||
|
||
|
||
// Trie Node Class | ||
public class Node { | ||
|
||
// Outgoing edges: maps a character to the child node reached by it | ||
private HashMap<Character, Node> child = new HashMap<>(); // Child nodes of the current node | ||
private Node suffixLink; // Suffix link to another node in the trie | ||
private Node outputLink; // Output link to another node in the trie | ||
private int patternInd; // Index of the pattern that ends at this node | ||
|
||
Node() { | ||
this.suffixLink = null; | ||
this.outputLink = null; | ||
this.patternInd = -1; | ||
} | ||
|
||
public HashMap<Character, Node> getChild() { | ||
return child; | ||
} | ||
|
||
public Node getSuffixLink() { | ||
return suffixLink; | ||
} | ||
|
||
public void setSuffixLink(final Node suffixLink) { | ||
this.suffixLink = suffixLink; | ||
} | ||
|
||
public Node getOutputLink() { | ||
return outputLink; | ||
} | ||
|
||
public void setOutputLink(final Node outputLink) { | ||
this.outputLink = outputLink; | ||
} | ||
|
||
public int getPatternInd() { | ||
return patternInd; | ||
} | ||
|
||
public void setPatternInd(final int patternInd) { | ||
this.patternInd = patternInd; | ||
} | ||
} | ||
|
||
// Trie Class | ||
public class Trie { | ||
|
||
|
||
private Node root = null; // Root node of the trie | ||
private final String[] patterns; // patterns according to which Trie is constructed | ||
|
||
public Trie(final String[] patterns) { | ||
root = new Node(); // Initialize the root of the trie | ||
this.patterns = patterns; | ||
buildTrie(); | ||
buildSuffixAndOutputLinks(); | ||
} | ||
|
||
// Builds the Aho-Corasick trie | ||
private void buildTrie() { | ||
|
||
// Loop through each input pattern and insert it into the trie | ||
for (int i = 0; i < patterns.length; i++) { | ||
Node curr = root; // Start at the root of the trie for each pattern | ||
|
||
// Loop through each character in the current pattern | ||
for (int j = 0; j < patterns[i].length(); j++) { | ||
char c = patterns[i].charAt(j); // Get the current character | ||
|
||
// Check if the current node has a child for the current character | ||
if (curr.getChild().containsKey(c)) { | ||
curr = curr.getChild().get(c); // Update the current node to the child node | ||
} else { | ||
// If no child node exists, create a new one and add it to the current node's children | ||
Node nn = new Node(); | ||
curr.getChild().put(c, nn); | ||
curr = nn; // Update the current node to the new child node | ||
} | ||
} | ||
curr.setPatternInd(i); // Store the pattern index in the node where the pattern ends (not necessarily a leaf) | ||
} | ||
} | ||
|
||
private void initializeSuffixLinksForChildNodesOfTheRoot(Queue<Node> q) { | ||
for (char rc : root.getChild().keySet()) { | ||
Node childNode = root.getChild().get(rc); | ||
q.add(childNode); // Add child node to the queue | ||
childNode.setSuffixLink(root); // Set suffix link to the root | ||
} | ||
} | ||
|
||
private void buildSuffixAndOutputLinks() { | ||
root.setSuffixLink(root); // Initialize the suffix link of the root to itself | ||
Queue<Node> q = new LinkedList<>(); // Initialize a queue for BFS traversal | ||
|
||
initializeSuffixLinksForChildNodesOfTheRoot(q); | ||
|
||
while (!q.isEmpty()) { | ||
Node currentState = q.poll(); // Get the current node for processing | ||
|
||
// Iterate through child nodes of the current node | ||
for (char cc : currentState.getChild().keySet()) { | ||
Node currentChild = currentState.getChild().get(cc); // Get the child node | ||
Node parentSuffix = currentState.getSuffixLink(); // Get the parent's suffix link | ||
|
||
// Calculate the suffix link for the child based on the parent's suffix link | ||
while (!parentSuffix.getChild().containsKey(cc) && parentSuffix != root) { | ||
parentSuffix = parentSuffix.getSuffixLink(); | ||
} | ||
|
||
// Set the calculated suffix link or default to root | ||
if (parentSuffix.getChild().containsKey(cc)) { | ||
currentChild.setSuffixLink(parentSuffix.getChild().get(cc)); | ||
} else { | ||
currentChild.setSuffixLink(root); | ||
} | ||
|
||
q.add(currentChild); // Add the child node to the queue for further processing | ||
} | ||
|
||
// Establish output links for nodes to efficiently identify patterns within patterns | ||
if (currentState.getSuffixLink().getPatternInd() >= 0) { | ||
currentState.setOutputLink(currentState.getSuffixLink()); | ||
} else { | ||
currentState.setOutputLink(currentState.getSuffixLink().getOutputLink()); | ||
} | ||
} | ||
} | ||
|
||
// Searches for patterns in the input text and records their positions | ||
public ArrayList<ArrayList<Integer>> searchIn(String text) { | ||
|
||
/* | ||
* positionByStringIndexValue is a list of lists, indexed by pattern: | ||
* entry k holds the positions at which patterns[k] was found in the text. | ||
* e.g. list[0] holds the positions of patterns[0], | ||
* list[1] holds the positions of patterns[1], | ||
* ... | ||
* list[n - 1] holds the positions of patterns[n - 1]. | ||
* End indices are recorded during the scan and converted to start indices before returning. | ||
*/ | ||
ArrayList<ArrayList<Integer>> positionByStringIndexValue = new ArrayList<>(patterns.length); // Stores positions where patterns are found in the text | ||
Node parent = root; // Start searching from the root node | ||
|
||
for (int i = 0; i < patterns.length; i++) { | ||
positionByStringIndexValue.add(new ArrayList<Integer>()); // Initialize a list to store positions of the current pattern | ||
} | ||
|
||
|
||
for (int i = 0; i < text.length(); i++) { | ||
char ch = text.charAt(i); // Get the current character in the text | ||
|
||
// Check if the current node has a child for the current character | ||
if (parent.getChild().containsKey(ch)) { | ||
parent = parent.getChild().get(ch); // Update the current node to the child node | ||
|
||
// If the current node represents a pattern, record its position in positionByStringIndexValue | ||
if (parent.getPatternInd() > -1) { | ||
positionByStringIndexValue.get(parent.getPatternInd()).add(i); | ||
} | ||
|
||
Node outputLink = parent.getOutputLink(); | ||
// Follow output links to find and record positions of other patterns | ||
while (outputLink != null) { | ||
positionByStringIndexValue.get(outputLink.getPatternInd()).add(i); | ||
outputLink = outputLink.getOutputLink(); | ||
} | ||
|
||
} else { | ||
// If no child node exists for the character, backtrack using suffix links | ||
while (parent != root && !parent.getChild().containsKey(ch)) { | ||
parent = parent.getSuffixLink(); | ||
} | ||
if (parent.getChild().containsKey(ch)) { | ||
i--; // Decrement i to reprocess the same character | ||
} | ||
} | ||
} | ||
setUpStartPoints(positionByStringIndexValue); | ||
return positionByStringIndexValue; | ||
} | ||
// During the scan, positionByStringIndexValue holds end indices. This method converts each | ||
// end index to a start index: start = end - patterns[i].length() + 1 | ||
private void setUpStartPoints(ArrayList<ArrayList<Integer>> positionByStringIndexValue) { | ||
for (int i = 0; i < patterns.length; i++) { | ||
for (int j = 0; j < positionByStringIndexValue.get(i).size(); j++) { | ||
int endpoint = positionByStringIndexValue.get(i).get(j); | ||
positionByStringIndexValue.get(i).set(j, endpoint - patterns[i].length() + 1); | ||
} | ||
} | ||
} | ||
} | ||
|
||
// method to search for patterns in text | ||
public static Map<String, ArrayList<Integer>> search(String text, String[] patterns) { | ||
|
||
Trie trie = new AhoCorasick().new Trie(patterns); | ||
|
||
ArrayList<ArrayList<Integer>> positionByStringIndexValue = trie.searchIn(text); | ||
|
||
return convert(positionByStringIndexValue, patterns); | ||
} | ||
|
||
// method for converting results to a map | ||
private static Map<String, ArrayList<Integer>> convert(ArrayList<ArrayList<Integer>> positionByStringIndexValue, String[] patterns) { | ||
|
||
Map<String, ArrayList<Integer>> positionByString = new HashMap<>(); | ||
for (int i = 0; i < patterns.length; i++) { | ||
String pattern = patterns[i]; | ||
ArrayList<Integer> positions = positionByStringIndexValue.get(i); | ||
positionByString.put(pattern, new ArrayList<>(positions)); | ||
} | ||
return positionByString; | ||
} | ||
} |
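Review note: the suffix-link and output-link construction in `buildSuffixAndOutputLinks` can be easier to follow on a small standalone example. The sketch below is not the PR's code; `LinkSketch`, `State`, `build`, and `outputChain` are hypothetical names introduced only for illustration. It assumes non-empty patterns and builds the same two kinds of links with a BFS, keeping each state's prefix string so the links can be inspected by name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration class; not part of this PR.
class LinkSketch {
    static final class State {
        final Map<Character, State> next = new HashMap<>();
        final String prefix; // the pattern prefix this state spells (kept only for inspection)
        State suffix;        // longest proper suffix of `prefix` that is also a trie state
        State output;        // nearest state on the suffix chain that ends a pattern, or null
        boolean endsPattern;

        State(String prefix) {
            this.prefix = prefix;
        }
    }

    // Builds the trie plus suffix and output links; returns every state keyed by its prefix.
    static Map<String, State> build(String[] patterns) {
        State root = new State("");
        Map<String, State> byPrefix = new HashMap<>();
        byPrefix.put("", root);
        for (String p : patterns) {
            State cur = root;
            for (char c : p.toCharArray()) {
                State parent = cur;
                cur = parent.next.computeIfAbsent(c, k -> new State(parent.prefix + k));
                byPrefix.put(cur.prefix, cur);
            }
            cur.endsPattern = true;
        }
        root.suffix = root;
        Queue<State> queue = new ArrayDeque<>();
        for (State child : root.next.values()) {
            child.suffix = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State s = queue.poll();
            // BFS order guarantees s.suffix (a shallower state) already has its output link.
            s.output = s.suffix.endsPattern ? s.suffix : s.suffix.output;
            for (Map.Entry<Character, State> e : s.next.entrySet()) {
                State f = s.suffix;
                // Walk up the suffix chain until some state has an edge for this character.
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.suffix;
                }
                State target = f.next.get(e.getKey());
                e.getValue().suffix = target != null ? target : root;
                queue.add(e.getValue());
            }
        }
        return byPrefix;
    }

    // Follows output links from the state spelling `prefix` and lists the patterns met on the way.
    static List<String> outputChain(Map<String, State> states, String prefix) {
        List<String> chain = new ArrayList<>();
        for (State s = states.get(prefix).output; s != null; s = s.output) {
            chain.add(s.prefix);
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<String, State> states = build(new String[] {"ACC", "ATC", "CAT", "GCG", "C", "T"});
        System.out.println(states.get("CAT").suffix.prefix);
        System.out.println(outputChain(states, "CAT"));
    }
}
```

With the test patterns from this PR, the state for "CAT" has suffix link "AT" and output chain ["T"]. That is why, on the text "GCATCG", reaching "CAT" at end index 3 also records "T" at end index 3; the end-to-start conversion then gives 3 - 3 + 1 = 1 for "CAT" and 3 - 1 + 1 = 3 for "T", matching the expected values in `testSearch`.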
119 changes: 119 additions & 0 deletions
src/test/java/com/thealgorithms/strings/AhoCorasickTest.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
/* | ||
* Tests For Aho-Corasick String Matching Algorithm | ||
* | ||
* Author: Prabhat-Kumar-42 | ||
* GitHub: https://github.com/Prabhat-Kumar-42 | ||
*/ | ||
|
||
package com.thealgorithms.strings; | ||
|
||
import static org.junit.jupiter.api.Assertions.*; | ||
|
||
|
||
import java.util.ArrayList; | ||
import java.util.Arrays; | ||
import java.util.Map; | ||
import org.junit.jupiter.api.BeforeEach; | ||
import org.junit.jupiter.api.Test; | ||
|
||
/** | ||
* This class contains test cases for the Aho-Corasick String Matching Algorithm. | ||
* The Aho-Corasick algorithm is used to efficiently find all occurrences of multiple | ||
* patterns in a given text. | ||
*/ | ||
class AhoCorasickTest { | ||
private String[] patterns; // The array of patterns to search for | ||
private String text; // The input text to search within | ||
|
||
/** | ||
* This method sets up the test environment before each test case. | ||
* It initializes the patterns and text to be used for testing. | ||
*/ | ||
@BeforeEach | ||
void setUp() { | ||
patterns = new String[] {"ACC", "ATC", "CAT", "GCG", "C", "T"}; | ||
text = "GCATCG"; | ||
} | ||
|
||
/** | ||
* Test searching for multiple patterns in the input text. | ||
* The expected results are defined for each pattern. | ||
*/ | ||
@Test | ||
void testSearch() { | ||
// Define the expected results for each pattern | ||
final var expected = Map.of("ACC", new ArrayList<>(Arrays.asList()), "ATC", new ArrayList<>(Arrays.asList(2)), "CAT", new ArrayList<>(Arrays.asList(1)), "GCG", new ArrayList<>(Arrays.asList()), "C", new ArrayList<>(Arrays.asList(1, 4)), "T", new ArrayList<>(Arrays.asList(3))); | ||
assertEquals(expected, AhoCorasick.search(text, patterns)); | ||
} | ||
|
||
/** | ||
* Test searching with an empty pattern array. | ||
* The result should be an empty map. | ||
*/ | ||
@Test | ||
void testEmptyPatterns() { | ||
// Define an empty pattern array | ||
final var emptyPatterns = new String[] {}; | ||
assertTrue(AhoCorasick.search(text, emptyPatterns).isEmpty()); | ||
} | ||
|
||
/** | ||
* Test searching for patterns that are not present in the input text. | ||
* The result should be an empty list for each pattern. | ||
*/ | ||
@Test | ||
void testPatternNotFound() { | ||
// Define patterns that are not present in the text | ||
final var searchPatterns = new String[] {"XYZ", "123"}; | ||
final var expected = Map.of("XYZ", new ArrayList<Integer>(), "123", new ArrayList<Integer>()); | ||
assertEquals(expected, AhoCorasick.search(text, searchPatterns)); | ||
} | ||
|
||
/** | ||
* Test searching for patterns that start at the beginning of the input text. | ||
* The expected position for each pattern is 0. | ||
*/ | ||
@Test | ||
void testPatternAtBeginning() { | ||
// Define patterns that start at the beginning of the text | ||
final var searchPatterns = new String[] {"GC", "GCA", "GCAT"}; | ||
final var expected = Map.of("GC", new ArrayList<Integer>(Arrays.asList(0)), "GCA", new ArrayList<Integer>(Arrays.asList(0)), "GCAT", new ArrayList<Integer>(Arrays.asList(0))); | ||
assertEquals(expected, AhoCorasick.search(text, searchPatterns)); | ||
} | ||
|
||
/** | ||
* Test searching for patterns that end at the end of the input text. | ||
* The expected positions are 4, 3, and 2 for the patterns. | ||
*/ | ||
@Test | ||
void testPatternAtEnd() { | ||
// Define patterns that end at the end of the text | ||
final var searchPatterns = new String[] {"CG", "TCG", "ATCG"}; | ||
final var expected = Map.of("CG", new ArrayList<Integer>(Arrays.asList(4)), "TCG", new ArrayList<Integer>(Arrays.asList(3)), "ATCG", new ArrayList<Integer>(Arrays.asList(2))); | ||
assertEquals(expected, AhoCorasick.search(text, searchPatterns)); | ||
} | ||
|
||
/** | ||
* Test searching for overlapping patterns in the input text. | ||
* Each pattern occurs exactly once: "AT" starts at index 2 and "T" starts | ||
* at index 3, inside the occurrence of "AT". | ||
*/ | ||
@Test | ||
void testMultipleOccurrencesOfPattern() { | ||
// Define overlapping patterns, each occurring once in the text | ||
final var searchPatterns = new String[] {"AT", "T"}; | ||
final var expected = Map.of("AT", new ArrayList<Integer>(Arrays.asList(2)), "T", new ArrayList<Integer>(Arrays.asList(3))); | ||
assertEquals(expected, AhoCorasick.search(text, searchPatterns)); | ||
} | ||
|
||
/** | ||
* Test that the search is case-sensitive. | ||
* Patterns whose case differs from the text ("gca", "aTc") must not match. | ||
*/ | ||
@Test | ||
void testCaseInsensitiveSearch() { | ||
// Define patterns whose case differs from the text, plus one exact-case pattern | ||
final var searchPatterns = new String[] {"gca", "aTc", "C"}; | ||
final var expected = Map.of("gca", new ArrayList<Integer>(), "aTc", new ArrayList<Integer>(), "C", new ArrayList<Integer>(Arrays.asList(1, 4))); | ||
assertEquals(expected, AhoCorasick.search(text, searchPatterns)); | ||
} | ||
} |
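Review note: the expected positions in these tests are hard-coded, so one way to cross-check them is a brute-force scan that returns results in the same shape (pattern mapped to start indices). The sketch below is an assumption-laden helper and not part of this PR; `NaiveMultiSearch` is a hypothetical name, it relies only on `String.indexOf`, and it skips empty patterns, which the PR's tests do not cover.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cross-checking helper; not part of this PR.
class NaiveMultiSearch {
    // Returns, for every pattern, the start indices of all (possibly overlapping) occurrences.
    static Map<String, List<Integer>> search(String text, String[] patterns) {
        Map<String, List<Integer>> result = new HashMap<>();
        for (String pattern : patterns) {
            List<Integer> starts = new ArrayList<>();
            // Assumption: empty patterns are skipped, since indexOf("") never returns -1.
            if (!pattern.isEmpty()) {
                // indexOf(pattern, from) returns -1 once no further occurrence exists.
                for (int at = text.indexOf(pattern); at >= 0; at = text.indexOf(pattern, at + 1)) {
                    starts.add(at);
                }
            }
            result.put(pattern, starts);
        }
        return result;
    }
}
```

This runs in time proportional to the text length times the number of patterns (times pattern length in the worst case), so it is only suitable as a test oracle. A randomized test could compare its result with `AhoCorasick.search` on the same inputs, provided the patterns are distinct and non-empty.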