Add AhoCorasick #4465

Merged 45 commits on Oct 8, 2023
Changes from 37 commits
Commits
8bacb96
Added code to find Articulation Points and Bridges
Prabhat-Kumar-42 Sep 30, 2023
478bf04
tried to solve clang-formant test
Prabhat-Kumar-42 Sep 30, 2023
8e88928
removed new line at EOF to get lint to pass
Prabhat-Kumar-42 Sep 30, 2023
db5d447
feature: Added Ahocorasick Algorithm
Prabhat-Kumar-42 Sep 30, 2023
2cc9e3f
fixed lint using clang-format
Prabhat-Kumar-42 Sep 30, 2023
d2acaf6
removed datastructures/graphs/ArticulationPointsAndBridge.java from t…
Prabhat-Kumar-42 Oct 1, 2023
37c92ad
removed main, since test-file is added. Also modified and renamed few…
Prabhat-Kumar-42 Oct 1, 2023
efd912d
Added test-file for AhoCorasick Algorithm
Prabhat-Kumar-42 Oct 1, 2023
194a37d
Modified some comments in test-file
Prabhat-Kumar-42 Oct 1, 2023
288168c
Modified some comments in AhoCorasick.java
Prabhat-Kumar-42 Oct 1, 2023
6c4e2c2
lint fix
Prabhat-Kumar-42 Oct 1, 2023
ab22511
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 1, 2023
af71647
added few more test cases
Prabhat-Kumar-42 Oct 1, 2023
96f6231
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 Oct 1, 2023
59dfa0e
Modified some comments
Prabhat-Kumar-42 Oct 1, 2023
b135163
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 1, 2023
6a75149
Change all class fields to private, added initializeSuffixLinksForChi…
Prabhat-Kumar-42 Oct 2, 2023
dd50c78
Added Missing Test-Cases and more
Prabhat-Kumar-42 Oct 2, 2023
f464bf8
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 Oct 2, 2023
23b63cb
minor text changes
Prabhat-Kumar-42 Oct 2, 2023
42b21da
added direct test check i.e. defining a variable expected and just ch…
Prabhat-Kumar-42 Oct 2, 2023
106dfac
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 5, 2023
704b8e1
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 5, 2023
a28df8c
Created New Class Trie, merged 'buildTrie and buildSuffixAndOutputLin…
Prabhat-Kumar-42 Oct 6, 2023
b7cc61a
Updated TestFile according to the updated AhoCorasick Class. Added Fe…
Prabhat-Kumar-42 Oct 6, 2023
a164aab
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 Oct 6, 2023
f5defac
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 7, 2023
a1cbda6
updated - broken down constructor to relavent parts, made string fina…
Prabhat-Kumar-42 Oct 7, 2023
f60392b
lint fix clang
Prabhat-Kumar-42 Oct 7, 2023
62dfb86
Updated Tests Files
Prabhat-Kumar-42 Oct 7, 2023
1528cbf
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 Oct 7, 2023
8b9a831
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 7, 2023
69f3aa8
Added final field to Node class setters and Trie Constructor argument…
Prabhat-Kumar-42 Oct 8, 2023
fc9ca11
updated test file
Prabhat-Kumar-42 Oct 8, 2023
846bbda
lint fix clang
Prabhat-Kumar-42 Oct 8, 2023
82586b4
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 Oct 8, 2023
6f81843
minor chage - 'removed a comment'
Prabhat-Kumar-42 Oct 8, 2023
4fa8df1
added final fields to some arguments, class and variables, added a me…
Prabhat-Kumar-42 Oct 8, 2023
4a53103
updated to remove * inclusion and added the required modules only
Prabhat-Kumar-42 Oct 8, 2023
653db0b
Implemented a new class PatternPositionRecorder to wrap up the positi…
Prabhat-Kumar-42 Oct 8, 2023
a2ec697
Merge branch 'master' into AhoCorasick
Prabhat-Kumar-42 Oct 8, 2023
b409f7a
Added final fields to PatternPositionRecorder Class
Prabhat-Kumar-42 Oct 8, 2023
eb1c369
Merge branch 'AhoCorasick' of https://github.com/Prabhat-Kumar-42/Jav…
Prabhat-Kumar-42 Oct 8, 2023
b38f067
style: mark default constructor of `AhoCorasick` as `private`
vil02 Oct 8, 2023
c7c743b
style: remoce redundant `public`
vil02 Oct 8, 2023
227 changes: 227 additions & 0 deletions src/main/java/com/thealgorithms/strings/AhoCorasick.java
@@ -0,0 +1,227 @@
/*
* Aho-Corasick String Matching Algorithm Implementation
*
* This code implements the Aho-Corasick algorithm, which is used for efficient
* string matching in a given text. It can find multiple patterns simultaneously
* and records their positions in the text.
*
* Author: Prabhat-Kumar-42
* GitHub: https://github.com/Prabhat-Kumar-42
*/

package com.thealgorithms.strings;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class AhoCorasick {

// Trie Node Class
public class Node {
// Child nodes of the current node, keyed by the character on the connecting edge
private HashMap<Character, Node> child = new HashMap<>();
private Node suffixLink; // Suffix link to another node in the trie
private Node outputLink; // Output link to another node in the trie
private int patternInd; // Index of the pattern that ends at this node

Node() {
this.suffixLink = null;
this.outputLink = null;
this.patternInd = -1;
}

public HashMap<Character, Node> getChild() {
return child;
}

public Node getSuffixLink() {
return suffixLink;
}

public void setSuffixLink(final Node suffixLink) {
this.suffixLink = suffixLink;
}

public Node getOutputLink() {
return outputLink;
}

public void setOutputLink(final Node outputLink) {
this.outputLink = outputLink;
}

public int getPatternInd() {
return patternInd;
}

public void setPatternInd(final int patternInd) {
this.patternInd = patternInd;
}
}

// Trie Class
public class Trie {

private Node root = null; // Root node of the trie
private final String[] patterns; // patterns according to which Trie is constructed

public Trie(final String[] patterns) {
root = new Node(); // Initialize the root of the trie
this.patterns = patterns;
buildTrie();
buildSuffixAndOutputLinks();
}

// Builds the Aho-Corasick trie from the patterns
private void buildTrie() {

// Loop through each input pattern and add it to the trie
for (int i = 0; i < patterns.length; i++) {
Node curr = root; // Start at the root of the trie for each pattern

// Loop through each character in the current pattern
for (int j = 0; j < patterns[i].length(); j++) {
char c = patterns[i].charAt(j); // Get the current character

// Check if the current node has a child for the current character
if (curr.getChild().containsKey(c)) {
curr = curr.getChild().get(c); // Update the current node to the child node
} else {
// If no child node exists, create a new one and add it to the current node's children
Node nn = new Node();
curr.getChild().put(c, nn);
curr = nn; // Update the current node to the new child node
}
}
curr.setPatternInd(i); // Store the pattern index in the node where the pattern ends (not necessarily a leaf)
}
}

private void initializeSuffixLinksForChildNodesOfTheRoot(Queue<Node> q) {
for (char rc : root.getChild().keySet()) {
Node childNode = root.getChild().get(rc);
q.add(childNode); // Add child node to the queue
childNode.setSuffixLink(root); // Set suffix link to the root
}
}

private void buildSuffixAndOutputLinks() {
root.setSuffixLink(root); // Initialize the suffix link of the root to itself
Queue<Node> q = new LinkedList<>(); // Initialize a queue for BFS traversal

initializeSuffixLinksForChildNodesOfTheRoot(q);

while (!q.isEmpty()) {
Node currentState = q.poll(); // Get the current node for processing

// Iterate through child nodes of the current node
for (char cc : currentState.getChild().keySet()) {
Node currentChild = currentState.getChild().get(cc); // Get the child node
Node parentSuffix = currentState.getSuffixLink(); // Get the parent's suffix link

// Calculate the suffix link for the child based on the parent's suffix link
while (!parentSuffix.getChild().containsKey(cc) && parentSuffix != root) {
parentSuffix = parentSuffix.getSuffixLink();
}

// Set the calculated suffix link or default to root
if (parentSuffix.getChild().containsKey(cc)) {
currentChild.setSuffixLink(parentSuffix.getChild().get(cc));
} else {
currentChild.setSuffixLink(root);
}

q.add(currentChild); // Add the child node to the queue for further processing
}

// Establish output links for nodes to efficiently identify patterns within patterns
if (currentState.getSuffixLink().getPatternInd() >= 0) {
currentState.setOutputLink(currentState.getSuffixLink());
} else {
currentState.setOutputLink(currentState.getSuffixLink().getOutputLink());
}
}
}

// Searches for patterns in the input text and records their positions
public ArrayList<ArrayList<Integer>> searchIn(String text) {
/*
* positionByStringIndexValue is a list of lists, indexed by pattern:
* list[i] holds the positions at which patterns[i] occurs in the text.
* e.g. list[0] holds the positions of patterns[0],
* list[1] holds the positions of patterns[1],
* ......
* list[n - 1] holds the positions of patterns[n - 1], where n = patterns.length.
* During the scan these are end indexes; setUpStartPoints converts them to
* start indexes before the list is returned.
*/
ArrayList<ArrayList<Integer>> positionByStringIndexValue = new ArrayList<>(patterns.length); // Stores positions where patterns are found in the text
Node parent = root; // Start searching from the root node

for (int i = 0; i < patterns.length; i++) {
positionByStringIndexValue.add(new ArrayList<Integer>()); // Initialize a list to store positions of the current pattern
}

for (int i = 0; i < text.length(); i++) {
char ch = text.charAt(i); // Get the current character in the text

// Check if the current node has a child for the current character
if (parent.getChild().containsKey(ch)) {
parent = parent.getChild().get(ch); // Update the current node to the child node

// If the current node represents a pattern, record its position in positionByStringIndexValue
if (parent.getPatternInd() > -1) {
positionByStringIndexValue.get(parent.getPatternInd()).add(i);
}

Node outputLink = parent.getOutputLink();
// Follow output links to find and record positions of other patterns
while (outputLink != null) {
positionByStringIndexValue.get(outputLink.getPatternInd()).add(i);
outputLink = outputLink.getOutputLink();
}
} else {
// If no child node exists for the character, backtrack using suffix links
while (parent != root && !parent.getChild().containsKey(ch)) {
parent = parent.getSuffixLink();
}
if (parent.getChild().containsKey(ch)) {
i--; // Decrement i to reprocess the same character
}
}
}
setUpStartPoints(positionByStringIndexValue);
return positionByStringIndexValue;
}
// After the scan, positionByStringIndexValue holds end indexes. This method converts each
// end index to a start index: start = end - pattern length + 1
private void setUpStartPoints(ArrayList<ArrayList<Integer>> positionByStringIndexValue) {
for (int i = 0; i < patterns.length; i++) {
for (int j = 0; j < positionByStringIndexValue.get(i).size(); j++) {
int endpoint = positionByStringIndexValue.get(i).get(j);
positionByStringIndexValue.get(i).set(j, endpoint - patterns[i].length() + 1);
}
}
}
}

// method to search for patterns in text
public static Map<String, ArrayList<Integer>> search(String text, String[] patterns) {
Trie trie = new AhoCorasick().new Trie(patterns);
ArrayList<ArrayList<Integer>> positionByStringIndexValue = trie.searchIn(text);
return convert(positionByStringIndexValue, patterns);
}

// method for converting results to a map
private static Map<String, ArrayList<Integer>> convert(ArrayList<ArrayList<Integer>> positionByStringIndexValue, String[] patterns) {
Map<String, ArrayList<Integer>> positionByString = new HashMap<>();
for (int i = 0; i < patterns.length; i++) {
String pattern = patterns[i];
ArrayList<Integer> positions = positionByStringIndexValue.get(i);
positionByString.put(pattern, new ArrayList<>(positions));
}
return positionByString;
}
}
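For readers who want the core idea in one place, here is a condensed sketch of the same technique (trie, suffix links built by BFS, output links, then a single pass over the text). It is not the code from this PR: the class name `AhoCorasickSketch`, the integer state ids, and the `List<Integer>` result type are all assumptions made for brevity, and it assumes the patterns are non-empty and distinct.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical condensed sketch; names are illustrative and not part of the PR.
final class AhoCorasickSketch {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // trie edges per state
    private final List<Integer> fail = new ArrayList<>(); // suffix link per state
    private final List<Integer> output = new ArrayList<>(); // nearest pattern-ending state via suffix links, or -1
    private final List<Integer> patternAt = new ArrayList<>(); // index of the pattern ending here, or -1
    private final String[] patterns;

    AhoCorasickSketch(String[] patterns) {
        this.patterns = patterns;
        newState(); // state 0 is the root
        for (int p = 0; p < patterns.length; p++) {
            int state = 0;
            for (char c : patterns[p].toCharArray()) {
                Integer to = next.get(state).get(c);
                if (to == null) {
                    to = newState();
                    next.get(state).put(c, to);
                }
                state = to;
            }
            patternAt.set(state, p);
        }
        buildLinks();
    }

    private int newState() {
        next.add(new HashMap<>());
        fail.add(0);
        output.add(-1);
        patternAt.add(-1);
        return next.size() - 1;
    }

    // BFS: shallower states are finished before deeper ones read their links.
    private void buildLinks() {
        Queue<Integer> queue = new ArrayDeque<>(next.get(0).values()); // depth-1 states keep fail = root
        while (!queue.isEmpty()) {
            int state = queue.poll();
            int f = fail.get(state);
            output.set(state, patternAt.get(f) >= 0 ? f : output.get(f));
            for (Map.Entry<Character, Integer> e : next.get(state).entrySet()) {
                fail.set(e.getValue(), step(f, e.getKey()));
                queue.add(e.getValue());
            }
        }
    }

    // Follow suffix links from `state` until an edge on `c` exists; fall back to the root.
    private int step(int state, char c) {
        while (state != 0 && !next.get(state).containsKey(c)) {
            state = fail.get(state);
        }
        return next.get(state).getOrDefault(c, 0);
    }

    // Returns, for each pattern, the start indexes of its occurrences in `text`.
    Map<String, List<Integer>> search(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        for (String p : patterns) {
            result.put(p, new ArrayList<>());
        }
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = step(state, text.charAt(i));
            // Visit this state if a pattern ends here, then every state on the output chain.
            for (int s = patternAt.get(state) >= 0 ? state : output.get(state); s != -1; s = output.get(s)) {
                String p = patterns[patternAt.get(s)];
                result.get(p).add(i - p.length() + 1); // end index -> start index
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] patterns = {"ACC", "ATC", "CAT", "GCG", "C", "T"};
        System.out.println(new AhoCorasickSketch(patterns).search("GCATCG"));
    }
}
```

The sketch converts end indexes to start indexes inline rather than in a separate pass, which is the main structural difference from `searchIn` and `setUpStartPoints` above.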
119 changes: 119 additions & 0 deletions src/test/java/com/thealgorithms/strings/AhoCorasickTest.java
@@ -0,0 +1,119 @@
/*
* Tests For Aho-Corasick String Matching Algorithm
*
* Author: Prabhat-Kumar-42
* GitHub: https://github.com/Prabhat-Kumar-42
*/

package com.thealgorithms.strings;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Map;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

/**
* This class contains test cases for the Aho-Corasick String Matching Algorithm.
* The Aho-Corasick algorithm is used to efficiently find all occurrences of multiple
* patterns in a given text.
*/
class AhoCorasickTest {
private String[] patterns; // The array of patterns to search for
private String text; // The input text to search within

/**
* This method sets up the test environment before each test case.
* It initializes the patterns and text to be used for testing.
*/
@BeforeEach
void setUp() {
patterns = new String[] {"ACC", "ATC", "CAT", "GCG", "C", "T"};
text = "GCATCG";
}

/**
* Test searching for multiple patterns in the input text.
* The expected results are defined for each pattern.
*/
@Test
void testSearch() {
// Define the expected results for each pattern
final var expected = Map.of("ACC", new ArrayList<>(Arrays.asList()), "ATC", new ArrayList<>(Arrays.asList(2)), "CAT", new ArrayList<>(Arrays.asList(1)), "GCG", new ArrayList<>(Arrays.asList()), "C", new ArrayList<>(Arrays.asList(1, 4)), "T", new ArrayList<>(Arrays.asList(3)));
assertEquals(expected, AhoCorasick.search(text, patterns));
}

/**
* Test searching with an empty pattern array.
* The result should be an empty map.
*/
@Test
void testEmptyPatterns() {
// Define an empty pattern array
final var emptyPatterns = new String[] {};
assertTrue(AhoCorasick.search(text, emptyPatterns).isEmpty());
}

/**
* Test searching for patterns that are not present in the input text.
* The result should be an empty list for each pattern.
*/
@Test
void testPatternNotFound() {
// Define patterns that are not present in the text
final var searchPatterns = new String[] {"XYZ", "123"};
final var expected = Map.of("XYZ", new ArrayList<Integer>(), "123", new ArrayList<Integer>());
assertEquals(expected, AhoCorasick.search(text, searchPatterns));
}

/**
* Test searching for patterns that start at the beginning of the input text.
* The expected position for each pattern is 0.
*/
@Test
void testPatternAtBeginning() {
// Define patterns that start at the beginning of the text
final var searchPatterns = new String[] {"GC", "GCA", "GCAT"};
final var expected = Map.of("GC", new ArrayList<Integer>(Arrays.asList(0)), "GCA", new ArrayList<Integer>(Arrays.asList(0)), "GCAT", new ArrayList<Integer>(Arrays.asList(0)));
assertEquals(expected, AhoCorasick.search(text, searchPatterns));
}

/**
* Test searching for patterns that end at the end of the input text.
* The expected positions are 4, 3, and 2 for the patterns.
*/
@Test
void testPatternAtEnd() {
// Define patterns that end at the end of the text
final var searchPatterns = new String[] {"CG", "TCG", "ATCG"};
final var expected = Map.of("CG", new ArrayList<Integer>(Arrays.asList(4)), "TCG", new ArrayList<Integer>(Arrays.asList(3)), "ATCG", new ArrayList<Integer>(Arrays.asList(2)));
assertEquals(expected, AhoCorasick.search(text, searchPatterns));
}

/**
* Test searching for overlapping patterns that each occur once, where one
* pattern ("T") is a suffix of the other ("AT").
* The expected positions are 2 for "AT" and 3 for "T".
*/
@Test
void testMultipleOccurrencesOfPattern() {
// Define overlapping patterns; each occurs exactly once in the text
final var searchPatterns = new String[] {"AT", "T"};
final var expected = Map.of("AT", new ArrayList<Integer>(Arrays.asList(2)), "T", new ArrayList<Integer>(Arrays.asList(3)));
assertEquals(expected, AhoCorasick.search(text, searchPatterns));
}

/**
* Test that the search is case-sensitive.
* Patterns whose case differs from the text ("gca", "aTc") yield no matches,
* while "C" still matches at positions 1 and 4.
*/
@Test
void testCaseInsensitiveSearch() {
// Define patterns whose case differs from the text, plus one exact-case pattern
final var searchPatterns = new String[] {"gca", "aTc", "C"};
final var expected = Map.of("gca", new ArrayList<Integer>(), "aTc", new ArrayList<Integer>(), "C", new ArrayList<Integer>(Arrays.asList(1, 4)));
assertEquals(expected, AhoCorasick.search(text, searchPatterns));
}
}
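The expected maps in these tests are written out by hand. One way to cross-check them is a brute-force oracle built on `String.indexOf`. The sketch below is deliberately not Aho-Corasick, just a naive reference that follows the same contract (each pattern mapped to the start indexes of all its occurrences, overlaps included). `NaiveMultiSearch` is a hypothetical helper and is not part of the PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical brute-force reference; scans the text once per pattern.
final class NaiveMultiSearch {
    private NaiveMultiSearch() {
    }

    static Map<String, List<Integer>> search(String text, String[] patterns) {
        Map<String, List<Integer>> result = new HashMap<>();
        for (String pattern : patterns) {
            List<Integer> starts = new ArrayList<>();
            if (!pattern.isEmpty()) { // assumption: empty patterns are reported as "no matches"
                // Advance by one after each hit so overlapping occurrences are found too.
                for (int from = text.indexOf(pattern); from >= 0; from = text.indexOf(pattern, from + 1)) {
                    starts.add(from);
                }
            }
            result.put(pattern, starts);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(search("GCATCG", new String[] {"AT", "T", "C"}));
    }
}
```

Because `List.equals` compares contents rather than concrete list types, a result from this oracle can be compared directly with the `Map<String, ArrayList<Integer>>` returned by `AhoCorasick.search`, for example in a test over randomly generated texts.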