{"id":25546,"date":"2017-10-15T20:06:44","date_gmt":"2017-10-15T14:36:44","guid":{"rendered":"https:\/\/www.wikitechy.com\/technology\/?p=25546"},"modified":"2017-10-15T20:06:44","modified_gmt":"2017-10-15T14:36:44","slug":"c-programming-for-aho-corasick-algorithm-pattern-searching","status":"publish","type":"post","link":"https:\/\/www.wikitechy.com\/technology\/c-programming-for-aho-corasick-algorithm-pattern-searching\/","title":{"rendered":"C++ Programming for Aho-Corasick Algorithm for Pattern Searching"},"content":{"rendered":"<p>Given an input text and an array of k words, arr[], find all occurrences of all words in the input text. Let n be the length of text and m be the total number characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + .. + O(n + length(arr[k-1]). Here k is total numbers of input words.<\/p>\n<p><strong>Example:<\/strong><\/p>\n<pre>Input: text = \"ahishers\"    \r\n       arr[] = {\"he\", \"she\", \"hers\", \"his\"}\r\n\r\nOutput:\r\n   Word <strong>his<\/strong> appears from 1 to 3\r\n   Word <strong>he<\/strong> appears from 4 to 5\r\n   Word <strong>she<\/strong> appears from 3 to 5\r\n   Word <strong>hers<\/strong> appears from 4 to 7<\/pre>\n<p>If we use a linear time searching algorithm like <strong>KMP<\/strong>, then we need to one by one search all words in text[]. This gives us total time complexity as O(n + length(word[0]) + O(n + length(word[1]) + O(n + length(word[2]) + \u2026 O(n + length(word[k-1]). This time complexity can be written as <em><strong>O(n*k + m)<\/strong><\/em>.<br \/>\n<strong>Aho-Corasick Algorithm<\/strong> finds all words in<em><strong> O(n + m + z)<\/strong><\/em> time where <strong>z <\/strong>is total number of occurrences of words in text. The Aho\u2013Corasick string matching algorithm formed the basis of the original Unix command fgrep.<\/p>\n<ol>\n<li><strong>Prepocessing : <\/strong>Build an automaton of all words in arr[] The automaton has mainly three functions:\n<ol>\n<li>\n<pre>Go To :   This function simply follows edges\r\n          of Trie of all words in arr[]. It is\r\n          represented as 2D array <strong>g[][] <\/strong>where\r\n          we store next state for current state \r\n          and character.\r\n\r\nFailure : This function stores all edges that are\r\n          followed when current character doesn't\r\n          have edge in Trie.  It is represented as\r\n          1D array <strong>f[]<\/strong> where we store next state for\r\n          current state. \r\n\r\nOutput :  Stores indexes of all words that end at \r\n          current state. It is represented as 1D \r\n          array <strong>o[]<\/strong> where we store indexes\r\n          of all matching words as a bitmap for \r\n          current state.\r\n<\/pre>\n<\/li>\n<li><strong>Matching :<\/strong> Traverse the given text over built automaton to find all matching words.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n[ad type=&#8221;banner&#8221;]\n<ol>\n<li><strong>Preprocessing:<\/strong>\n<ol>\n<li>We first Build a <a href=\"http:\/\/www.geeksforgeeks.org\/trie-insert-and-search\/\" target=\"_blank\" rel=\"noopener\">Trie<\/a> (or Keyword Tree) of all words.<img fetchpriority=\"high\" decoding=\"async\" class=\"aligncenter size-full wp-image-25554\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2017\/05\/aho-corasick1.png\" alt=\"C++ Programming for Aho-Corasick Algorithm for Pattern Searching\" width=\"530\" height=\"306\" srcset=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2017\/05\/aho-corasick1.png 530w, https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2017\/05\/aho-corasick1-300x173.png 300w\" sizes=\"(max-width: 530px) 100vw, 530px\" \/><\/li>\n<li>Next we extend Trie into an automaton to support linear time matching.<img decoding=\"async\" class=\"aligncenter size-full wp-image-25556\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2017\/05\/aho-corasick2-660x379.png\" alt=\"Next we extend Trie into an automaton to support linear time matching.\" width=\"660\" height=\"379\" srcset=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2017\/05\/aho-corasick2-660x379.png 660w, https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2017\/05\/aho-corasick2-660x379-300x172.png 300w\" sizes=\"(max-width: 660px) 100vw, 660px\" \/><\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>Go to :<br \/>\nWe build Trie. And for all characters which don\u2019t have an edge at root, we add an edge back to root.<\/p>\n<p>Failure :<br \/>\nFor a state s, we find the longest proper suffix which is a proper prefix of some pattern. This is done using Breadth First Traversal of Trie.<\/p>\n<p>Output :<br \/>\nFor a state s, indexes of all words ending at s are stored. These indexes are stored as bitwise map (by doing bitwise OR of values). This is also computing using Breadth First Traversal with Failure.<\/p>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <span class=\"code-embed-name\">C++<\/span> <\/div> <pre class=\"language-cpp code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-cpp code-embed-code\">\/\/ C++ program for implementation of Aho Corasick algorithm<br\/>\/\/ for string matching<br\/>using namespace std;<br\/>#include &lt;bits\/stdc++.h&gt;<br\/> <br\/>\/\/ Max number of states in the matching machine.<br\/>\/\/ Should be equal to the sum of the length of all keywords.<br\/>const int MAXS = 500;<br\/> <br\/>\/\/ Maximum number of characters in input alphabet<br\/>const int MAXC = 26;<br\/> <br\/>\/\/ OUTPUT FUNCTION IS IMPLEMENTED USING out[]<br\/>\/\/ Bit i in this mask is one if the word with index i<br\/>\/\/ appears when the machine enters this state.<br\/>int out[MAXS];<br\/> <br\/>\/\/ FAILURE FUNCTION IS IMPLEMENTED USING f[]<br\/>int f[MAXS];<br\/> <br\/>\/\/ GOTO FUNCTION (OR TRIE) IS IMPLEMENTED USING g[][]<br\/>int g[MAXS][MAXC];<br\/> <br\/>\/\/ Builds the string matching machine.<br\/>\/\/ arr -   array of words. The index of each keyword is important:<br\/>\/\/         &quot;out[state] &amp; (1 &lt;&lt; i)&quot; is &gt; 0 if we just found word[i]<br\/>\/\/         in the text.<br\/>\/\/ Returns the number of states that the built machine has.<br\/>\/\/ States are numbered 0 up to the return value - 1, inclusive.<br\/>int buildMatchingMachine(string arr[], int k)<br\/>{<br\/>    \/\/ Initialize all values in output function as 0.<br\/>    memset(out, 0, sizeof out);<br\/> <br\/>    \/\/ Initialize all values in goto function as -1.<br\/>    memset(g, -1, sizeof g);<br\/> <br\/>    \/\/ Initially, we just have the 0 state<br\/>    int states = 1;<br\/> <br\/>    \/\/ Construct values for goto function, i.e., fill g[][]<br\/>    \/\/ This is same as building a Trie for arr[]<br\/>    for (int i = 0; i &lt; k; ++i)<br\/>    {<br\/>        const string &amp;word = arr[i];<br\/>        int currentState = 0;<br\/> <br\/>        \/\/ Insert all characters of current word in arr[]<br\/>        for (int j = 0; j &lt; word.size(); ++j)<br\/>        {<br\/>            int ch = word[j] - &#039;a&#039;;<br\/> <br\/>            \/\/ Allocate a new node (create a new state) if a<br\/>            \/\/ node for ch doesn&#039;t exist.<br\/>            if (g[currentState][ch] == -1)<br\/>                g[currentState][ch] = states++;<br\/> <br\/>            currentState = g[currentState][ch];<br\/>        }<br\/> <br\/>        \/\/ Add current word in output function<br\/>        out[currentState] |= (1 &lt;&lt; i);<br\/>    }<br\/> <br\/>    \/\/ For all characters which don&#039;t have an edge from<br\/>    \/\/ root (or state 0) in Trie, add a goto edge to state<br\/>    \/\/ 0 itself<br\/>    for (int ch = 0; ch &lt; MAXC; ++ch)<br\/>        if (g[0][ch] == -1)<br\/>            g[0][ch] = 0;<br\/> <br\/>    \/\/ Now, let&#039;s build the failure function<br\/> <br\/>    \/\/ Initialize values in fail function<br\/>    memset(f, -1, sizeof f);<br\/> <br\/>    \/\/ Failure function is computed in breadth first order<br\/>    \/\/ using a queue<br\/>    queue&lt;int&gt; q;<br\/> <br\/>     \/\/ Iterate over every possible input<br\/>    for (int ch = 0; ch &lt; MAXC; ++ch)<br\/>    {<br\/>        \/\/ All nodes of depth 1 have failure function value<br\/>        \/\/ as 0. For example, in above diagram we move to 0<br\/>        \/\/ from states 1 and 3.<br\/>        if (g[0][ch] != 0)<br\/>        {<br\/>            f[g[0][ch]] = 0;<br\/>            q.push(g[0][ch]);<br\/>        }<br\/>    }<br\/> <br\/>    \/\/ Now queue has states 1 and 3<br\/>    while (q.size())<br\/>    {<br\/>        \/\/ Remove the front state from queue<br\/>        int state = q.front();<br\/>        q.pop();<br\/> <br\/>        \/\/ For the removed state, find failure function for<br\/>        \/\/ all those characters for which goto function is<br\/>        \/\/ not defined.<br\/>        for (int ch = 0; ch &lt;= MAXC; ++ch)<br\/>        {<br\/>            \/\/ If goto function is defined for character &#039;ch&#039;<br\/>            \/\/ and &#039;state&#039;<br\/>            if (g[state][ch] != -1)<br\/>            {<br\/>                \/\/ Find failure state of removed state<br\/>                int failure = f[state];<br\/> <br\/>                \/\/ Find the deepest node labeled by proper<br\/>                \/\/ suffix of string from root to current<br\/>                \/\/ state.<br\/>                while (g[failure][ch] == -1)<br\/>                      failure = f[failure];<br\/> <br\/>                failure = g[failure][ch];<br\/>                f[g[state][ch]] = failure;<br\/> <br\/>                \/\/ Merge output values<br\/>                out[g[state][ch]] |= out[failure];<br\/> <br\/>                \/\/ Insert the next level node (of Trie) in Queue<br\/>                q.push(g[state][ch]);<br\/>            }<br\/>        }<br\/>    }<br\/> <br\/>    return states;<br\/>}<br\/> <br\/>\/\/ Returns the next state the machine will transition to using goto<br\/>\/\/ and failure functions.<br\/>\/\/ currentState - The current state of the machine. Must be between<br\/>\/\/                0 and the number of states - 1, inclusive.<br\/>\/\/ nextInput - The next character that enters into the machine.<br\/>int findNextState(int currentState, char nextInput)<br\/>{<br\/>    int answer = currentState;<br\/>    int ch = nextInput - &#039;a&#039;;<br\/> <br\/>    \/\/ If goto is not defined, use failure function<br\/>    while (g[answer][ch] == -1)<br\/>        answer = f[answer];<br\/> <br\/>    return g[answer][ch];<br\/>}<br\/> <br\/>\/\/ This function finds all occurrences of all array words<br\/>\/\/ in text.<br\/>void searchWords(string arr[], int k, string text)<br\/>{<br\/>    \/\/ Preprocess patterns.<br\/>    \/\/ Build machine with goto, failure and output functions<br\/>    buildMatchingMachine(arr, k);<br\/> <br\/>    \/\/ Initialize current state<br\/>    int currentState = 0;<br\/> <br\/>    \/\/ Traverse the text through the nuilt machine to find<br\/>    \/\/ all occurrences of words in arr[]<br\/>    for (int i = 0; i &lt; text.size(); ++i)<br\/>    {<br\/>        currentState = findNextState(currentState, text[i]);<br\/> <br\/>        \/\/ If match not found, move to next state<br\/>        if (out[currentState] == 0)<br\/>             continue;<br\/> <br\/>        \/\/ Match found, print all matching words of arr[]<br\/>        \/\/ using output function.<br\/>        for (int j = 0; j &lt; k; ++j)<br\/>        {<br\/>            if (out[currentState] &amp; (1 &lt;&lt; j))<br\/>            {<br\/>                cout &lt;&lt; &quot;Word &quot; &lt;&lt; arr[j] &lt;&lt; &quot; appears from &quot;<br\/>                     &lt;&lt; i - arr[j].size() + 1 &lt;&lt; &quot; to &quot; &lt;&lt; i &lt;&lt; endl;<br\/>            }<br\/>        }<br\/>    }<br\/>}<br\/> <br\/>\/\/ Driver program to test above<br\/>int main()<br\/>{<br\/>    string arr[] = {&quot;he&quot;, &quot;she&quot;, &quot;hers&quot;, &quot;his&quot;};<br\/>    string text = &quot;ahishers&quot;;<br\/>    int k = sizeof(arr)\/sizeof(arr[0]);<br\/> <br\/>    searchWords(arr, k, text);<br\/> <br\/>    return 0;<br\/>}<\/code><\/pre> <\/div>\n<p>Output:<\/p>\n<pre>Word his appears from 1 to 3\r\nWord he appears from 4 to 5\r\nWord she appears from 3 to 5\r\nWord hers appears from 4 to 7<\/pre>\n[ad type=&#8221;banner&#8221;]\n","protected":false},"excerpt":{"rendered":"<p>C++ Programming for Aho-Corasick Algorithm for Pattern Searching &#8211; Searching and sorting &#8211; Aho-Corasick Algorithm finds all words in O(n + m + z) time.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[69969,83515,1,71670],"tags":[73836,73838,72306,73832,73273,73822,73322,73827,73851,73843,73848,73840,73835,73826,73315,73844,73852,73846,73830,73222,73833,73834,73850,73824,73855,73192,73270,73600,73274,73829,73847,73823,73858,70097,73841,73188],"class_list":["post-25546","post","type-post","status-publish","format-standard","hentry","category-algorithm","category-c-programming-3","category-coding","category-searching-and-sorting","tag-aho-streaming","tag-algorithm-animation-visualization","tag-algorithm-visualization","tag-bca-dictionary","tag-best-pattern-matching-algorithm","tag-bfs-geeksforgeeks","tag-boyer-moore-pattern-matching-algorithm-example","tag-cashew","tag-david-barber","tag-david-machine","tag-deep-learning-time-series","tag-deep-packet-inspection-open-source","tag-dictionary-match","tag-hotmail","tag-implementation-of-algorithm-in-c","tag-intrusion-detection-and-prevention-systems","tag-javascript-search-string","tag-julia-machine-learning","tag-keyword-matching-algorithm","tag-kmp-string-matching-algorithm","tag-link-dictionary","tag-machine-learning-string-matching","tag-machine-learning-time-series","tag-match-dictionary","tag-matching-engine","tag-naive-string-matching-algorithm","tag-pattern-matching-algorithm","tag-pattern-matching-algorithm-for-intrusion-detection-system","tag-pattern-matching-algorithm-ppt","tag-pdf-splitter-offline","tag-phonetic-search","tag-search-animation","tag-search-string-in-javascript","tag-searching-algorithms-in-c","tag-selenium-documentation-pdf","tag-string-match"],"_links":{"self":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/posts\/25546","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/comments?post=25546"}],"version-history":[{"count":0,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/posts\/25546\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/media?parent=25546"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/categories?post=25546"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/tags?post=25546"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}