The fastest C++ algorithm for string testing against a list of predefined seeds (case insensitive)

C++StringWindowsAlgorithm

C++ Problem Overview


I have list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.

std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};

My app constantly receives a lot of strings which can contain any characters. I need check each received line and decide whether it contain any of seeds or not. Comparison must be case insensitive.

I need the fastest possible algorithm to test received string.

Right now my app uses this algo:

std::wstring testedStr;
for (auto & seed : seeds)
{
    if (boost::icontains(testedStr, seed))
    {
        return true;
    }
}
return false;

It works well, but I'm not sure that this is the most efficient way.

How is it possible to implement the algorithm in order to achieve better performance?

This is a Windows app. App receives valid std::wstring strings.


Update

For this task I implemented Aho-Corasick algo. If someone could review my code it would be great - I do not have big experience with such algorithms. Link to implementation: gist.github.com

C++ Solutions


Solution 1 - C++

If there are a finite amount of matching strings, this means that you can construct a tree such that, read from root to leaves, similar strings will occupy similar branches.

This is also known as a trie, or Radix Tree.

For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:

enter image description here

A search for one of the words in the tree will search the tree, starting from a root. Making it to a leaf would correspond to a match to a seed. Regardless, each character in the string should match to one of their children. If it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").

There are various algorithms for constructing the tree as well as searching it as well as modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one since I don't know of the best algorithm for it.

Conceptually, an algorithm you might use to search the tree would be related to the idea behind radix sort of a fixed amount of categories or values that a character in a string might take on at a given point in time.

This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.

Edit: As other answers have mentioned, the Aho-Corasick algorithm for string matching is a sophisticated algorithm for performing string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and having a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a contributor to the the popular compiler textbook, Compilers: Principles, Techniques, and Tools as well as the algorithms textbook, The Design And Analysis Of Computer Algorithms. He is also a former member of Bell Labs. Margaret J. Corasick does not seem to have too much public information on herself.)

Solution 2 - C++

You can use Aho–Corasick algorithm

It builds trie/automaton where some vertices marked as terminal which would mean string has seeds.

It's built in O(sum of dictionary word lengths) and gives the answer in O(test string length)

Advantages:

  • It's specifically works with several dictionary words and check time doesn't depend on number of words (If we not consider cases where it doesn't fit to memory etc)
  • The algorithm is not hard to implement (comparing to suffix structures at least)

You may make it case insensitive by lowering each symbol if it's ASCII (non ASCII chars don't match anyway)

Solution 3 - C++

You should try a pre-existing regex utility, it may be slower than your hand-rolled algorithm but regex is about matching multiple possibilities, so it is likely it will be already several times faster than a hashmap or a simple comparison to all strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so basically you will have at your disposal a well tested and fast implementation.

If you have C++11 you already have a standard regex library

#include <string>
#include <regex>

int main(){
     std::regex self_regex("google|yahoo|stackoverflow");
     regex_match(input_string ,self_regex);
}

I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)

Solution 4 - C++

One of the faster ways is to use suffix tree https://en.wikipedia.org/wiki/Suffix_tree, but this approach has huge disadvantage - it is difficult data structure with difficult constructing. This algorithm allows to build tree from string in linear complexity https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm

Solution 5 - C++

Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.

Especially with a high number of possible candidates and knowing them at compile time using a perfect hash function with a tool like gperf is worth a try. The main principle is, that you seed a generator with your seed and it generates a function that contains a hash function which has no collisions for all seed values. At runtime you give the function a string and it calculates the hash and then it checks if it is the only possible candidate corresponding to the hashvalue.

The runtime cost is hashing the string and then comparing against the only possible candidate (O(1) for seed size and O(1) for string length).

To make the comparison case insensitive you just use tolower on the seed and on your string.

Solution 6 - C++

Because number of string is not big (~100), you can use next algo:

  1. Calculate max length of word you have. Let it be N.
  2. Create int checks[N]; array of checksum.
  3. Let's checksum will be sum of all characters in searching phrase. So, you can calculate such checksum for each word from your list (that is known at compile time), and create std::map<int, std::vector<std::wstring>>, where int is checksum of string, and vector should contain all your strings with that checksum. Create array of such maps for each length (up to N), it can be done at compile time also.
  4. Now move over big string by pointer. When pointer points to X character, you should add value of X char to all checks integers, and for each of them (numbers from 1 to N) remove value of (X-K) character, where K is number of integer in checks array. So, you will always have correct checksum for all length stored in checks array. After that search on map does there exists strings with such pair (length & checksum), and if exists - compare it.

It should give false-positive result (when checksum & length is equal, but phrase is not) very rare.

So, let's say R is length of big string. Then looping over it will take O(R). Each step you will perform N operations with "+" small number (char value), N operations with "-" small number (char value), that is very fast. Each step you will have to search for counter in checks array, and that is O(1), because it's one memory block.

Also each step you will have to find map in map's array, that will also be O(1), because it's also is one memory block. And inside map you will have to search for string with correct checksum for log(F), where F is size of map, and it will usually contain no more then 2-3 strings, so we can in general pretend that it is also O(1).

Also you can check, and if there is no strings with same checksum (that should happens with high chance with just 100 words), you can discard map at all, storing pairs instead of map.

So, finally that should give O(R), with quite small O. This way of calculating checksum can be changed, but it's quite simple and completely fast, with really rare false-positive reactions.

Solution 7 - C++

As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.

This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.

This approach is harder if you have to be able to configure the strings without reconfiguration, but since flex is open source you could cannibalise its code.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionVictor MezrinView Question on Stackoverflow
Solution 1 - C++CinchBlueView Answer on Stackoverflow
Solution 2 - C++RiaDView Answer on Stackoverflow
Solution 3 - C++CoffeDeveloperView Answer on Stackoverflow
Solution 4 - C++LmTinyToonView Answer on Stackoverflow
Solution 5 - C++H. IddenView Answer on Stackoverflow
Solution 6 - C++ArkadyView Answer on Stackoverflow
Solution 7 - C++PJTraillView Answer on Stackoverflow