Efficiently querying one string against multiple regexes

Tags: Regex, Algorithm, Pcre

Regex Problem Overview


Let's say that I have 10,000 regexes and one string, and I want to find out whether the string matches any of them and get all the matches. The trivial way would be to test the string against each regex in turn. Is there a faster, more efficient way to do it?

EDIT: I have tried substituting DFAs (lex). The problem is that a DFA only gives you one single pattern. If I have the string "hello" and the patterns "[H|h]ello" and ".{0,20}ello", the DFA will only match one of them, but I want both of them to hit.

Regex Solutions


Solution 1 - Regex

This is the way lexers work.

The regular expressions are converted into a single nondeterministic finite automaton (NFA) and possibly transformed into a deterministic finite automaton (DFA).

The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.

There are many tools that can help you here. They are called "lexer generators", and there are solutions that work with most languages.

You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course, the traditional (f)lex is always an option.

Solution 2 - Regex

I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.

I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.

I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
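The prefiltering idea can be sketched in a few lines of Python. This is not the esmre code: the rule set below is hypothetical, the required substrings are picked by hand rather than extracted by a parser, and a plain `in` test stands in for the single Aho-Corasick pass described above.

```python
import re

# Hypothetical rules: each regex is paired with a literal substring that
# every string it matches must contain (chosen by hand for this sketch).
RULES = [
    ("ello", re.compile(r"[Hh]ello")),
    ("ello", re.compile(r".{0,20}ello")),
    ("bye", re.compile(r"good ?bye")),
]

def build_index(rules):
    """Map each required literal to the regexes that depend on it."""
    index = {}
    for literal, regex in rules:
        index.setdefault(literal, []).append(regex)
    return index

def find_matches(index, text):
    """Run only the regexes whose required literal occurs in the text."""
    hits = []
    for literal, regexes in index.items():
        # Assumption: a substring test stands in for one Aho-Corasick scan.
        if literal in text:
            hits.extend(r.pattern for r in regexes if r.search(text))
    return hits

index = build_index(RULES)
print(find_matches(index, "hello"))
```

With many rules, the literal lookup is where a real multi-string matcher pays off, because the number of full regex evaluations then depends on how many literals actually occur in the input.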

Solution 3 - Regex

We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.

The advantages are that it runs fast (each character is examined only once) and does not get any slower as you add more expressions.

Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).

The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
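As a rough illustration of walking a combined automaton once and reporting every expression that matched, here is a toy sketch in Python. It is an assumption-laden stand-in, not the hand-coded machine described above: patterns may contain only literal characters and "." (any character), they must match the whole string, and the DFA states are built lazily as sets of (pattern, position) pairs.

```python
def step(patterns, state, ch):
    """Advance every live (pattern id, position) pair by one character."""
    return frozenset(
        (pid, pos + 1)
        for pid, pos in state
        if pos < len(patterns[pid]) and patterns[pid][pos] in (".", ch)
    )

def match_all(patterns, text, cache=None):
    """Return the ids of all patterns that match the whole text."""
    cache = {} if cache is None else cache
    # Start state: every pattern is alive at position 0.
    state = frozenset((pid, 0) for pid in range(len(patterns)))
    for ch in text:
        key = (state, ch)
        if key not in cache:  # build this DFA transition on first use
            cache[key] = step(patterns, state, ch)
        state = cache[key]
    # A pattern "fires" if it has consumed exactly its full length.
    return sorted(pid for pid, pos in state if pos == len(patterns[pid]))

patterns = ["hello", ".ello", "help."]
print(match_all(patterns, "hello"))  # [0, 1]
```

Passing the same `cache` dictionary to repeated calls keeps the transitions already discovered, so each input character then costs one lookup regardless of how many patterns are loaded. The memory cost of that table is the disadvantage mentioned above.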

Solution 4 - Regex

Martin Sulzmann has done quite a bit of work in this field. He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.

The language used is Haskell, so it will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).

The code is not based on converting to a series of automata and then combining them; instead, it is based on symbolic manipulation of the regexes themselves.

Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.


  1. this is a joke - I like professors, the less the smart ones try to work the more chance I have of getting paid!

Solution 5 - Regex

10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?

As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.

This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.

If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).

Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful: if you had 1000 tests that would only return true if "ello" exists somewhere in the string, and your test string is "goodbye", you would only have to do the one test on "ello" to know that the 1000 tests requiring it won't match, so you won't have to run them.
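A minimal sketch of such a tree in Python might look like the following. The `Node` class, the gate patterns, and the grouping into genres are all hypothetical choices made for illustration; deciding which regexes share a gate is the hard part and is not automated here.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    gate: str                                     # cheap shared test for this branch
    regexes: list = field(default_factory=list)   # regexes that need the gate
    children: list = field(default_factory=list)  # narrower sub-genres

def collect(node, text, hits):
    """Append matching patterns to hits; return how many tests were run."""
    if not re.search(node.gate, text):
        return 1  # one failed gate rules out the whole branch
    tests = 1
    for pattern in node.regexes:
        tests += 1
        if re.search(pattern, text):
            hits.append(pattern)
    for child in node.children:
        tests += collect(child, text, hits)
    return tests

def find_matches(tree, text):
    hits = []
    tests = sum(collect(node, text, hits) for node in tree)
    return hits, tests

# Hypothetical genres: "needs a digit" and "needs the substring ello".
TREE = [
    Node(r"\d", [r"v\d+"], [Node(r"\d-\d", [r"\d{3}-\d{4}"])]),
    Node(r"ello", [r"[Hh]ello", r".{0,20}ello"]),
]

print(find_matches(TREE, "goodbye"))  # ([], 2)
```

For "goodbye" only the two top-level gates run, which is the saving this answer describes. For "hello" the second branch is still tested in full, so the question's own example gains nothing.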

Solution 6 - Regex

If you're thinking in terms of "10,000 regexes" you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something's gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.

Solution 7 - Regex

You'd need some way of determining whether a given regex is "additive" compared to another one, creating a regex "hierarchy" of sorts that lets you determine that all regexes on a certain branch do not match.

Solution 8 - Regex

Aho-Corasick was the answer for me.

I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.

Main caveat: the patterns to match were all plain-language patterns (literal strings), not regex patterns, e.g. 'cat' vs r'\w+'.

I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.

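import ahocorasick

# Build one automaton holding every literal pattern, tagged with its category.
A = ahocorasick.Automaton()

patterns = [
    [['cat', 'dog'], 'mammals'],
    [['bass', 'tuna', 'trout'], 'fish'],
    [['toad', 'crocodile'], 'amphibians'],
]

for row in patterns:
    vals = row[0]
    for val in vals:
        # The stored value is handed back whenever this word is found.
        A.add_word(val, (row[1], val))

# Finalize the trie into an Aho-Corasick automaton before searching.
A.make_automaton()

_string = 'tom loves lions tigers cats and bass'

def test():
    # A.iter() yields (end_index, value) for every occurrence, in one pass.
    vals = []
    for item in A.iter(_string):
        vals.append(item)
    return vals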

Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, took 2.09 ms versus 631 ms for sequential re.search() calls: roughly 300x faster.

Solution 9 - Regex

You could combine them in groups of maybe 20.

(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)

As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.

If regex1 matched, capture group 1 would hold its matched text. If not, it would be undefined/None/null/...
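A minimal sketch of this grouping with Python's `re` module follows. The helper names and the chunk size are illustrative, and the sketch assumes every input regex has zero capture groups. Note that all lookaheads are evaluated at a single position, so with `match()` each pattern is effectively tested from the start of the string.

```python
import re

def build_chunks(patterns, chunk_size=20):
    """Combine patterns into optional-lookahead regexes, chunk_size at a time."""
    chunks = []
    for start in range(0, len(patterns), chunk_size):
        group = patterns[start:start + chunk_size]
        for p in group:
            if re.compile(p).groups:
                raise ValueError("pattern has capture groups: %r" % p)
        combined = "".join("(?=(%s)?)" % p for p in group)
        chunks.append((start, re.compile(combined)))
    return chunks

def matching_indices(chunks, text):
    """Return the indices of all patterns that match at the start of text."""
    hits = []
    for start, rx in chunks:
        m = rx.match(text)  # always succeeds: every lookahead is optional
        for offset, captured in enumerate(m.groups()):
            if captured is not None:
                hits.append(start + offset)
    return hits

patterns = ["[Hh]ello", ".{0,20}ello", "goodbye"]
chunks = build_chunks(patterns, chunk_size=2)
print(matching_indices(chunks, "hello"))  # [0, 1]
```

Unlike a plain alternation, this reports both patterns from the question for "hello", at the cost of still evaluating every pattern once per chunk.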

Solution 10 - Regex

If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:

(r1)|(r2)|(r3)|...|(r10000)

where the parentheses are for grouping, not capturing. Anything that matches this regular expression matches at least one of your original regular expressions.
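In Python this union could be built as sketched below. The pattern list and function names are hypothetical, and the patterns are assumed to contain no capture groups of their own. The sketch also shows a limitation: an alternation answers "does anything match?", and with named groups it can report one winning branch, but not every pattern that would match.

```python
import re

patterns = ["[Hh]ello", ".{0,20}ello", "goodbye"]

# Non-capturing groups keep each alternative self-contained.
any_of = re.compile("|".join("(?:%s)" % p for p in patterns))

# Named groups reveal which branch won, but only one branch per match.
tagged = re.compile(
    "|".join("(?P<p%d>%s)" % (i, p) for i, p in enumerate(patterns))
)

def matches_any(text):
    """True if at least one of the original patterns matches."""
    return any_of.search(text) is not None

def winning_branch(text):
    """Index of the alternative the engine picked, or None if no match."""
    m = tagged.search(text)
    return None if m is None else int(m.lastgroup[1:])

print(matches_any("hello"), winning_branch("hello"))  # True 0
```

For "hello" both of the first two patterns would match on their own, yet only index 0 is reported, which is the single-pattern behaviour the question's EDIT complains about.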

Solution 11 - Regex

I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use Ragel, although it produces a single DFA, which can result in many states and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.

Solution 12 - Regex

I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching; one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into bytecode and running it in a specialized VM.

Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.

Solution 13 - Regex

I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.

However, it seems that your solution of trying each one iteratively is going to be far easier.

Solution 14 - Regex

You could compile the regexes into a hybrid DFA/Büchi automaton where, each time the BA enters an accept state, you flag which regex rule "hit".

Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.

Solution 15 - Regex

I use Ragel with a leaving action:

action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/  % hello |
        /.+ello/ % ello |
        any{0,20} "ello"  % ello2 ;

The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.

Ragel's regular expressions are quite limited, and its state machine language is preferred instead; the braces from your example only work with the more general language.

Solution 16 - Regex

Try combining them into one big regex?

Solution 17 - Regex

I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.

The longer answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.

Solution 18 - Regex

The fastest way to do it seems to be something like this (code is C#):

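using System.Collections.Generic;
using System.Text.RegularExpressions;

// Tests the input against every regex in turn and returns those that match.
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}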

Oh, you meant the fastest code? I don't know then...

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Sridhar Iyer | View Question on Stackoverflow
Solution 1 - Regex | Remo.D | View Answer on Stackoverflow
Solution 2 - Regex | Will Harris | View Answer on Stackoverflow
Solution 3 - Regex | Tim Farley | View Answer on Stackoverflow
Solution 4 - Regex | ShuggyCoUk | View Answer on Stackoverflow
Solution 5 - Regex | akdom | View Answer on Stackoverflow
Solution 6 - Regex | NDLefty | View Answer on Stackoverflow
Solution 7 - Regex | Eric Wendelin | View Answer on Stackoverflow
Solution 8 - Regex | Glen Thompson | View Answer on Stackoverflow
Solution 9 - Regex | Markus Jarderot | View Answer on Stackoverflow
Solution 10 - Regex | EfForEffort | View Answer on Stackoverflow
Solution 11 - Regex | Adrian D. Thurston | View Answer on Stackoverflow
Solution 12 - Regex | Javier | View Answer on Stackoverflow
Solution 13 - Regex | warren | View Answer on Stackoverflow
Solution 14 - Regex | paxos1977 | View Answer on Stackoverflow
Solution 15 - Regex | hroptatyr | View Answer on Stackoverflow
Solution 16 - Regex | yfeldblum | View Answer on Stackoverflow
Solution 17 - Regex | Marcin | View Answer on Stackoverflow
Solution 18 - Regex | RCIX | View Answer on Stackoverflow