What's the Time Complexity of Average Regex algorithms?

RegexAlgorithmComplexity Theory

Regex Problem Overview


I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.

I'm not so good at algorithmic analysis though and don't understand how a regex compares to say, a basic linear search. I'm asking because on the surface it seems like a linear array search. (If the regex is simple.)

Where could I go to learn more about implementing a regex engine?

Regex Solutions


Solution 1 - Regex

This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast . Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).

Solution 2 - Regex

Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?

Real regular expressions (when I say real I'm refering to those regex that recognize Regular Languages, and not the regex that almost every programming language include with backreferences, etc) can be converted into a DFA/NFA and both can be implemented in a mechanical way in a programming language (a NFA can be converted into a DFA)

What you have to do is:

  1. Find a way to convert a regex into an automaton
  2. Implement the recognition of the automaton in the programming language of your preference

That way, given a regex you can convert it to a DFA and run it to see if it matches or not a specified text.

This can be implemented in O(n), because DFA don't go backward (like a Turing Machine), so it matches the string or not. That is supposing you won't take in count overlapped matches, otherwise you will have to go back and start matching again...

Solution 3 - Regex

The classic regular expression can be implemented in a way which is fast in practice but has really bad worst case behaviour (the standard DFA) or in a way which has guaranteed reasonable worst case behaviour (keeping it as an NFA). The standard DFA can be extended to support lots of extra matching characters and flags, which make use of the fact that it is basically back-tracking search.

Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.

If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionavgvstvsView Question on Stackoverflow
Solution 1 - RegexporgesView Answer on Stackoverflow
Solution 2 - RegexOscar MederosView Answer on Stackoverflow
Solution 3 - RegexmcdowellaView Answer on Stackoverflow