Fuzzy Regular Expressions

RegexStringFuzzy SearchFuzzy ComparisonTre Library

Regex Problem Overview


In my work I have with great results used approximate string matching algorithms such as Damerau–Levenshtein distance to make my code less vulnerable to spelling mistakes.

Now I have a need to match strings against simple regular expressions such TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0 while T Schedule for 10. Jan should return 2.

This could be done by generating all strings in the regex (in this case 100x12) and find the best match, but that doesn't seam practical.

Do you have any ideas how to do this effectively?

Regex Solutions


Solution 1 - Regex

I found the TRE library, which seems to be able to do exactly fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution though. No transposition. But I guess that works ok.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March

Thanks a lot for all your suggestions.

Solution 2 - Regex

Also see: The Python regex (newer version, Oct '14) (search for "fuzzy" inside the document).

If you're not a Python guy (neither I am), you could compile your code to C (exe/dll). Then you would be able to use your dll even from good old vb6 (and the like).

Other libraries to choose from:

  • TRE/agrep ('classic, good, old and fast) (search for 'agrep performace'), but you need to write POSIX compatible regex (search for 'regular expressions info posix') Of course, all libraries/examples using TRE have this limitation (search for 'hackerboss approximate regex matching in python'). For massive data: search for 'A fast CUDA implementation of agrep algorithm'.
  • FREJ (Java) - some (more) limitations (for instance, no look ahead/look behind)
  • fuzzy-wuzzy (Python-based) - worth looking at, not tested...

Search also for:

  • 'Comparison_of_regular_expression_engines'
  • 'regular-expressions.info tools'

(sorry for not be able to post real links)

Solution 3 - Regex

I just use the regex module: 'Alternative regular expression module, to replace re.' It provides the familiarity of re but includes options for fuzzy matching, along with several other improvements on re.

For Windows binaries, see this resource.

Solution 4 - Regex

Here is a resource on the question you are asking. It is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do a fuzzy search, biased for special language (e.g. Arabic vs. English), on a large dataset.

In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.

Solution 5 - Regex

Have you considered using a lexer?

I've never actually used one so i can't be much help, but it sounds like it fits!

Solution 6 - Regex

I started to implement a Java tool called prex for approximate regular expression matching. The tool determines how far a string s is from matching a regular expression r, i.e. how many insertions, deletions and substitutions on s are at least required (minimum cost) such that the resulting string s' is acceptable by r. If your are interested, you can check out the code from https://github.com/julianthome/prex. I would be very happy to get some feedback. Note, that the approach is still a bit slow, but I am currently incorporating some heuristics for improving its performance.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionThomas AhleView Question on Stackoverflow
Solution 1 - RegexThomas AhleView Answer on Stackoverflow
Solution 2 - RegexMihail StanculescuView Answer on Stackoverflow
Solution 3 - RegexDavid CView Answer on Stackoverflow
Solution 4 - RegexbmarguliesView Answer on Stackoverflow
Solution 5 - RegexPaul CreaseyView Answer on Stackoverflow
Solution 6 - RegexJulianView Answer on Stackoverflow