Regular expression to search for Gadaffi


Regex Problem Overview


I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?

This is a list of 30 variants:

Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi

My best attempt so far is:

\b[KG]h?add?af?fi$\b

But I still seem to be missing some variants. Any suggestions?

Regex Solutions


Solution 1 - Regex

Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.

Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
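A minimal Python sketch of this approach, assuming the 30 spellings from the question are the complete set. The names `VARIANTS`, `NAME_RE` and `find_names` are made up for illustration.

```python
import re

# The 30 spellings listed in the question.
VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

# Escape each spelling and join them with "|". Longest first, so a longer
# spelling is tried before any shorter one that happens to be its prefix.
alternatives = sorted(VARIANTS, key=len, reverse=True)
NAME_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")

def find_names(text):
    """Return every listed spelling that occurs in text as a whole word."""
    return NAME_RE.findall(text)

print(find_names("Gaddafi, also spelled Kad'afi or Khadafy."))
```

Adding a newly discovered spelling is then a one-word change to the list rather than a rewrite of the pattern.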

Solution 2 - Regex

\b[KGQ]h?add?h?af?fi\b

The Arabic transcription is (Wiki says) "Qaḏḏāfī", so it may be worth adding a Q, and an optional H ("Gadhafi", as the article below mentions).

Btw, why is there a $ at the end of the regex?
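To illustrate what the `$` does, here is a small Python sketch comparing the question's pattern with and without it. The two sample sentences are made up for the demonstration.

```python
import re

with_dollar = re.compile(r"\b[KG]h?add?af?fi$\b")
without_dollar = re.compile(r"\b[KG]h?add?af?fi\b")

mid_sentence = "Gaddafi ruled Libya"
at_end = "Libya was ruled by Gaddafi"

# "$" anchors the match to the end of the input, so the name is only
# found when it is the very last word.
for text in (mid_sentence, at_end):
    print(text, "|", bool(with_dollar.search(text)), bool(without_dollar.search(text)))
```

With the `$` in place, the name is missed whenever anything follows it in the text being searched.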


Btw, nice article on the topic:

Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.


EDIT

This should match all the names in the list you added later. Let's just hope it won't match a lot of other stuff :D

\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
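A quick way to check both halves of that claim is to run the pattern over the question's list and over a few spellings that are not on it. This is only a sketch: the `extras` words are invented here for illustration and are not attested spellings.

```python
import re

EDIT_RE = re.compile(r"\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b")

# The 30 spellings listed in the question.
VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

# Listed spellings the pattern fails to match (ideally none).
missed = [v for v in VARIANTS if not EDIT_RE.fullmatch(v)]

# Hypothetical words that are NOT in the list, to probe for over-matching.
extras = ["Kuzzaffy", "Ghedhafy", "Gadget"]
accepted_extras = [w for w in extras if EDIT_RE.fullmatch(w)]

print("missed:", missed)
print("accepted although unlisted:", accepted_extras)
```

The pattern covers the whole list, but because each position is generalized independently it also accepts combinations nobody has listed.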

Solution 3 - Regex

One interesting thing to note from your list of potential spellings is that there are only three Soundex values for the whole list (if you ignore the outlier 'Kazzafi'):

G310, K310, Q310

Now, there are false positives in there ('Godby' is also G310), but by combining them with the limited set of Metaphone hits as well, you can eliminate them.

<?php
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');

$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";

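// Split the text into words, then keep the words whose Soundex AND Metaphone codes both hit.
$matches = array();
$wordArray = preg_split('/[\s,.;-]+/',$text);
foreach ($wordArray as $item){
	$rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
	if ($rate > 1){
		// Escape the word so it can be used safely inside the pattern below.
		$matches[] = preg_quote($item, '/');
	}
}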
$pattern = implode("|",$matches);
$text = preg_replace("/($pattern)/","<b>$1</b>",$text);
echo $text;
?>

With a few tweaks, and let's say some Cyrillic transliteration, you'll have a fairly robust solution.
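For readers without PHP at hand, the Soundex half of this idea can be sketched in Python. Python's standard library has no `soundex()`, so the function below is a hand-written version of the standard American Soundex rules and may differ from PHP's built-in in edge cases. The Metaphone check is left out.

```python
def soundex(word):
    """American Soundex: first letter plus three digits, e.g. 'G310'."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    # Drop apostrophes and anything else that is not a letter.
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = codes.get(ch, "")  # vowels have no code and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]   # pad with zeros, keep four characters

for name in ("Gaddafi", "Khadafy", "Qadhdhafi", "Kad'afi", "Kazzafi", "Godby"):
    print(name, soundex(name))
```

The loop at the end prints the codes for a few of the listed spellings, the outlier 'Kazzafi', and the false positive 'Godby' mentioned above.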

Solution 4 - Regex

Using CPAN module Regexp::Assemble:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';   # say() below is not enabled by default
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
                    Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
                    Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
                    Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
                    Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
                    Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;

This produces the following regular expression:

(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))

Solution 5 - Regex

I think you're over complicating things here. The correct regex is as simple as:

\u0627\u0644\u0642\u0630\u0627\u0641\u064a

It matches the concatenation of the seven Arabic Unicode code points that form the word القذافي (i.e. Gadaffi).
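In Python this could look like the sketch below. The sample text is built from the same escape sequences, so the example does not depend on how the Arabic letters were typed. The names `ARABIC_NAME` and `text` are illustrative.

```python
import re

# The seven code points from the pattern above, written as string escapes.
ARABIC_NAME = "\u0627\u0644\u0642\u0630\u0627\u0641\u064a"
ARABIC_RE = re.compile(ARABIC_NAME)

# Hypothetical input that mixes Latin and Arabic script.
text = "Muammar " + ARABIC_NAME + " (Libya)"

match = ARABIC_RE.search(text)
print(len(ARABIC_NAME), match is not None)
```

This only helps, of course, when the text being searched is written in Arabic script in the first place.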

Solution 6 - Regex

If you want to avoid matching things that no one has used (i.e. avoid tending towards ".+"), your best approach would be to create a regular expression that's just all the alternatives (e.g. (Qadafi|Kadafi|...)), compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation, that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.
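A full DFA round trip is more than fits in a short example, so the Python sketch below uses a simpler stand-in: it builds a prefix tree (trie) from the word list and turns that back into a regular expression. This only merges common prefixes, not common suffixes, so the result is less compact than a minimized DFA would give, but it has the same guarantee of matching exactly the listed words. `build_trie` and `trie_to_regex` are names invented for this sketch.

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment matching all of its suffixes."""
    if list(node) == [""]:
        return ""                      # nothing but end-of-word below here
    branches = []
    optional = False
    for ch, child in sorted(node.items()):
        if ch == "":
            optional = True            # a word ends here, the rest is optional
        else:
            branches.append(re.escape(ch) + trie_to_regex(child))
    if len(branches) == 1 and not optional:
        return branches[0]             # no grouping needed for a single path
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body

# A handful of the listed spellings, to keep the output readable.
words = ["Gadafi", "Gadafy", "Gaddafi", "Kad'afi", "Qadafi", "Qaddafi"]
compact = trie_to_regex(build_trie(words))
print(compact)
```

Wrapping the result in `\b...\b` before compiling it gives a pattern that can be dropped into a search.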

Solution 7 - Regex

If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize it further, and, well, with 30 choices even if it doesn't it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.

Solution 8 - Regex

(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)

Certainly not the most optimized version. It is split on syllables to maximize matches while trying to make sure we don't get false positives.

Solution 9 - Regex

Well, since you are matching short words, why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.
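As a rough illustration of the idea without any library, here is a plain dynamic-programming Levenshtein distance in Python, plus a helper that accepts a word when it is within `k` edits of any reference spelling. Note that the classic Levenshtein distance counts substitutions as well as insertions and deletions. The helper name `is_close`, the default `k=2` and the reference list are assumptions made for this sketch.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

def is_close(word, references, k=2):
    """True if word is within k edits of any reference spelling (case-insensitive)."""
    return any(levenshtein(word.lower(), ref.lower()) <= k for ref in references)

references = ["Gaddafi", "Khadafy", "Qadhafi"]   # illustrative subset
for word in ("Qaddafi", "Gadaffi", "Godfrey"):
    print(word, is_close(word, references))
```

The threshold is the tuning knob: a larger `k` catches more spellings but also lets in more unrelated words.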

Solution 10 - Regex

A possible alternative is an online tool for generating regular expressions from examples: http://regex.inginf.units.it. Give it a chance!

Solution 11 - Regex

Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.

Regex is about pattern matching, and I can't see a single pattern for all variants in the list. Trying to force one will also find things like "Gazzafy" or "Quud'haffi", which are most probably not used variants and definitely not on the list.

But I can see patterns for some of the variants, and so I ended up with this:

\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b

At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.

See it here on www.rubular.com

Solution 12 - Regex

I know this is an old question, but...

Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.

"Little Beauty" #1

(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)

"Little Beauty" #2

(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y

Rest in Peace, Muammar.
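The claim that both patterns match all the variations can be checked mechanically. The Python sketch below runs each pattern against the 30 spellings from the question with `fullmatch`, and against a few made-up words that should be rejected. The names `BEAUTY_1`, `BEAUTY_2` and `non_variants` are illustrative.

```python
import re

BEAUTY_1 = re.compile(
    r"(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)"
    r"|h(?:ad(?:daf[iy]|af?fi)|eddafi))"
    r"|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))"
    r"|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)"
)
BEAUTY_2 = re.compile(
    r"(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd"
    r"|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i"
    r"|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y"
)

# The 30 spellings listed in the question.
VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

# Hypothetical words that are not on the list.
non_variants = ["Gazzafy", "Kuzzaffy", "Gadget"]

for label, pattern in (("#1", BEAUTY_1), ("#2", BEAUTY_2)):
    missed = [v for v in VARIANTS if not pattern.fullmatch(v)]
    extra = [w for w in non_variants if pattern.fullmatch(w)]
    print(label, "missed:", missed, "unexpected:", extra)
```

Neither pattern includes word boundaries, so for searching running text they would still need to be wrapped in `\b...\b`.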

Solution 13 - Regex

Just an addendum: you should add "Gheddafi" as an alternate spelling. So the RE should be

\b[KG]h?[ae]dd?af?fi$\b

Solution 14 - Regex

> [GQK][ahu]+[dtez]+'?[adhz]+f{1,2}(i|y)

In parts:

  • [GQK]
  • [ahu]+
  • [dtez]+
  • '?
  • [adhz]+
  • f{1,2}(i|y)

Note: Just wanted to give a shot at this.

Solution 15 - Regex

What else starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi" that people actually search for?

/\b[GQK].+[dzt].+fi\b/i

Done.

>>> import re
>>> a = re.compile(r"\b[GQK].+[dzt].+fi\b", re.IGNORECASE)
>>> print(re.search(a, "Gadasadasfiasdas") is not None)
False
>>> print(re.search(a, "Gadasadasfi") is not None)
True
>>> print(re.search(a, "Qa'dafi") is not None)
True

Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?
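For what it's worth, false positives are easy to construct once the pattern is applied to running text rather than to single words, because `.+` happily spans spaces. A small Python sketch follows; the sample sentences are made up for illustration.

```python
import re

loose = re.compile(r"\b[GQK].+[dzt].+fi\b", re.IGNORECASE)

# Ordinary sentences with no name in them. Each starts a word with g/q/k,
# contains a d/z/t later on, and ends a word with "fi".
samples = [
    "Kids love wifi",
    "Quit reading sci-fi",
]

for text in samples:
    m = loose.search(text)
    print(repr(text), "->", m.group() if m else None)
```

Replacing `.+` with something that cannot cross word boundaries, such as `\w+`, would at least confine matches to a single word.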

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author
Question              SiggyF
Solution 1 - Regex    Chris Pacejo
Solution 2 - Regex    Czechnology
Solution 3 - Regex    tomwalsham
Solution 4 - Regex    Prakash K
Solution 5 - Regex    Staffan Nöteberg
Solution 6 - Regex    andrew cooke
Solution 7 - Regex    Jeremy Bowers
Solution 8 - Regex    Sneaky
Solution 9 - Regex    Arnoldo Muller
Solution 10 - Regex   mimmuz
Solution 11 - Regex   stema
Solution 12 - Regex   zx81
Solution 13 - Regex   Vito De Tullio
Solution 14 - Regex   Dinko Pehar
Solution 15 - Regex   Hank