Regex to match Egyptian Hieroglyphics

Regex, Unicode, Internationalization

Regex Problem Overview


I want a regex that matches Egyptian hieroglyphs. I am completely clueless and need your help.

I cannot post the characters themselves, as Stack Overflow doesn't seem to recognize them.

Can anyone tell me the Unicode range for these characters?

Regex Solutions


Solution 1 - Regex

TL;DR: \p{Egyptian_Hieroglyphs}

###JavaScript

Egyptian_Hieroglyphs belong to an "astral" plane, where a character needs more than 16 bits to encode. JavaScript, as of ES5, doesn't support astral planes in regular expressions, so you have to match surrogate pairs instead. The first character in the range encodes as the pair

U+13000 = d80c dc00

the last one is

U+1342E = d80d dc2e

that gives

// Match runs of surrogate pairs covering U+13000 to U+1342E.
var re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g;

var t = document.getElementById("pyramid").innerHTML;
document.write("<h1>Found</h1>" + t.match(re));

<div id="pyramid">

  some     𓀀    really    𓀁    old    𓐬    stuff    𓐭        𓐮
  
  </div>

This is what it looks like with the Noto Sans Egyptian Hieroglyphs font installed:

[screenshot of the rendered output]
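The surrogate pairs quoted above follow from the standard UTF-16 arithmetic. As a minimal sketch (written in Python rather than JavaScript, with `to_surrogates` being a hypothetical helper name introduced only for illustration), the values can be derived like this:

```python
def to_surrogates(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


for cp in (0x13000, 0x1342E):
    high, low = to_surrogates(cp)
    print(f"U+{cp:05X} = {high:04x} {low:04x}")
# U+13000 = d80c dc00
# U+1342E = d80d dc2e
```

Because the range spans two different high surrogates (d80c and d80d), the regex needs two alternatives, one per high surrogate.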

###Other languages

On platforms that support UCS-4 you can use the Egyptian code points 13000 to 1342F directly, but the syntax differs from system to system. For example, in Python (3.3 and up) the character class is [\U00013000-\U0001342E]:

>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']

Finally, if your regex engine supports Unicode properties, you can (and should) use these instead of hardcoded ranges. For example, in PHP/PCRE:

$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭  𓐮";

preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);

prints

[0] => Array
    (
        [0] => 𓀀
        [1] => 𓀁
        [2] => 𓐬
        [3] => 𓐭
        [4] => 𓐮
    )
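Python's built-in re module has no \p{...} syntax, so the property-based approach cannot be copied over directly. One rough stand-in, sketched below, is to test each character's Unicode name with the standard unicodedata module. The helper name `is_egyptian_hieroglyph` is hypothetical, and the name-prefix test is an assumption that only approximates the script property:

```python
import unicodedata


def is_egyptian_hieroglyph(ch):
    # Assumption: assigned hieroglyphs carry a Unicode name that starts with
    # "EGYPTIAN HIEROGLYPH". Unassigned code points have no name at all,
    # so the default "" makes them fail the test.
    return unicodedata.name(ch, "").startswith("EGYPTIAN HIEROGLYPH")


text = "some \U00013000 really \U00013001 old \U0001342C stuff"
found = [ch for ch in text if is_egyptian_hieroglyph(ch)]
print([f"U+{ord(ch):05X}" for ch in found])
# ['U+13000', 'U+13001', 'U+1342C']
```

Unlike a hardcoded range, this follows whatever Unicode version the Python build ships with, at the cost of being a per-character check rather than a regex.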

Solution 2 - Regex

Unicode encodes Egyptian hieroglyphs in the range U+13000 to U+1342F (beyond the Basic Multilingual Plane).

In this case, there are 2 ways to write the regex:

  1. By specifying a character range from U+13000 to U+1342F.

    While specifying a character range for characters in the BMP is as easy as [a-z], doing so for characters in astral planes might not be as simple, depending on the language support.

  2. By specifying the Unicode block for Egyptian hieroglyphs.

    Since we are matching any character in the Egyptian hieroglyphs block, this is the preferred way to write the regex where support is available.

Java

(I currently have no idea how other implementations of the Java Class Library deal with astral plane characters in their Pattern classes.)

Sun/Oracle implementation

I'm not sure if it makes sense to talk about matching characters in astral planes in Java 1.4, since support for characters beyond the BMP was only added in Java 5, by retrofitting the existing String implementation (which uses UCS-2 for its internal representation) with code point-aware methods.

Since Java continues to allow lone surrogates (ones that cannot pair with another surrogate) in a String, the result is a mess: surrogates are not real characters, and lone surrogates are invalid in UTF-16.

The Pattern class saw a major overhaul from Java 1.4.x to Java 5: it was rewritten to support matching Unicode characters in astral planes. The pattern string is converted to an array of code points before it is parsed, and the input string is traversed with the code point-aware methods of the String class.

You can read more about the madness in Java regex in this answer by tchrist.

I have written a detailed explanation of how to match a range of characters that involves astral plane characters in this answer, so I am only going to include the code here. That answer also includes a few counter-examples of incorrect attempts to write a regex that matches astral plane characters.

Java 5 (and above)
"[\uD80C\uDC00-\uD80D\uDC2F]"
Java 7 (and above)
"[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
"[\\x{13000}-\\x{1342F}]"

Since we are matching any code point that belongs to the Unicode block, it can also be written as:

"\\p{InEgyptian_Hieroglyphs}"
"\\p{InEgyptian Hieroglyphs}"
"\\p{InEgyptianHieroglyphs}"

"\\p{block=EgyptianHieroglyphs}"
"\\p{blk=Egyptian Hieroglyphs}"

Java has supported the \p syntax for Unicode blocks since 1.4, but support for the Egyptian Hieroglyphs block was only added in Java 7.

PCRE (used in PHP)

A PHP example is already covered in georg's answer:

'~\p{Egyptian_Hieroglyphs}~u'

Note that the u flag is mandatory if you want to match by code points instead of by code units.
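The difference between code points and code units can be illustrated outside PHP as well. The sketch below is only an analogy in Python, not a description of PCRE internals: a str pattern works on code points, while a bytes pattern works on the individual bytes (the UTF-8 code units) of the same text.

```python
import re

text = "old \U00013000 stuff"
utf8 = text.encode("utf-8")

# str pattern: the class is compared against whole code points,
# so the hieroglyph is found as a single match.
by_code_point = re.findall("[\U00013000-\U0001342F]", text)

# bytes pattern: every match is a single byte, so the four-byte
# UTF-8 sequence of the hieroglyph comes back as four separate matches.
by_code_unit = re.findall(rb"[\x80-\xff]", utf8)

print(len(by_code_point))  # 1
print(len(by_code_unit))   # 4
```

A range such as U+13000 to U+1342F can only be expressed directly in the first mode, which is why the flag matters.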

I'm not sure if there is a better post on Stack Overflow, but I have written some explanation of the effect of the u flag (UTF mode) in this answer of mine.

One thing to note is that Egyptian_Hieroglyphs is only available from PCRE 8.02 (or a version not earlier than PCRE 7.90).

As an alternative, you can specify a character range with \x{h...hh} syntax:

'~[\x{13000}-\x{1342F}]~u'

Note the mandatory u flag.

The \x{h...hh} syntax is supported from at least PCRE 4.50.

JavaScript (ECMAScript)

ES5

The character range method (which is the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex is modified a bit to cover the whole block, including the reserved, unassigned code point.

/(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/

The solution above demonstrates the technique for matching a range of characters in an astral plane, and also the limitations of JavaScript RegExp.

JavaScript also suffers from the same string representation problem as Java. While Java fixed the Pattern class in Java 5 to allow it to work with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code points in the regular expression.
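To see what working with code units means in practice, the ES5 pattern above can be replayed in Python over an explicit sequence of UTF-16 code units. This is an illustrative sketch only; `utf16_units` and `join_surrogates` are hypothetical helper names, and Python itself does not need this technique because its strings are indexed by code point.

```python
import re
import struct


def utf16_units(text):
    """Return a str whose characters are the UTF-16 code units of text."""
    raw = text.encode("utf-16-le")
    units = struct.unpack(f"<{len(raw) // 2}H", raw)
    return "".join(chr(u) for u in units)


def join_surrogates(units):
    """Turn a str of UTF-16 code units back into ordinary code points."""
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# Same alternation as the ES5 regex: one branch per high surrogate.
PAIR_RE = re.compile(r"(?:\ud80c[\udc00-\udfff]|\ud80d[\udc00-\udc2f])")

units = utf16_units("some \U00013000 old \U0001342E stuff")
matches = [join_surrogates(m) for m in PAIR_RE.findall(units)]
print([f"U+{ord(ch):05X}" for ch in matches])
# ['U+13000', 'U+1342E']
```

Each hieroglyph occupies two positions in `units`, which is exactly the view of the string that an ES5 regex sees.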

ES6

Finally, support for code point matching was added in ECMAScript 6. It is enabled via the u flag so that existing code written for previous versions of ECMAScript does not break.

Check the Support section of the second link above for the list of browsers providing experimental support for ES6 RegExp.

With the introduction of \u{h...hh} syntax in ES6, the character range can be rewritten in a manner similar to Java 7:

/[\u{13000}-\u{1342F}]/u

Or you can also directly specify the character in the RegExp literal, though the intention is not as clear cut as [a-z]:

/[𓀀-𓐯]/u

Note the u modifier in both regexes above.

Still stuck with ES5? Don't worry, you can transpile ES6 Unicode RegExp to ES5 RegExp with regexpu.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | user4628064 | View Question on Stackoverflow
Solution 1 - Regex | georg | View Answer on Stackoverflow
Solution 2 - Regex | nhahtdh | View Answer on Stackoverflow