Matching accented characters with Javascript regexes

Tags: Javascript, Regex, Unicode, Internationalization

Javascript Problem Overview


Here's a fun snippet I ran into today:

/\ba/.test("a") --> true
/\bà/.test("à") --> false

However,

/à/.test("à") --> true

Firstly, wtf?

Secondly, if I want to match an accented character at the start of a word, how can I do that? (I'd really like to avoid using over-the-top selectors like /(?:^|\s|'|\(\) ....)

Javascript Solutions


Solution 1 - Javascript

This worked for me:

/^[a-z\u00E0-\u00FC]+$/i

With help from here
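A quick way to see what this range does and does not cover is to run it against a few sample strings. The following is only a sketch: the sample words are illustrative, and \u escapes are used so the precomposed code points are unambiguous.

```javascript
// Solution 1's pattern: ASCII letters plus U+00E0 (à) through U+00FC (ü),
// case-insensitive, anchored to the whole string.
const latin1Word = /^[a-z\u00E0-\u00FC]+$/i;

const samples = {
  cafe: latin1Word.test("caf\u00E9"),        // "café": é is U+00E9, inside the range
  upperA: latin1Word.test("\u00C0"),         // "À": matched through the i flag
  division: latin1Word.test("\u00F7"),       // "÷": U+00F7 also sits inside the range
  yDiaeresis: latin1Word.test("\u00FF"),     // "ÿ": U+00FF is past the end of the range
  oeLigature: latin1Word.test("\u0153uvre"), // "œuvre": œ is U+0153, outside Latin-1
};

console.log(samples);
// cafe: true, upperA: true, division: true, yDiaeresis: false, oeLigature: false
```

So the range is a reasonable fit for common Western European text, but it lets the division sign through and rejects letters such as ÿ and œ.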

Solution 2 - Javascript

/\bà/.test("à") doesn't match because "à" is not a word character. The escape sequence \b matches only at a boundary between a word character and a non-word character. /\ba/.test("a") matches because "a" is a word character, so there is a boundary between the beginning of the string (which does not count as a word character) and the letter "a".

Word characters in JavaScript regexes are defined as [a-zA-Z0-9_].
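The boundary rule can be checked directly. This is a small sketch, with variable names chosen only for illustration, showing where \b does and does not find a boundary around "à":

```javascript
const aGrave = "\u00E0"; // "à" as a single precomposed code point

const checks = {
  // \w is ASCII-only, so à is not a word character; the u flag alone does not change this.
  wordChar: /\w/.test(aGrave),
  wordCharUnicodeFlag: /\w/u.test(aGrave),
  // In "àb" there is a boundary between à (non-word) and b (word).
  boundaryBeforeB: /\bb/.test(aGrave + "b"),
  // At the start of "à" there is no boundary, so \B (not-a-boundary) matches there instead.
  nonBoundaryBeforeA: /\B\u00E0/.test(aGrave),
};

console.log(checks);
// wordChar: false, wordCharUnicodeFlag: false, boundaryBeforeB: true, nonBoundaryBeforeA: true
```

The third check is the practical trap: in a word like "àb", \b treats "b" as the start of a word.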

To match an accented character at the start of a string, just use the ^ anchor at the beginning of the regex (e.g. /^à/). It matches the beginning of the string (or of a line, with the m flag), unlike \b, which matches at any word boundary within the string. It's one of the most basic and standard regex constructs, so it's definitely not over the top.
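Note that ^ only covers the start of the string (or line), not the start of every word. A short sketch with made-up sample strings shows the difference:

```javascript
const startsWithAGrave = /^\u00E0/;      // ^ = start of the string
const lineStartsWithAGrave = /^\u00E0/m; // with m, ^ also matches right after a line break

const anchorResults = {
  atStringStart: startsWithAGrave.test("\u00E0 la carte"),    // "à la carte"
  midString: startsWithAGrave.test("voil\u00E0 \u00E0 toi"),  // "voilà à toi": à starts the second word
  secondLine: startsWithAGrave.test("menu\n\u00E0 la carte"),
  secondLineMultiline: lineStartsWithAGrave.test("menu\n\u00E0 la carte"),
};

console.log(anchorResults);
// atStringStart: true, midString: false, secondLine: false, secondLineMultiline: true
```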

Solution 3 - Javascript

Stack Overflow has also dealt with non-ASCII characters in regexes; you can find it here. That question is not about word boundaries, but it may give you useful hints anyway.

There is another page, but its author wants to match strings, not words.

I don't know of an anchor that solves your problem, and couldn't find one just now. But seeing the monster regexes used in my first link, the group you want to avoid is not over the top and is, in my opinion, your solution.

Solution 4 - Javascript

const regex = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
const test1 = regex.test("à");
const test2 = regex.test("Martinez-Cortez");
const test3 = regex.test("Leonardo da vinci");
const test4 = regex.test("ï");

console.log('test1', test1);
console.log('test2', test2);
console.log('test3', test3);
console.log('test4', test4);

Building off of Wak's and Cœur's answer:

/^[\-/A-Za-z\u00C0-\u017F ]+$/

Works for spaces, dashes and forward slashes too.

Example: Leonardo da vinci, Martinez-Cortez

Solution 5 - Javascript

If you want to match letters, whether or not they're accented, Unicode property escapes can be helpful (they require the u flag).

/\p{Letter}/u.test("à"); // true
/\p{Letter}/u.test('œ'); // true
/\p{Letter}/u.test('a'); // true
/\p{Letter}/u.test('3'); // false

Matching at the start of a word is trickier, but (?<=(?:^|\s)) seems to do the trick. The (?<= ) is a positive lookbehind, which requires that something precede the main expression without making it part of the match. The (?: ) is a non-capturing group, so this part doesn't produce a capture you'd have to skip over later. The ^ matches the start of the string (or the start of a line if the multiline flag is set), and \s matches a whitespace character (space, tab, line break).

So using them together, it would look something like:

/(?<=(?:^|\s))\p{Letter}*/u

If you only want to match words that start with an accented character, add a negated character set for a-zA-Z. Note that [^a-zA-Z] also accepts digits and punctuation, not just accented letters.

/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("bœ") // false
/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("œb") // true

// Match a run of letters (accented or not) at the end of the string
let regex = /\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log(regex.test("16 tons")); // true
console.log(regex.test("3 œ")); // true

console.log('-----');

// Match a run of letters that starts a word and runs to the end of the string

regex = /(?<=(?:^|\s))\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false

console.log('----');

// Same, but the first character of the word must not be an ASCII letter

regex = /(?<=(?:^|\s))[^a-zA-Z]\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // false
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
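Building on the lookbehind pattern above, here is one possible sketch for pulling out every word that starts with a non-ASCII letter. The function name and sample sentence are hypothetical. It swaps the negated set [^a-zA-Z] for a negative lookahead in front of \p{Letter}, so digits and punctuation cannot start a match:

```javascript
// Hypothetical helper: returns every whitespace-delimited run of letters whose
// first character is a letter outside ASCII a-z / A-Z.
function wordsStartingWithNonAsciiLetter(text) {
  // (?<=^|\s)     lookbehind: start of string or a whitespace character
  // (?![a-zA-Z])  the first character must not be an ASCII letter
  // \p{Letter}+   one or more letters from any script
  const pattern = /(?<=^|\s)(?![a-zA-Z])\p{Letter}+/gu;
  return Array.from(text.matchAll(pattern), (m) => m[0]);
}

// "été à la plage œuvre", written with escapes to pin down the composed forms
const found = wordsStartingWithNonAsciiLetter(
  "\u00E9t\u00E9 \u00E0 la plage \u0153uvre"
);
console.log(found); // [ 'été', 'à', 'œuvre' ]
```

This treats any non-ASCII letter as "accented", so a Greek or Cyrillic word would also be returned. It also needs an engine with lookbehind support.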

Solution 6 - Javascript

Unicode allows for two alternative but equivalent representations of some accented characters. For example, é has two Unicode representations: '\u00E9' and '\u0065\u0301'. The former is called the composed form and the latter the decomposed form. JavaScript allows for conversion between the two:

'\u00E9'.normalize('NFD') // decompose: '\u00E9' -> '\u0065\u0301'
'\u0065\u0301'.normalize('NFC') // compose: '\u0065\u0301' -> '\u00E9'
'\u00E9'.length // composed form: -> 1
'\u0065\u0301'.length // decomposed form: -> 2 (looks identical but has different representation)
'\u00E9' == '\u0065\u0301' // -> false (composed and decomposed strings are not equal)

The code point '\u0301' belongs to the Unicode Combining Diacritical Marks block (U+0300-U+036F). So one way to match these accented characters is to compare them in decomposed form:

// matching accented characters
/[a-zA-Z][\u0300-\u036f]+/.test('é'.normalize('NFD')) // -> true
/\b\u00E9/.test('\u00E9') // -> false (composed é is not a word character)
/\b\u0065\u0301/.test('é'.normalize('NFD')) // -> true (NOTE: the pattern uses the decomposed form, so \b sits before the plain "e")

// matching accented words
/^\w+$/.test('résumé') // -> false
/^(?:[a-zA-Z][\u0300-\u036f]*)+$/.test('résumé'.normalize('NFD')) // -> true
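If the goal is accent-insensitive matching, the same decomposition can be taken one step further by dropping the combining marks before applying an ordinary ASCII regex. This is a sketch only, and the helper name is hypothetical:

```javascript
// Hypothetical helper: decompose to NFD, then remove combining marks (U+0300-U+036F).
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const plain = stripDiacritics("r\u00E9sum\u00E9");                   // "résumé" in composed form
const startsWithA = /\ba/.test(stripDiacritics("\u00E0 la carte")); // "à la carte"
const ligature = stripDiacritics("\u0153uvre");                      // œ has no canonical decomposition

console.log(plain);                      // resume
console.log(startsWithA);                // true
console.log(ligature === "\u0153uvre"); // true: the ligature is left untouched
```

The trade-off is that the accent information is lost, so this cannot tell "à" from "a". Letters such as œ or ø, which do not decompose, still fall outside \w and \b.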

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | nickf | View Question on Stackoverflow
Solution 1 - Javascript | Wak | View Answer on Stackoverflow
Solution 2 - Javascript | Riimu | View Answer on Stackoverflow
Solution 3 - Javascript | stema | View Answer on Stackoverflow
Solution 4 - Javascript | Craig1123 | View Answer on Stackoverflow
Solution 5 - Javascript | Amy Shackles | View Answer on Stackoverflow
Solution 6 - Javascript | virtuoso | View Answer on Stackoverflow