How does Chrome decide what to highlight when you double-click Japanese text?

JavascriptGoogle ChromeCjk

Javascript Problem Overview


If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:

> どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。

For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.

How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.

Javascript Solutions


Solution 1 - Javascript

So it turns out v8 has a non-standard multi-language word segmenter and it handles Japanese.

function tokenizeJA(text) {
  var it = Intl.v8BreakIterator(['ja-JP'], {type:'word'})
  it.adoptText(text)
  var words = []
  
  var cur = 0, prev = 0
  
  while (cur < text.length) {
    prev = cur
    cur = it.next()
    words.push(text.substring(prev, cur))
  }
  
  return words
}

console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]

I also made a jsfiddle that shows this.

The quality is not amazing but I'm surprised this is supported at all.

Solution 2 - Javascript

Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."

Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.

And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionpolm23View Question on Stackoverflow
Solution 1 - Javascriptpolm23View Answer on Stackoverflow
Solution 2 - JavascripterjiangView Answer on Stackoverflow