How does Chrome decide what to highlight when you double-click Japanese text?

If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was double-clicking while reading some Japanese text and noticed that the highlight often stopped at word boundaries, even though Japanese isn't written with spaces between words. Here's some example text:

どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。

For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
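
To make the "not a single character class" point concrete, here is a minimal sketch of what a naive split on script changes would do. The helper names `scriptOf` and `splitByScript` are made up for illustration and are not anything Chrome uses; the only assumption is an engine that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical helper: classify one character by its Unicode script.
function scriptOf(ch) {
  if (/\p{Script=Han}/u.test(ch)) return 'kanji';
  if (/\p{Script=Hiragana}/u.test(ch)) return 'hiragana';
  if (/\p{Script=Katakana}/u.test(ch)) return 'katakana';
  return 'other';
}

// Hypothetical helper: start a new run every time the script changes.
function splitByScript(text) {
  const runs = [];
  let prevScript = null;
  for (const ch of text) {
    const script = scriptOf(ch);
    if (script === prevScript) {
      runs[runs.length - 1] += ch; // same script: extend the current run
    } else {
      runs.push(ch);               // script changed: begin a new run
    }
    prevScript = script;
  }
  return runs;
}

console.log(splitByScript('薄暗い'));
// ["薄暗", "い"]
```

A split like this breaks 薄暗い after the kanji, so whatever Chrome is doing has to be something more than grouping by character class.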

How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.

Answers 1

  • It turns out V8 has a non-standard multi-language word segmenter, Intl.v8BreakIterator, and it handles Japanese.

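    function tokenizeJA(text) {
      // Intl.v8BreakIterator is non-standard and only exists in V8-based browsers.
      var it = new Intl.v8BreakIterator(['ja-JP'], {type:'word'})
      it.adoptText(text)
      var words = []

      // first() returns the starting boundary (0); next() returns each following one.
      var prev = it.first(), cur = it.next()

      // next() returns -1 once there are no more boundaries.
      while (cur !== -1) {
        words.push(text.substring(prev, cur))
        prev = cur
        cur = it.next()
      }

      return words
    }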
    
    console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
    // ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]
    

    I also made a jsfiddle that shows this.

    The quality is not amazing but I'm surprised this is supported at all.
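
For comparison, a rough sketch of the same tokenizer written against the standardized Intl.Segmenter API instead of the V8-only iterator. This assumes a runtime that ships Intl.Segmenter (recent Chrome or Node.js); the function name `segmentWords` is made up for this example, and the exact segments depend on the ICU data bundled with the runtime, so no output is shown.

```javascript
// Hypothetical helper: split text into word-level segments with Intl.Segmenter.
function segmentWords(text, locale = 'ja-JP') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is not available in this runtime');
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // segment() returns an iterable of { segment, index, input, isWordLike };
  // keep every segment (punctuation included), like tokenizeJA above.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

console.log(segmentWords('何でも薄暗いじめじめした所で'));
```

Each segment also carries an `isWordLike` flag, so punctuation such as 。 can be filtered out if only words are wanted.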

