How does Chrome decide what to highlight when you double-click Japanese text?

If you double-click English text in Chrome, the whitespace-delimited word under the cursor is highlighted. This is not surprising. However, the other day I was double-clicking while reading some Japanese text and noticed that some words were highlighted along word boundaries, even though Japanese doesn't put spaces between words. Here's some example text:

どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。

For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
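To make the "not a single character class" point concrete, here is a minimal sketch of a naive splitter that breaks text wherever the script changes. The names `scriptClass` and `splitByScript` are made up for illustration, and this is not a claim about how Chrome works; it only shows that script runs alone would cut 薄暗い in two.

```javascript
// Naive baseline (hypothetical helper names): split wherever the script class changes.
// This is NOT Chrome's algorithm; it only shows why script runs alone fall short.
function scriptClass(ch) {
  if (/\p{Script=Han}/u.test(ch)) return 'kanji'
  if (/\p{Script=Hiragana}/u.test(ch)) return 'hiragana'
  if (/\p{Script=Katakana}/u.test(ch)) return 'katakana'
  return 'other'
}

function splitByScript(text) {
  var runs = []
  var prevClass = null
  for (var ch of text) {
    var cls = scriptClass(ch)
    if (cls === prevClass) {
      // Same script as the previous character: extend the current run
      runs[runs.length - 1] += ch
    } else {
      // Script changed: start a new run
      runs.push(ch)
      prevClass = cls
    }
  }
  return runs
}

console.log(splitByScript('薄暗い'))
// ["薄暗", "い"]
```

Since Chrome selects 薄暗い as one unit, it must be using something smarter than script boundaries, presumably some kind of dictionary.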

How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.

Answers 1

  • So it turns out V8 has a non-standard, multi-language word segmenter, and it handles Japanese.

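    function tokenizeJA(text) {
      // Intl.v8BreakIterator is a non-standard, V8-only API
      var it = Intl.v8BreakIterator(['ja-JP'], {type:'word'})
      it.adoptText(text)
      var words = []
      var cur = 0, prev = 0
      while (cur < text.length) {
        prev = cur
        cur = it.next()  // offset of the next word boundary
        words.push(text.substring(prev, cur))
      }
      return words
    }

    console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
    // ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]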

    I also made a jsfiddle that shows this.

    The quality is not amazing but I'm surprised this is supported at all.
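
For runtimes that ship the standard `Intl.Segmenter` API (recent Chrome and Node.js builds with ICU data), roughly the same thing can be sketched without the V8-only iterator. The helper name `tokenizeWithSegmenter` is made up here, and I have not checked that it produces exactly the same boundaries as the output above, so treat it as a starting point rather than a drop-in replacement.

```javascript
// Sketch using the standard Intl.Segmenter API.
// Assumption: the runtime supports Intl.Segmenter and bundles ICU data for Japanese.
function tokenizeWithSegmenter(text, locale) {
  var segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  var words = []
  // segment() returns an iterable of { segment, index, isWordLike, ... } records
  for (var part of segmenter.segment(text)) {
    words.push(part.segment)
  }
  return words
}

var sample = 'どこで生れたかとんと見当がつかぬ。'
console.log(tokenizeWithSegmenter(sample, 'ja-JP'))
```

The segments always concatenate back to the input, so the exact boundaries can be compared against what double-clicking selects in the browser.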
