Commit 817c988

Tweak example of using a Chinese word segmenter to one where the segmenter segments in a way that seems more correct (or at least more self-consistent) to a native Chinese speaker
See discussion at #539 (comment) for explanation
Parent: a5cf0db

File tree: 1 file changed (+5, −7)

test/diff/word.js

Lines changed: 5 additions & 7 deletions
@@ -240,18 +240,16 @@ describe('WordDiff', function() {
 
   it('supports tokenizing with an Intl.Segmenter', () => {
     // Example 1: Diffing Chinese text with no spaces.
-    // I am not a Chinese speaker but I believe these sentences to mean:
-    // 1. "I have (我有) many (很多) tables (桌子)"
-    // 2. "Mei (梅) has (有) many (很多) sons (儿子)"
+    // a. "He (他) has (有) many (很多) tables (桌子)"
+    // b. "Mei (梅) has (有) many (很多) sons (儿子)"
     // We want to see that diffWords will get the word counts right and won't try to treat the
     // trailing 子 as common to both texts (since it's part of a different word each time).
-    // TODO: Check with a Chinese speaker that this example is correct Chinese.
     const chineseSegmenter = new Intl.Segmenter('zh', {granularity: 'word'});
-    const diffResult = diffWords('我有很多桌子。', '梅有很多儿子。', {intlSegmenter: chineseSegmenter});
+    const diffResult = diffWords('他有很多桌子。', '梅有很多儿子。', {intlSegmenter: chineseSegmenter});
     expect(diffResult).to.deep.equal([
-      { count: 1, added: false, removed: true, value: '我有' },
-      { count: 2, added: true, removed: false, value: '梅有' },
-      { count: 1, added: false, removed: false, value: '很多' },
+      { count: 1, added: false, removed: true, value: '他' },
+      { count: 1, added: true, removed: false, value: '梅' },
+      { count: 2, added: false, removed: false, value: '有很多' },
       { count: 1, added: false, removed: true, value: '桌子' },
       { count: 1, added: true, removed: false, value: '儿子' },
       { count: 1, added: false, removed: false, value: '。' }
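The updated expectations encode a particular tokenization of the two strings (他 | 有 | 很多 | 桌子 | 。 and 梅 | 有 | 很多 | 儿子 | 。), which you can inspect directly with `Intl.Segmenter`. A minimal sketch — the `words` helper is illustrative and not part of jsdiff, and exact word boundaries depend on the ICU data shipped with your JS engine, so output may differ between environments:

```javascript
// Inspect how an Intl.Segmenter with word granularity tokenizes the
// two strings used in the test. Boundaries come from the engine's ICU
// data and may vary across engines/versions.
const segmenter = new Intl.Segmenter('zh', {granularity: 'word'});

// Illustrative helper: collect the segment strings for a piece of text.
function words(text) {
  return Array.from(segmenter.segment(text), s => s.segment);
}

console.log(words('他有很多桌子。')); // segmentation assumed by the test: 他 | 有 | 很多 | 桌子 | 。
console.log(words('梅有很多儿子。')); // segmentation assumed by the test: 梅 | 有 | 很多 | 儿子 | 。
```

If both sentences segment as pronoun/name + 有 + 很多 + noun, diffWords can report 有很多 as common and treat 桌子/儿子 as whole-word changes rather than sharing the trailing 子, which is what the new expectations assert.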
