
Tweak example of using a Chinese word segmenter to one where the segmenter segments in a way that seems more correct (or at least more self-consistent) to a native Chinese speaker #613


Merged
merged 1 commit into master from better-chinese on May 19, 2025

Conversation

ExplodingCabbage
Collaborator

See discussion at #539 (comment) for explanation

(Thanks @fisker!)

…enter segments in a way that seems more correct (or at least more self-consistent) to a native Chinese speaker

See discussion at #539 (comment) for explanation
@ExplodingCabbage ExplodingCabbage merged commit 0365f4b into master May 19, 2025
@ExplodingCabbage ExplodingCabbage deleted the better-chinese branch May 19, 2025 09:31
ExplodingCabbage added a commit that referenced this pull request May 22, 2025
Somehow I screwed up #613 and merged it with the tests failing, and with the sentence actually used in the test inconsistent with the one I claimed to be using in the comment above. Also, even if I'd got it right, I wouldn't have avoided the inconsistency in Intl.Segmenter's tokenization rules that that PR was specifically trying to avoid, because the segmenter considers 他有 ("he has") to be one word; I should've used 她有 ("she has"), which the segmenter sees as two words. This fixes both mistakes.
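A quick sketch (not from the PR itself) of how to inspect the Intl.Segmenter behaviour described above. The exact segmentation of 他有 vs 她有 depends on the engine's ICU dictionary data, so the outputs below aren't asserted — only the mechanics of word-granularity segmentation are shown.

```javascript
// Word-granularity segmenter for Chinese text.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });

// Return the word-like segments of a string as an array.
function words(text) {
  return Array.from(segmenter.segment(text))
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

// Per the commit message, some ICU data treats 他有 as one word
// but 她有 as two; results may vary across engines/ICU versions.
console.log(words('他有'));
console.log(words('她有'));
```

Running this in the engine used by the test suite shows whether the dictionary in question merges the pronoun with 有, which is exactly the inconsistency the follow-up commit works around.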