
Tweak example of using a Chinese word segmenter to one where the segmenter segments in a way that seems more correct (or at least more self-consistent) to a native Chinese speaker #613


Merged
merged 1 commit into master from better-chinese on May 19, 2025

Conversation

ExplodingCabbage
Collaborator

See discussion at #539 (comment) for explanation

(Thanks @fisker!)

…enter segments in a way that seems more correct (or at least more self-consistent) to a native Chinese speaker

See discussion at #539 (comment) for explanation
@ExplodingCabbage ExplodingCabbage merged commit 0365f4b into master May 19, 2025
@ExplodingCabbage ExplodingCabbage deleted the better-chinese branch May 19, 2025 09:31
ExplodingCabbage added a commit that referenced this pull request May 22, 2025
Somehow I screwed up #613 and merged it with the tests failing, and with the sentence actually used in the test inconsistent with the one I claimed to be using in the comment above. Also, even if I'd got it right, I wouldn't have avoided the inconsistency in Intl.Segmenter's tokenization rules that that PR was specifically trying to avoid, because the segmenter considers 他有 ("he has") to be one word; I should've used 她有 ("she has"), which the segmenter sees as two words. This fixes both mistakes.
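A quick sketch (not from the PR itself) of how to inspect the Intl.Segmenter behaviour described above. The exact segmentation of 他有 vs 她有 depends on the engine's ICU dictionary data, so the outputs below aren't asserted — only the mechanics of word-granularity segmentation are shown.

```javascript
// Word-granularity segmenter for Chinese text.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });

// Return the word-like segments of a string as an array.
function words(text) {
  return Array.from(segmenter.segment(text))
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

// Per the commit message, some ICU data treats 他有 as one word
// but 她有 as two; results may vary across engines/ICU versions.
console.log(words('他有'));
console.log(words('她有'));
```

Running this in the engine used by the test suite shows whether the dictionary in question merges the pronoun with 有, which is exactly the inconsistency the follow-up commit works around.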