We need to reorder characters to implement this Chinese collation because
of the CLDR rule [reorder Han]. We divide all Unicode characters into five
parts:
1. The core group (spaces and symbols). We don't change the weight of the
characters in this group. They sort before all other characters as in
the DUCET.
2. 41336 Han characters whose sorting order has been defined by CLDR. These
characters sort after the characters of part 1.
3. All other Han characters. These characters sort after the Han characters
of part 2.
4. Character groups which fall between the core group and the Han group in
the DUCET. We need to give them larger weights than all Han characters,
so that they sort after the characters of part 3.
5. All other characters.
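The five-part split above can be sketched roughly as follows. This is only an
illustration, not the code in this commit: the names (Part, Classify, SortKey,
ToyCldrHanRank) are hypothetical, the three-entry rank table is made up rather
than taken from zh.xml, Han is approximated by the single block U+4E00..U+9FFF,
"core" is approximated by non-letter ASCII, and part 5 is assumed to sort last.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical part numbers, matching the list above (1 sorts first).
enum Part { kCore = 1, kCldrHan = 2, kOtherHan = 3, kMovedScripts = 4, kRest = 5 };

// Made-up excerpt of a CLDR-style ordering: code point -> rank.
// A real table would be generated from zh.xml.
inline const std::map<char32_t, uint32_t> &ToyCldrHanRank() {
  static const std::map<char32_t, uint32_t> ranks = {
      {U'\u4E2D', 0}, {U'\u4E00', 1}, {U'\u5C0F', 2}};
  return ranks;
}

// Simplification: only the CJK Unified Ideographs block counts as Han here.
inline bool IsToyHan(char32_t c) { return c >= 0x4E00 && c <= 0x9FFF; }

// Simplification: ASCII letters stand in for "groups between core and Han".
inline bool IsToyMovedScript(char32_t c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Simplification: every other ASCII character stands in for the core group.
inline bool IsToyCore(char32_t c) { return c < 0x80 && !IsToyMovedScript(c); }

inline Part Classify(char32_t c) {
  if (IsToyCore(c)) return kCore;
  if (ToyCldrHanRank().count(c)) return kCldrHan;
  if (IsToyHan(c)) return kOtherHan;
  if (IsToyMovedScript(c)) return kMovedScripts;
  return kRest;  // Assumption: part 5 sorts after everything else.
}

// Sort key: the part decides first, then the position inside the part
// (the CLDR rank for part 2, the code point value otherwise).
inline std::pair<int, uint32_t> SortKey(char32_t c) {
  const Part p = Classify(c);
  const uint32_t within =
      (p == kCldrHan) ? ToyCldrHanRank().at(c) : static_cast<uint32_t>(c);
  return {static_cast<int>(p), within};
}
```

Comparing two SortKey() values with operator< then gives the intended order:
core characters first, CLDR-ordered Han next, remaining Han after that, and the
moved scripts only after all Han characters.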
Both CLDR v29 and v30 are incomplete and are missing some very common Han
characters (like “small”). Thus we will use the zh.xml file from CLDR v33
to implement this collation.
Changed uca9-dump.cc so that uca9dump can generate the weight table files for
the Chinese and Japanese languages at build time.
Added a regression test for the Chinese collation.
Benchmark results compared with the Japanese collation:
BM_Chinese_AS_CS     18162 ns/iter    25.20 MB/sec
BM_Japanese_AS_CS    21975 ns/iter    14.06 MB/sec
Benchmark results showing its effect on other collations:
BM_SimpleUTF8MB4          2199 ->  2157 ns/iter [+ 1.95%]
BM_MixedUTF8MB4           1703 ->  1707 ns/iter [- 0.23%]
BM_MixedUTF8MB4_AS_CI     3523 ->  3409 ns/iter [+ 3.34%]
BM_MixedUTF8MB4_AS_CS     5065 ->  5049 ns/iter [+ 0.32%]
BM_JapaneseUTF8MB4        3659 ->  3693 ns/iter [- 0.92%]
BM_Hungarian_AS_CS       36518 -> 37603 ns/iter [- 2.89%]
BM_Japanese_AS_CS        21684 -> 21880 ns/iter [- 0.90%]
BM_Japanese_AS_CS_KS     29542 -> 29622 ns/iter [- 0.27%]
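The bracketed percentages appear to be consistent with a throughput change
relative to the new timing, where a positive value means the new build is
faster. A minimal sketch of that arithmetic (SpeedChangePercent is a
hypothetical helper, not part of the benchmark code):

```cpp
#include <cassert>
#include <cmath>

// Relative speed change in percent, computed as old/new - 1.
// Positive: the new run takes less time per iteration.
inline double SpeedChangePercent(double old_ns, double new_ns) {
  return (old_ns / new_ns - 1.0) * 100.0;
}
```

For example, 2199 -> 2157 ns/iter gives roughly +1.95, and 36518 -> 37603
ns/iter gives roughly -2.89.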
Change-Id: I70c3bd971c4d45ca255b8cd3406535e953e60d56