Commit 5c7e9a9

Merge pull request #146 from elasticsearch-cn/revert-144-chapter/chapter24_part6
Revert "chapter24_part6: /270_Fuzzy_matching/60_Phonetic_matching.asciidoc"
2 parents ef1fc4c + 8a7efca commit 5c7e9a9

1 file changed

+52
-40
lines changed

270_Fuzzy_matching/60_Phonetic_matching.asciidoc

Lines changed: 52 additions & 40 deletions
@@ -1,28 +1,35 @@
11
[[phonetic-matching]]
2-
=== 语音匹配
3-
4-
最后,在尝试任何其他匹配方法都无效后,我们可以求助于搜索发音相似的词,即使他们的拼写不同。
5-
6-
7-
存在一些将词转换成语音标识的算法。
8-
((("phonetic algorithms"))) http://en.wikipedia.org/wiki/Soundex[Soundex] 算法是这些算法的鼻祖,
9-
而且大多数语音算法是 Soundex 的改进或者专业版本,例如 http://en.wikipedia.org/wiki/Metaphone[Metaphone]
10-
和 http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone] (扩展了除英语以外的其他语言的语音匹配),
11-
http://en.wikipedia.org/wiki/Caverphone[Caverphone] 算法匹配了新西兰的名称,
12-
https://en.wikipedia.org/wiki/Daitch–Mokotoff_Soundex#Beider.E2.80.93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] 算法吸收了 Soundex 算法为了更好的匹配德语和依地语名称,
13-
http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] 为了更好的处理德语词汇。
14-
15-
16-
值得一提的是,语音算法是相当简陋的,((("languages", "phonetic algorithms")))他们设计初衷针对的语言通常是英语或德语。这限制了他们的实用性。
17-
不过,为了某些明确的目标,并与其他技术相结合,语音匹配能够作为一个有用的工具。
18-
19-
20-
首先,你将需要从
21-
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html 获取在集群的每个节点安装((("Phonetic Analysis plugin")))语言分析器插件,
22-
并且重启每个节点。
23-
24-
25-
然后,您可以创建一个使用语音语汇单元过滤器的自定义分析器,并尝试下面的方法:
2+
=== Phonetic Matching
3+
4+
In a last, desperate, attempt to match something, anything, we could resort to
5+
searching for words that sound similar, ((("typoes and misspellings", "phonetic matching")))((("phonetic matching")))even if their spelling differs.
6+
7+
Several algorithms exist for converting words into a phonetic
8+
representation.((("phonetic algorithms"))) The http://en.wikipedia.org/wiki/Soundex[Soundex] algorithm is
9+
the granddaddy of them all, and most other phonetic algorithms are
10+
improvements or specializations of Soundex, such as
11+
http://en.wikipedia.org/wiki/Metaphone[Metaphone] and
12+
http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone]
13+
(which expands phonetic matching to languages other than English),
14+
http://en.wikipedia.org/wiki/Caverphone[Caverphone] for matching names in New
15+
Zealand, the
16+
https://en.wikipedia.org/wiki/Daitch–Mokotoff_Soundex#Beider.E2.80.93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] algorithm, which adopts the Soundex algorithm
17+
for better matching of German and Yiddish names, and the
18+
http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] for better
19+
handling of German words.
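
To make the idea of a phonetic representation concrete, here is a minimal sketch of classic American Soundex in Python. It is an illustration only: the function name `soundex` and the table `CODES` are assumptions made for this example, and it is not the code used by the Phonetic Analysis plug-in, which relies on its own encoders such as `double_metaphone`.

```python
# Illustrative sketch of American Soundex (hypothetical helper, not plug-in code).
# Consonants map to digits; vowels and y carry no code but separate repeats;
# h and w are skipped without separating repeats.
CODES = {}
for digit, group in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                     "4": "l", "5": "mn", "6": "r"}.items():
    for ch in group:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key for word, or '' if it has no a-z letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")      # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h/w do not break a run of identical codes
        code = CODES.get(ch, "")     # vowels and y yield "" and reset the run
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]     # pad with zeros, truncate to 4 characters


if __name__ == "__main__":
    for name in ["Smith", "Smythe", "John", "Jon", "Johnnie", "Jonathon"]:
        print(name, soundex(name))
```

Under this sketch `Smith` and `Smythe` share one key, as do `John`, `Jon`, and `Johnnie`, while `Jonathon` gets a different key. The keys themselves differ from Double Metaphone tokens, but the grouping behavior is the same kind of coarse matching.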
20+
21+
The thing to take away from this list is that phonetic algorithms are fairly
22+
crude, and ((("languages", "phonetic algorithms")))very specific to the languages they were designed for, usually
23+
either English or German. This limits their usefulness. Still, for certain
24+
purposes, and in combination with other techniques, phonetic matching can be a
25+
useful tool.
26+
27+
First, you will need to install ((("Phonetic Analysis plugin")))the Phonetic Analysis plug-in from
28+
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html on every node
29+
in the cluster, and restart each node.
30+
31+
Then, you can create a custom analyzer that uses one of the
32+
phonetic token filters ((("phonetic matching", "creating a phonetic analyzer")))and try it out:
2633

2734
[source,json]
2835
-----------------------------------
@@ -46,25 +53,26 @@ PUT /my_index
4653
}
4754
}
4855
-----------------------------------
49-
<1> 首先,配置一个自定义 `phonetic` 语汇单元过滤器并使用 `double_metaphone` 编码器。
50-
<2> 然后在自定义分析器中使用自定义语汇单元过滤器。
56+
<1> First, configure a custom `phonetic` token filter that uses the
57+
`double_metaphone` encoder.
58+
<2> Then use the custom token filter in a custom analyzer.
5159

60+
Now we can test it with the `analyze` API:
5261

53-
现在我们可以通过 `analyze` API 来进行测试:
5462

5563
[source,json]
5664
-----------------------------------
5765
GET /my_index/_analyze?analyzer=dbl_metaphone
5866
Smith Smythe
5967
-----------------------------------
6068

69+
Each of `Smith` and `Smythe` produces two tokens in the same position: `SM0`
70+
and `XMT`. Running `John`, `Jon`, and `Johnnie` through the analyzer will all
71+
produce the two tokens `JN` and `AN`, while `Jonathon` results in the tokens
72+
`JN0N` and `ANTN`.
6173

62-
每个 `Smith` 和 `Smythe` 在同一位置产生两个语汇单元: `SM0` 和 `XMT` 。
63-
通过分析器播放 `John` , `Jon` 和 `Johnnie` 将产生两个语汇单元 `JN` 和 `AN` ,而 `Jonathon` 产生语汇单元 `JN0N` 和 `ANTN` 。
64-
65-
66-
语音分析器可以像任何其他分析器一样使用。 首先映射一个字段来使用它,然后索引一些数据:
67-
74+
The phonetic analyzer can be used just like any other analyzer. First map a
75+
field to use it, and then index some data:
6876

6977
[source,json]
7078
-----------------------------------
@@ -93,10 +101,9 @@ PUT /my_index/my_type/2
93101
"name": "Jonnie Smythe"
94102
}
95103
-----------------------------------
96-
<1> `name.phonetic` 字段使用自定义 `dbl_metaphone` 分析器。
97-
104+
<1> The `name.phonetic` field uses the custom `dbl_metaphone` analyzer.
98105

99-
可以使用 `match` 查询来进行搜索:
106+
The `match` query can be used for searching:
100107

101108
[source,json]
102109
-----------------------------------
@@ -113,10 +120,15 @@ GET /my_index/my_type/_search
113120
}
114121
-----------------------------------
115122

123+
This query returns both documents, demonstrating just how coarse phonetic
124+
matching is. ((("phonetic matching", "purpose of"))) Scoring with a phonetic algorithm is pretty much worthless. The
125+
purpose of phonetic matching is not to increase precision, but to increase
126+
recall--to spread the net wide enough to catch any documents that might
127+
possibly match.((("recall", "increasing with phonetic matching")))
128+
129+
It usually makes more sense to use phonetic algorithms when retrieving results
130+
which will be consumed and post-processed by another computer, rather than by
131+
human users.
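
One possible shape for such post-processing is sketched below in Python, assuming the phonetic query has already returned a list of candidate names. The function `rerank`, the sample query, and the candidate list are all hypothetical, and string similarity from the standard library's `difflib` is just one re-ranking choice among many.

```python
# Hypothetical post-processing step: re-rank phonetic matches by string similarity.
from difflib import SequenceMatcher
from typing import List


def rerank(query: str, candidates: List[str]) -> List[str]:
    """Order candidates from most to least similar to query (case-insensitive)."""
    def similarity(name: str) -> float:
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()
    return sorted(candidates, key=similarity, reverse=True)


if __name__ == "__main__":
    # Candidate names are made up for illustration; in practice they would
    # come from the hits of a phonetic `match` query.
    hits = ["John Smith", "Jonnie Smythe"]
    print(rerank("Jonny Smythe", hits))
```

The phonetic match supplies recall by casting a wide net, and the second pass restores some precision by putting the closest spellings first.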
116132

117-
这个查询返回全部两个文档,演示了如何进行简陋的语音匹配。
118-
((("phonetic matching", "purpose of"))) 用语音算法计算评分是没有价值的。
119-
语音匹配的目的不是为了提高精度,而是要提高召回率--以扩展足够的范围来捕获可能匹配的文档。
120133

121134

122-
通常是更有意义的使用语音算法是在检索到结果后,由另一台计算机进行消费和后续处理,而不是由人类用户直接使用。