Commit f8e3094

Merge pull request #142 from fanyer/chapter/chapter8_part3
chapter8_part3: /056_Sorting/90_What_is_relevance.asciidoc
2 parents 50173e8 + e29b6cb commit f8e3094

1 file changed

+87
-92
lines changed

@@ -1,65 +1,59 @@
1-
[[relevance-intro]]
2-
=== What Is Relevance?
1+
[[relevance-intro]]
2+
=== What Is Relevance?
33

4-
We've mentioned that, by default, results are returned in descending order of
5-
relevance.((("relevance", "defined"))) But what is relevance? How is it calculated?
64

7-
The relevance score of each document is represented by a positive floating-point number called the `_score`.((("score", "calculation of"))) The higher the `_score`, the more relevant
8-
the document.
95

10-
A query clause generates a `_score` for each document. How that score is
11-
calculated depends on the type of query clause.((("fuzzy queries", "calculation of relevence score"))) Different query clauses are
12-
used for different purposes: a `fuzzy` query might determine the `_score` by
13-
calculating how similar the spelling of the found word is to the original
14-
search term; a `terms` query would incorporate the percentage of terms that
15-
were found. However, what we usually mean by _relevance_ is the algorithm that we
16-
use to calculate how similar the contents of a full-text field are to a full-text query string.
6+
We mentioned earlier that, by default, results are returned in descending order of relevance.((("relevance", "defined"))) But what is relevance, and how is it calculated?
7+
178

18-
The standard _similarity algorithm_ used in Elasticsearch is((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)"))) known as _term
19-
frequency/inverse document frequency_, or _TF/IDF_, which takes the following
20-
factors into((("inverse document frequency"))) account:
219

22-
Term frequency::
2310

24-
How often does the term appear in the field? The more often, the more
25-
relevant. A field containing five mentions of the same term is more likely
26-
to be relevant than a field containing just one mention.
11+
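Every document has a relevance score, represented by a positive floating-point number called the `_score`. The higher the `_score`, the more relevant the document.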
2712
28-
Inverse document frequency::
2913
30-
How often does each term appear in the index? The more often, the _less_
31-
relevant. Terms that appear in many documents have a lower _weight_ than
32-
more-uncommon terms.
3314
34-
Field-length norm::
15+
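A query clause generates a `_score` for each document. How that score is calculated depends on the type of query clause, because different query clauses serve different purposes.((("fuzzy queries", "calculation of relevence score"))) A `fuzzy`
16+
query might determine the `_score` by calculating how similar the spelling of the found word is to the original search term; a `terms` query would incorporate the percentage of terms that were found. However, what we usually mean by _relevance_ is the algorithm used to calculate how similar the contents of a full-text field are to a full-text query string.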
3517
36-
How long is the field? The longer it is, the less likely it is that words in
37-
the field will be relevant. A term appearing in a short `title` field
38-
carries more weight than the same term appearing in a long `content` field.
3918
40-
Individual ((("field-length norm")))queries may combine the TF/IDF score with other factors
41-
such as the term proximity in phrase queries, or term similarity in
42-
fuzzy queries.
19+
The standard _similarity algorithm_ used in Elasticsearch is known as _term frequency/inverse document frequency_, or _TF/IDF_,((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)"))) which takes the following factors into((("inverse document frequency"))) account:
4320
44-
Relevance is not just about full-text search, though. It can equally be applied
45-
to yes/no clauses, where the more clauses that match, the higher the
46-
`_score`.
4721
48-
When multiple query clauses are combined using a compound query((("compound query clauses", "relevance score for results"))) like the
49-
`bool` query, the `_score` from each of these query clauses is combined to
50-
calculate the overall `_score` for the document.
22+
Term frequency::
23+
24+
25+
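How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.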
26+
27+
Inverse document frequency::
28+
29+
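How often does each term appear in the index? The more often, the _less_ relevant. Terms that appear in many documents have a lower _weight_ than more-uncommon terms. In other words, this factor measures how distinctive a term is across the documents in the index.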
30+
31+
Field-length norm::
32+
33+
34+
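How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short `title` field carries more weight than the same term appearing in a long `content` field.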
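The three factors above can be sketched numerically. The snippet below is an illustration only: the formulas are the ones commonly documented for Lucene's classic TF/IDF similarity, which this section does not spell out, so treat them as an assumption. The function names are hypothetical, and a real `_score` includes further factors.

```python
import math

# Assumed formulas (Lucene classic TF/IDF similarity); not stated in this section.

def term_frequency(freq):
    """More occurrences of the term in the field give a higher value."""
    return math.sqrt(freq)

def inverse_document_frequency(num_docs, doc_freq):
    """Terms that appear in many documents give a lower value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_length_norm(num_terms):
    """Longer fields give a lower value."""
    return 1.0 / math.sqrt(num_terms)

def term_weight(freq, num_docs, doc_freq, num_terms):
    """Combine the three factors into a single weight for one term in one field."""
    return (term_frequency(freq)
            * inverse_document_frequency(num_docs, doc_freq)
            * field_length_norm(num_terms))

# A rare term in a short field outweighs a common term in a long field.
rare_short = term_weight(freq=1, num_docs=1000, doc_freq=10, num_terms=4)
common_long = term_weight(freq=1, num_docs=1000, doc_freq=500, num_terms=100)
print(rare_short > common_long)
```

The numbers passed in (1000 documents, field lengths of 4 and 100 terms) are made up for the comparison; only the direction of each effect is taken from the text.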
35+
36+
37+
Individual ((("field-length norm")))queries may combine the TF/IDF score with other factors, such as term proximity in phrase queries or term similarity in fuzzy queries.
38+
39+
40+
41+
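Relevance is not just about full-text search, though. It applies equally to yes/no clauses, where the more clauses that match, the higher the `_score`.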
42+
43+
44+
45+
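When multiple query clauses are combined using a compound query((("compound query clauses", "relevance score for results"))) such as the `bool` query, the `_score` from each of these query clauses is combined to calculate the overall `_score` for the document.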
46+
47+
48+
TIP: We have a whole chapter dedicated to relevance calculations and how to bend them to your will: <<controlling-relevance>>.
5149
52-
TIP: We have a whole chapter dedicated to relevance calculations and how to
53-
bend them to your will: <<controlling-relevance>>.
5450
5551
[[explain]]
56-
==== Understanding the Score
52+
==== Understanding the Score
53+
5754
58-
When debugging a complex query,((("score", "calculation of")))((("relevance scores", "understanding"))) it can be difficult to understand
59-
exactly how a `_score` has been calculated. Elasticsearch
60-
has the option of producing an _explanation_ with every search result,
61-
by setting the `explain` parameter((("explain parameter"))) to `true`.
6255
56+
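When debugging a complex query, it can be difficult to understand exactly how a `_score` has been calculated. Elasticsearch can produce an _explanation_ with every search result when the `explain` parameter is set to `true`.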
6357

6458
[source,js]
6559
--------------------------------------------------
@@ -69,18 +63,19 @@ GET /_search?explain <1>
6963
}
7064
--------------------------------------------------
7165
// SENSE: 056_Sorting/90_Explain.json
72-
<1> The `explain` parameter adds an explanation of how the `_score` was
73-
calculated to every result.
66+
<1> The `explain` parameter adds an explanation of how the `_score` was calculated to every result.
7467

7568
[NOTE]
7669
====
77-
Adding `explain` produces a lot((("explain parameter", "for relevance score calculation"))) of output for every hit, which can look
78-
overwhelming, but it is worth taking the time to understand what it all means.
79-
Don't worry if it doesn't all make sense now; you can refer to this section
80-
when you need it. We'll work through the output for one `hit` bit by bit.
70+
71+
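Adding `explain` produces a lot of output for every hit, which can look overwhelming, but it is worth taking the time to understand what it all means. Don't worry if it doesn't all make sense now; you can refer to this section when you need it. We'll work through the output for one `hit` bit by bit.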
72+
73+
8174
====
8275

83-
First, we have the metadata that is returned on normal search requests:
76+
77+
First, we have the metadata that is returned on normal search requests:
78+
8479

8580
[source,js]
8681
--------------------------------------------------
@@ -92,20 +87,21 @@ First, we have the metadata that is returned on normal search requests:
9287
"_source" : { ... trimmed ... },
9388
--------------------------------------------------
9489

95-
It adds information about the shard and the node that the document came from,
96-
which is useful to know because term and document frequencies are calculated
97-
per shard, rather than per index:
90+
91+
92+
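It adds information about the shard and the node that the document came from, which is useful to know because term and document frequencies are calculated per shard, rather than per index: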
93+
9894

9995
[source,js]
10096
--------------------------------------------------
10197
"_shard" : 1,
10298
"_node" : "mzIVYCsqSWCG_M_ZffSs9Q",
10399
--------------------------------------------------
104100

105-
Then it provides the `_explanation`. Each ((("explanation of relevance score calculation")))((("description", "of relevance score calculations")))entry contains a `description`
106-
that tells you what type of calculation is being performed, a `value`
107-
that gives you the result of the calculation, and the `details` of any
108-
subcalculations that were required:
101+
102+
103+
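Then it provides the `_explanation`. Each ((("explanation of relevance score calculation")))((("description", "of relevance score calculations")))entry contains a `description` that tells you what type of calculation is being performed, a `value` that gives you the result of the calculation, and the `details` of any subcalculations that were required: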
104+
109105

110106
[source,js]
111107
--------------------------------------------------
@@ -141,55 +137,54 @@ subcalculations that were required:
141137
]
142138
}
143139
--------------------------------------------------
144-
<1> Summary of the score calculation for `honeymoon`
145-
<2> Term frequency
146-
<3> Inverse document frequency
147-
<4> Field-length norm
140+
<1> Summary of the score calculation for `honeymoon`
141+
<2> Term frequency
142+
<3> Inverse document frequency
143+
<4> Field-length norm
144+
145+
WARNING: Producing the `explain` output is expensive.((("explain parameter", "overhead of using"))) It is a debugging tool only. Don't leave it turned on in production.
146+
147+
148+
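The first part is the summary of the calculation. It tells us that it has calculated the _weight_ (the ((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))TF/IDF) of the term `honeymoon` in the field `tweet`, for document `0`. (This is an internal document ID and, for our purposes, can be ignored.)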
148149

149-
WARNING: Producing the `explain` output is expensive.((("explain parameter", "overhead of using"))) It is a debugging tool
150-
only. Don't leave it turned on in production.
151150

152-
The first part is the summary of the calculation. It tells us that it has
153-
calculated the _weight_&#x2014;the ((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))TF/IDF--of the term `honeymoon` in the field `tweet`, for document `0`. (This is
154-
an internal document ID and, for our purposes, can be ignored.)
151+
It then provides details((("field-length norm")))((("inverse document frequency"))) of how the weight was calculated:
155152

156-
It then provides details((("field-length norm")))((("inverse document frequency"))) of how the weight was calculated:
157153

158-
Term frequency::
154+
Term frequency::
159155

160-
How many times did the term `honeymoon` appear in the `tweet` field in
161-
this document?
156+
How many times did the term `honeymoon` appear in the `tweet` field in this document?
162157

163-
Inverse document frequency::
158+
Inverse document frequency::
164159

165-
How many times did the term `honeymoon` appear in the `tweet` field
166-
of all documents in the index?
160+
How many times did the term `honeymoon` appear in the `tweet` field of all documents in the index?
161+
162+
163+
Field-length norm::
164+
165+
How long is the `tweet` field in this document? The longer the field, the smaller this number.
166+
167+
168+
169+
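Explanations for more-complicated queries can appear to be very complex, but really they just contain more of the same calculations that appear in the preceding example. This information can be invaluable for debugging why search results appear in the order that they do.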
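Because every entry in an explanation has the same shape (a `description`, a `value`, and optional `details`), even a deeply nested explanation can be read with one small recursive walk. The following is a minimal sketch in Python; the helper name `flatten_explanation` and the sample values are hypothetical and are not taken from a real response.

```python
def flatten_explanation(node, depth=0):
    """Return (depth, value, description) tuples for a node and all of its details."""
    rows = [(depth, node["value"], node["description"])]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows

# Hypothetical explanation fragment; the numbers are invented for illustration.
sample = {
    "description": "weight(tweet:honeymoon in 0)",
    "value": 0.5,
    "details": [
        {"description": "tf", "value": 1.0},
        {"description": "idf", "value": 2.0},
        {"description": "fieldNorm", "value": 0.25},
    ],
}

# Print one indented line per calculation, outermost first.
for depth, value, description in flatten_explanation(sample):
    print("  " * depth + f"{value} {description}")
```

Indenting by depth makes it easier to see which subcalculations feed into which result, which is usually the first thing to check when a document ranks unexpectedly.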
167170

168-
Field-length norm::
169171

170-
How long is the `tweet` field in this document? The longer the field,
171-
the smaller this number.
172172

173-
Explanations for more-complicated queries can appear to be very complex, but
174-
really they just contain more of the same calculations that appear in the
175-
preceding example. This information can be invaluable for debugging why search
176-
results appear in the order that they do.
177173

178174
[TIP]
179175
==================================================================
180-
The output from `explain` can be difficult to read in JSON, but it is easier
181-
when it is formatted as YAML.((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) Just add `format=yaml` to the query string.
176+
The output from `explain` can be difficult to read in JSON, but it is easier when it is formatted as YAML.((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) Just add `format=yaml` to the query string.
182177
==================================================================
183178

184179

185180
[[explain-api]]
186-
==== Understanding Why a Document Matched
181+
==== Understanding Why a Document Matched
182+
183+
184+
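While the `explain` option adds an explanation for every result, you can use the `explain` API to understand why one particular document matched or, more important, why it _didn't_ match.((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched")))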
187185

188-
While the `explain` option adds an explanation for every result, you can use
189-
the `explain` API to understand why one particular document matched or, more
190-
important, why it _didn't_ match.((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched")))
191186

192-
The path for the request is `/index/type/id/_explain`, as in the following:
187+
The path for the request is `/index/type/id/_explain`, as in the following:
193188

194189
[source,js]
195190
--------------------------------------------------
@@ -205,14 +200,14 @@ GET /us/tweet/12/_explain
205200
--------------------------------------------------
206201
// SENSE: 056_Sorting/90_Explain_API.json
207202

208-
Along with the full explanation((("description", "of why a document didn't match"))) that we saw previously, we also now have a
209-
`description` element, which tells us this:
210203

204+
Along with the full explanation that we saw previously, we also now have a `description` element, which tells us this:
211205

212206
[source,js]
213207
--------------------------------------------------
214208
"failure to match filter: cache(user_id:[2 TO 2])"
215209
--------------------------------------------------
216210

217-
In other words, our `user_id` filter clause is preventing the document from
218-
matching.
211+
212+
213+
In other words, our `user_id` filter clause is preventing the document from matching.
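When checking many documents this way, it can help to build the request path programmatically. The sketch below assumes only the `/index/type/id/_explain` path shape and the `format=yaml` query-string option mentioned in this section; the helper name `explain_path` is hypothetical, and the snippet builds a string without contacting a cluster.

```python
from urllib.parse import quote

def explain_path(index, doc_type, doc_id, yaml=False):
    """Build an explain API path of the form /index/type/id/_explain."""
    # Percent-encode each segment so unusual IDs cannot break the path.
    segments = [quote(str(part), safe="") for part in (index, doc_type, doc_id)]
    path = "/" + "/".join(segments) + "/_explain"
    # Optionally request YAML output, which is easier to read than JSON.
    return path + "?format=yaml" if yaml else path

print(explain_path("us", "tweet", 12))
print(explain_path("us", "tweet", 12, yaml=True))
```

The query body sent with the request stays the same as for a normal search; only the path identifies the single document being explained.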
