chapter22_part21:/300_Aggregations/120_breadth_vs_depth.asciidoc #294
Conversation
Please take a look, everyone; corrections for anything inaccurate are welcome.
@@ -1,15 +1,10 @@
=== Preventing Combinatorial Explosions
=== 避免组合爆炸(Preventing Combinatorial Explosions)
Would the following be easier to understand?
优化聚合查询 ("Optimizing Aggregation Queries")
A translator's note could be added at the beginning to help readers understand the concept of a bucket in ES:
A bucket in ES is similar to a group in SQL; one bucket roughly corresponds to one SQL group.
A multi-level nested aggregation is similar to a multi-field GROUP BY in SQL (GROUP BY field1, field2, ...).
Note that the similarity is only conceptual; the underlying implementations are different.
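The SQL analogy in the note could even be made concrete with a short sketch. The field names `field1` and `field2` are placeholders matching the SQL example above; SQL's `GROUP BY field1, field2` roughly corresponds to one `terms` aggregation nested inside another:

```json
{
  "aggs": {
    "group_by_field1": {
      "terms": { "field": "field1" },
      "aggs": {
        "group_by_field2": {
          "terms": { "field": "field2" }
        }
      }
    }
  }
}
```

Each bucket produced by the outer `terms` aggregation gets its own set of inner buckets, which is exactly why the combination of unique values can explode.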
another aggregation, which contains another aggregation, and so forth. The combination of
unique values in each of these aggregations can lead to an explosion in the
number of buckets generated.
`terms` 桶基于我们的数据动态构建桶;它并不知道到底生成了多少桶。((("combinatorial explosions, preventing")))((("aggregations", "preventing combinatorial explosions"))) 尽管这对单个聚合还行,
How about changing 尽管这对单个聚合还行 ("although this is fine for a single aggregation") to:
大多数时候对单个字段的聚合查询还是非常快的 ("most of the time, an aggregation query on a single field is still very fast")?
但考虑当一个聚合包含另外一个聚合,这样一层又一层的时候会发生什么。合并每个聚合的唯一值会导致它随着生成桶的数量而发生爆炸。
但是当需要同时聚合多个字段时,就可能会产生大量的分组,最终结果就是占用es大量内存,从而导致OOM的情况发生。("But when several fields need to be aggregated at once, a huge number of groups can be produced; the end result is that ES consumes a large amount of memory, which can lead to an OOM.")
Imagine we have a modest dataset that represents movies. Each document lists
the actors in that movie:
设想我们有一个表示影片大小适度的数据集合。每个文档都列出了影片的演员:
Suggested rewording:
假设我们现在有一些关于电影的数据集,每条数据里面会有一个数组类型的字段存储表演该电影的所有演员的名字。("Suppose we have a dataset about movies, where each document has an array field storing the names of all the actors in that movie.")
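For readers of the review, the chapter's setup can be sketched as follows. A movie document stores its cast in an array field named `actors`, e.g. `{ "actors": ["Fred Jones", "Mary Jane", "Elizabeth Worthing"] }`, and the problematic query nests one `terms` aggregation on that field inside another (the sizes here follow the chapter's "top 10 actors, 5 co-stars each" example):

```json
{
  "aggs": {
    "actors": {
      "terms": { "field": "actors", "size": 10 },
      "aggs": {
        "costars": {
          "terms": { "field": "actors", "size": 5 }
        }
      }
    }
  }
}
```

Only 50 final values are wanted, yet depth-first collection must first build a bucket for every unique actor and, under each, a bucket for every co-star.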
combinatorial explosion of buckets and drastically reduces memory requirements
for classes of queries that are amenable to breadth-first.
因为我们已经知道了前十名演员,我们可以安全的修剪其他节点。修剪后,下一层是基于 _它的_ 执行模式读入的,重复执行这个过程直到聚合完成,如图 <<breadth-first-4>> 所示。
这就可以避免那种适于使用广度优先策略的查询,因为组合而导致桶的爆炸增长和内存急剧降低的问题。
For the last sentence (这就可以避免那种适于使用广度优先策略的查询,因为组合而导致桶的爆炸增长和内存急剧降低的问题), suggest changing it to:
这种场景下,广度优先可以大幅度节省内存。("In this scenario, breadth-first can save a large amount of memory.")
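For reference, switching a `terms` aggregation to breadth-first collection is a one-parameter change; `collect_mode` is the actual Elasticsearch setting, shown here on the chapter's actors example:

```json
{
  "aggs": {
    "actors": {
      "terms": {
        "field": "actors",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "costars": {
          "terms": { "field": "actors", "size": 5 }
        }
      }
    }
  }
}
```

The top level is now computed and pruned to the top 10 actors before the `costars` sub-aggregation runs at all.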
buckets, but each bucket has relatively few documents--breadth-first is much
more memory efficient, and allows you to build aggregations that would
otherwise fail.
广度优先只有在当桶内的文档比可能生成的桶多时才应该被用到。深度搜索在桶层对文档数据缓存,然后在修剪阶段后的子聚合过程中再次使用这些文档缓存。
广度优先只有在当桶内的文档比可能生成的桶多时才应该被用到。深度搜索在桶层对文档数据缓存,然后在修剪阶段后的子聚合过程中再次使用这些文档缓存。
Would the following be better?
广度优先仅仅适用于每个组的聚合数量远远小于当前总组数的情况下,因为广度优先会在内存中缓存裁剪后的仅仅需要缓存的每个组的所有数据,以便于它的子聚合分组查询可以复用上级聚合的数据。("Breadth-first only suits cases where the amount of data per group is far smaller than the total number of groups, because after pruning it caches in memory the data of only the groups that still need caching, so that sub-aggregations can reuse the parent aggregation's data.")
在修剪之前,广度优先聚合对于内存的需求与每个桶内的文档数量成线性关系。对于很多聚合来说,每个桶内的文档数量是相当大的。
For 在修剪之前,广度优先聚合对于内存的需求与每个桶内的文档数量成线性关系 ("before pruning, breadth-first memory use is linear in the number of documents per bucket"), would the following be better?
广度优先的内存使用情况与裁剪后的缓存分组数据量是成线性的("Breadth-first memory usage is linear in the amount of group data cached after pruning.")
想象一个以月为间隔的直方图:每个桶内可能有数以亿计的文档。这使广度优先不是一个好的选择,这也是为什么深度优先作为默认策略的原因。
For 想象一个以月为间隔的直方图:每个桶内可能有数以亿计的文档。, suggest changing to:
想象一种按月分组的直方图,总组数肯定是固定的,因为每年只有12个月,这个时候每个月下的数据量可能非常大("Imagine a histogram grouped by month: the total number of groups is fixed, since a year only has 12 months, but the volume of data under each month can be very large.")
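The month-bucketed histogram discussed here would look roughly like the following sketch. The `timestamp` field name is an assumption; `interval` was the parameter name in the Elasticsearch versions the book covers, and recent versions call it `calendar_interval`:

```json
{
  "aggs": {
    "by_month": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "month"
      }
    }
  }
}
```

Few buckets, each potentially holding millions of documents: the opposite of the actors case, and a poor fit for breadth-first.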
但对于演员的示例,默认聚合生成大量的桶,但每个桶内的文档相对较少,而广度优先的内存效率更高。如果不是这样,我们构建的聚合要不然就会失败。
但对于演员的示例,默认聚合生成大量的桶,但每个桶内的文档相对较少,而广度优先的内存效率更高。如果不是这样,我们构建的聚合要不然就会失败。
Suggest changing to:
针对上面演员的例子,如果数据量越大,那么默认的使用深度优先的聚合模式生成的总分组数就会非常多,但是预估二级的聚合字段分组后的数据量相比总的分组数会小很多,所以这种情况下使用广度优先的模式能大大节省内存,从而通过优化聚合模式来大大提高了在某些特定场景下聚合查询的成功率。("For the actors example above: the larger the dataset, the more total groups the default depth-first mode generates, while the estimated data volume after grouping on the second-level field is much smaller than the total group count. In this case breadth-first saves a great deal of memory, so tuning the collect mode greatly improves the success rate of aggregation queries in such scenarios.")
LGTM
co-stars, totaling 50 values. To get the final results, we have to generate
that tree of 2,000,000 buckets, sort it, and finally prune it such that only the
top 10 actors are left. This is illustrated in <<depth-first-2>> and <<depth-first-3>>.
No现在,记住,聚合只是简单的希望得到前十位演员和与他们联合出演者,总共 50 条数据。为了得到最终的结果,我们创建了一个有 2,000,000 桶的树,然后对其排序,取 top10。
There is a stray "No" at the beginning.
@@ -1,15 +1,13 @@
[[_preventing_combinatorial_explosions]]
=== 优化聚合查询(Preventing Combinatorial Explosions)
The title is a bit long.
Revised per review comments: 1. removed the English text in parentheses from the title; 2. removed the stray "No".
LGTM
Initial translation.