chapter22_part21:/300_Aggregations/120_breadth_vs_depth.asciidoc #294


Merged
merged 7 commits on Nov 22, 2016
93 changes: 26 additions & 67 deletions 300_Aggregations/120_breadth_vs_depth.asciidoc

=== Preventing Combinatorial Explosions

Would the title be easier to understand if it were rendered as "Optimizing aggregation queries"?

It might also help to add a translator's note at the beginning so readers can place the concept of a bucket: a bucket in Elasticsearch is similar to a group in SQL, so one bucket roughly corresponds to one SQL group, and a multi-level nested aggregation is similar to a multi-field GROUP BY (group by field1, field2, ...). Note that the similarity is only conceptual; the underlying implementations are different.
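
To make the suggested note concrete, here is a minimal sketch (not part of the original chapter) of a single-level `terms` aggregation, which plays roughly the same role as `SELECT actors, count(*) FROM movies GROUP BY actors` in SQL; the index name `movies` and the request layout are only illustrative:

[source,js]
----
GET /movies/_search
{
  "size": 0,                            <1>
  "aggs": {
    "actors": {
      "terms": { "field": "actors" }    <2>
    }
  }
}
----
<1> We only want the aggregation results, not the search hits.
<2> One bucket per unique actor, each with a document count, much like one SQL group.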

The `terms` bucket dynamically builds buckets based on your data; it doesn't
know up front how many buckets will be generated. ((("combinatorial explosions, preventing")))((("aggregations", "preventing combinatorial explosions"))) While this is fine with a
single aggregation, think about what can happen when one aggregation contains
another aggregation, which contains another aggregation, and so forth. The combination of
unique values in each of these aggregations can lead to an explosion in the
number of buckets generated.

How about rendering "While this is fine with a single aggregation" more plainly, for example as "Most of the time, an aggregation query on a single field is still very fast"?

A suggested rendering for the second sentence: when several fields need to be aggregated at the same time, a huge number of groups may be produced, and the end result is that Elasticsearch uses a large amount of memory, which can lead to an OOM.


Imagine we have a modest dataset that represents movies. Each document lists
the actors in that movie:

A suggested rendering for this sentence: suppose we have a data set about movies, where each document has an array field that stores the names of all the actors who appeared in that movie.

[source,js]
----
{ ... }
----
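
The example document itself is collapsed in the diff view above. A minimal sketch of what such a document might look like, assuming the field is called `actors` (as the aggregation names later in this section suggest) and using invented names, is:

[source,js]
----
{
  "title":  "Some Movie",
  "actors": [
    "Alice Example",
    "Bob Example",
    "Carol Example"
  ]
}
----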

If we want to determine the top 10 actors and their top costars, that's trivial
with an aggregation:

Two wording suggestions for the translation of this sentence: "query" may be easier to understand than "determine", and "that's trivial with an aggregation" could be rendered simply as "this is very easy to do with an aggregation".

[source,js]
----
{ ... }
----
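
The aggregation request is also collapsed in the diff view. A sketch of what the nested `terms` aggregation described here could look like, with the index name, field name, and request layout assumed rather than taken from the original snippet, is:

[source,js]
----
GET /movies/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "terms": { "field": "actors", "size": 10 },    <1>
      "aggs": {
        "costars": {
          "terms": { "field": "actors", "size": 5 }  <2>
        }
      }
    }
  }
}
----
<1> The ten actors that appear in the most movies.
<2> For each of those actors, the five costars they appear with most often, taken from the same `actors` field.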

This will return a list of the top 10 actors, and for each actor, a list of their
top five costars. This seems like a very modest aggregation; only 50
values will be returned!

Would it read more smoothly as "This looks like a simple aggregation query that only returns 50 results in the end"?

However, this seemingly ((("aggregations", "fielddata", "datastructure overview")))innocuous query can easily consume a vast amount of
memory. You can visualize a `terms` aggregation as building a tree in memory.
The `actors` aggregation will build the first level of the tree, with a bucket
for every actor. Then, nested under each node in the first level, the
`costars` aggregation will build a second level, with a bucket for every costar, as seen in <<depth-first-1>>. That means that a single movie will generate n^2^ buckets!

One suggestion for the translation of this paragraph: render "innocuous" simply as "simple".

Another comment concerns the markup: change "n^2^" to "n^2".


[[depth-first-1]]
.Build full depth tree
image::images/300_120_depth_first_1.svg["Build full depth tree"]

To use some real numbers, imagine each movie has 10 actors on average. Each movie
will then generate 10^2^ == 100 buckets. If you have 20,000 movies, that's
roughly 2,000,000 generated buckets.

A small wording suggestion for the translation of "To use some real numbers": use "数据" (data) rather than "数字" (numbers).


Now, remember, our aggregation is simply asking for the top 10 actors and their
co-stars, totaling 50 values. To get the final results, we have to generate
that tree of 2,000,000 buckets, sort it, and finally prune it such that only the
top 10 actors are left. This is illustrated in <<depth-first-2>> and <<depth-first-3>>.

Two wording suggestions for the translation: "50 records" rather than "50 values", and "take the top 10" rather than "finally reduce the result to the top 10 actors".

[[depth-first-2]]
.Sort tree
image::images/300_120_depth_first_2.svg["Sort tree"]
.Prune tree
image::images/300_120_depth_first_3.svg["Prune tree"]

At this point you should be quite distraught. Twenty thousand documents is paltry,
and the aggregation is pretty tame. What if you had 200 million documents, wanted
the top 100 actors and their top 20 costars, as well as the costars' costars?

Two suggestions for the translation of this paragraph: render "Twenty thousand documents is paltry, and the aggregation is pretty tame" as something like "Any aggregation query over 20,000 documents runs with no pressure at all", and render "as well as the costars' costars?" as "what would the final result of such a query look like?".


You can appreciate how quickly combinatorial expansion can grow, making this
strategy untenable. There is not enough memory in the world to support uncontrolled
combinatorial explosions.
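
Following the same back-of-envelope reasoning as before, and again assuming about 10 actors per movie, a three-level actors/costars/costars-of-costars aggregation would generate on the order of 10^3^ buckets per movie, so 200 million documents would imply roughly 2 × 10^11^ buckets before any pruning takes place.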

Two suggestions for the translation: "you can guess that the number of groups produced by the aggregation will be extremely large" in place of "you can appreciate how quickly combinatorial expansion can grow", and "uncontrolled aggregation queries" in place of "combinatorial explosions in an uncontrolled state".


==== Depth-First Versus Breadth-First

Elasticsearch allows you to change the _collection mode_ of an aggregation, for
exactly this situation. ((("collection mode"))) ((("aggregations", "preventing combinatorial explosions", "depth-first versus breadth-first")))The strategy we outlined previously--building the tree fully
and then pruning--is called _depth-first_ and it is the default. ((("depth-first collection strategy"))) Depth-first
works well for the majority of aggregations, but can fall apart in situations
like our actors and costars example.

The phrase "it is the default aggregation mode" seems to be missing after "and then prunes away the useless nodes" in the translation.

Member Author: That part is already there; the translated sentence reads "The strategy we showed earlier is called depth-first; it is the default setting: it builds the full tree first and then prunes away the useless nodes."


For these special cases, you should use an alternative collection strategy called
_breadth-first_. ((("breadth-first collection strategy")))This strategy works a little differently. It executes the first
layer of aggregations, and _then_ performs a pruning phase before continuing, as illustrated in <<breadth-first-1>> through <<breadth-first-3>>.

In our example, the `actors` aggregation would be executed first. At this
point, we have a single layer in the tree, but we already know who the top 10
actors are! There is no need to keep the other actors since they won't be in
the top 10 anyway.

[[breadth-first-1]]
.Build first level
image::images/300_120_breadth_first_2.svg["Sort first level"]
.Prune first level
image::images/300_120_breadth_first_3.svg["Prune first level"]

Since we already know the top ten actors, we can safely prune away the rest of the
long tail. After pruning, the next layer is populated based on _its_ execution mode,
and the process repeats until the aggregation is done, as illustrated in <<breadth-first-4>>. This prevents the
combinatorial explosion of buckets and drastically reduces memory requirements
for classes of queries that are amenable to breadth-first.

A suggested rendering for the last sentence of this paragraph: in this scenario, breadth-first can save a great deal of memory.


[[breadth-first-4]]
.Populate full depth for remaining nodes
image::images/300_120_breadth_first_4.svg["Step 4: populate full depth for remaining nodes"]

To use breadth-first, simply ((("collect parameter, enabling breadth-first")))enable it via the `collect` parameter:

[source,js]
----
{ ... }
----
<1> Enable `breadth_first` on a per-aggregation basis.
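
The request body for this snippet is collapsed in the diff view. A sketch of what enabling breadth-first on the earlier actors/costars aggregation could look like (index and field names carried over from the sketches above; `collect_mode` is the request parameter Elasticsearch uses for this setting) is:

[source,js]
----
GET /movies/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "terms": {
        "field":        "actors",
        "size":         10,
        "collect_mode": "breadth_first"    <1>
      },
      "aggs": {
        "costars": {
          "terms": { "field": "actors", "size": 5 }
        }
      }
    }
  }
}
----
<1> The only change from the earlier sketch: collect this `terms` aggregation breadth-first, so the top 10 actors are chosen before the `costars` buckets are built.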

Breadth-first should be used only when you expect more buckets to be generated
than documents landing in the buckets. Breadth-first works by caching
document data at the bucket level, and then replaying those documents to child
aggregations after the pruning phase.

The memory requirement of a breadth-first aggregation is linear to the number
of documents in each bucket prior to pruning. For many aggregations, the
number of documents in each bucket is very large. Think of a histogram with
monthly intervals: you might have thousands or hundreds of thousands of
documents per bucket. This makes breadth-first a bad choice, and is why
depth-first is the default.

But for the actor example--which generates a large number of
buckets, but each bucket has relatively few documents--breadth-first is much
more memory efficient, and allows you to build aggregations that would
otherwise fail.

Would the following rendering be better: breadth-first is only appropriate when the number of documents that fall into each group is far smaller than the total number of groups, because breadth-first caches, for each group kept after pruning, all of the documents that group needs, so that its child aggregations can reuse the parent aggregation's data.


Would the following rendering be better: breadth-first memory usage is linear in the amount of per-group data that is cached after pruning?

For the histogram example, a suggested rendering: imagine a histogram grouped by month; the total number of groups is fixed, because there are only 12 months in a year, but the amount of data under each month can be very large.

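As an illustration of that point (not from the original chapter), a monthly `date_histogram` with a nested `terms` aggregation is the kind of query where the default depth-first mode is normally the right choice: there are only a handful of buckets, but each one can hold an enormous number of documents, so caching documents per bucket the way breadth-first does would be expensive. The index and field names below are invented, and newer Elasticsearch releases spell the interval parameter `calendar_interval`:

[source,js]
----
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "per_month": {
      "date_histogram": {
        "field":    "timestamp",
        "interval": "month"              <1>
      },
      "aggs": {
        "top_users": {
          "terms": { "field": "user", "size": 10 }
        }
      }
    }
  }
}
----
<1> At most 12 buckets per year, each potentially containing a huge number of documents, which is exactly the shape of data that favors depth-first.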

A suggested rendering for the closing paragraph: for the actor example above, as the data volume grows, the default depth-first mode produces a very large total number of groups, while the number of groups expected from the second-level aggregation is much smaller than that total; in this situation breadth-first saves a great deal of memory, and choosing the right collection mode can greatly improve the chance that such aggregation queries succeed.