Commit 13d63eb

Merge pull request #1304 from aws/varunkn/metrics-docs
Metrics: Add ReadMe with tenets, metrics list to collect, design doc.
2 parents 0611fe4 + bb94adc commit 13d63eb

11 files changed, +811 -0 lines changed

docs/design/core/metrics/Design.md

Lines changed: 280 additions & 0 deletions
@@ -0,0 +1,280 @@
## Concepts
### Metric
* A representation of the data collected.
* A Metric can be one of the following types: Counter, Gauge, Timer.
* A Metric can be associated with a category. Some of the metric categories are Default, HttpClient, Streaming, etc.

### MetricRegistry

* A MetricRegistry represents an interface to store the collected metric data. It can hold the different types of Metrics
  described above.
* MetricRegistry is generic and not tied to a specific category (ApiCall, HttpClient, etc.) of metrics.
* Each API call has its own instance of a MetricRegistry. All metrics collected in the API call lifecycle are stored in
  that instance.
* A MetricRegistry can store other instances of the same type. This can be used to store metrics for each attempt in an
  API call.
* [Interface prototype](prototype/MetricRegistry.java)

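The nesting described above (one registry per API call, holding one child registry per attempt) can be sketched roughly as follows. This is a minimal illustration only: `SketchRegistry` and its methods are hypothetical names invented here, it only models counters, and it is not the interface defined in `prototype/MetricRegistry.java`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry sketch (assumption); not the SDK's prototype interface.
// Not thread-safe; kept minimal to show the parent/child structure.
final class SketchRegistry {
    private final Map<String, Long> counters = new LinkedHashMap<>();
    private final List<SketchRegistry> attempts = new ArrayList<>();

    // Increment a named counter, creating it on first use.
    void increment(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    // Current value of a counter, or 0 if it was never incremented.
    long count(String name) {
        return counters.getOrDefault(name, 0L);
    }

    // Create a child registry for one attempt and keep it inside this registry.
    SketchRegistry newAttempt() {
        SketchRegistry attempt = new SketchRegistry();
        attempts.add(attempt);
        return attempt;
    }

    // Read-only view of the per-attempt registries.
    List<SketchRegistry> attempts() {
        return Collections.unmodifiableList(attempts);
    }
}

final class SketchRegistryDemo {
    public static void main(String[] args) {
        SketchRegistry apiCall = new SketchRegistry();
        apiCall.increment("ApiCallCount");
        // Two attempts: the first one is retried.
        apiCall.newAttempt().increment("HttpRequestCount");
        apiCall.newAttempt().increment("HttpRequestCount");
        System.out.println("attempts recorded: " + apiCall.attempts().size());
    }
}
```

Keeping attempt-level metrics in child registries means a single object still describes the whole API call when it is later handed to a publisher.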
### MetricPublisher

* A MetricPublisher represents an interface to publish the collected metrics to an external source.
* SDK provides implementations to publish metrics to services like [Amazon
  CloudWatch](https://aws.amazon.com/cloudwatch/) and [Client Side
  Monitoring](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/sdk-metrics.html) (also known as AWS SDK
  Metrics for Enterprise Support).
* Customers can implement the interface and register the custom implementation to publish metrics to a platform not
  supported in the SDK.
* MetricPublishers can have different behaviors in terms of the list of metrics to publish, publishing frequency,
  configuration needed to publish, etc.
* Metrics can be explicitly published to the platform by calling the publish() method. This can be useful in scenarios
  where the application fails and the customer wants to flush metrics before exiting the application.
* [Interface prototype](prototype/MetricPublisher.java)

### Reporting

* Reporting is transferring the collected metrics to Publishers.
* To report metrics to a publisher, call the registerMetrics(MetricRegistry) method on the MetricPublisher.
* The Publisher is not required to publish the reported metrics immediately after this method is called.


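The split between reporting (handing metrics to a publisher) and publishing (sending them to an external source) can be sketched as a small buffer. This is an assumption-laden illustration: `BufferingPublisher` is a hypothetical class, a plain `Map<String, Long>` stands in for a MetricRegistry, and an in-memory list stands in for the external service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical publisher sketch (assumption); not the interface in prototype/MetricPublisher.java.
final class BufferingPublisher {
    // Metrics reported but not yet published.
    private final Queue<Map<String, Long>> pending = new ConcurrentLinkedQueue<>();
    // Stand-in for an external service such as a metrics backend.
    private final List<Map<String, Long>> sink = new ArrayList<>();

    // Reporting: accept the metrics of one API call without publishing them yet.
    void registerMetrics(Map<String, Long> apiCallMetrics) {
        pending.add(apiCallMetrics);
    }

    // Publishing: drain everything reported so far and return how many entries were sent.
    synchronized int publish() {
        int sent = 0;
        Map<String, Long> next;
        while ((next = pending.poll()) != null) {
            sink.add(next);
            sent++;
        }
        return sent;
    }

    // Total number of entries that have reached the sink.
    synchronized int publishedCount() {
        return sink.size();
    }
}
```

In this sketch an application could call `publish()` from a shutdown hook to flush whatever is still buffered, which mirrors the explicit-flush scenario described for MetricPublisher.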
## Enabling Metrics

The metrics feature is disabled by default. Metrics can be enabled at the client level in the following ways.

### Feature Flags (Metrics Provider)

* SDK exposes an [interface](prototype/MetricConfigurationProvider.java) to enable the metrics feature and specify
  options to configure the metrics behavior.
* SDK provides an implementation of this interface based on system properties.
* Here are the system properties SDK supports:
  - **aws.javasdk2x.metrics.enabled** - Metrics feature is enabled if this system property is set
  - **aws.javasdk2x.metrics.category** - Comma-separated set of MetricCategory values that are enabled for collection
* SDK calls the methods in this interface for each request, i.e., the enabled() method is called for every request to
  determine whether the metrics feature is enabled (similarly for other configuration options).
  - This allows customers to control metrics behavior in a more flexible manner; for example, using an external database
    like DynamoDB to dynamically control metrics collection. This is useful to enable/disable the metrics feature and
    control metrics options at runtime without the need to make code changes or re-deploy the application.
* As the interface methods are called for each request, it is recommended that implementations run expensive tasks
  asynchronously in the background, cache the results, and periodically refresh them.

```java
ClientOverrideConfiguration config = ClientOverrideConfiguration
    .builder()
    // If this is not set, SDK uses the default chain with system property
    .metricConfigurationProvider(new SystemSettingsMetricConfigurationProvider())
    .build();

// Set the ClientOverrideConfiguration instance on the client builder
CodePipelineAsyncClient asyncClient =
    CodePipelineAsyncClient
        .builder()
        .overrideConfiguration(config)
        .build();
```

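The recommendation to keep expensive lookups off the request path can be sketched with a cached flag that a background thread refreshes periodically. Everything here is hypothetical: `CachedMetricsFlag`, its constructor parameters, and the `slowLookup` supplier (which would stand in for something like a database read) are assumptions for illustration, not part of the prototype interface.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch (assumption): caches one "enabled" flag and refreshes it in the background.
final class CachedMetricsFlag implements AutoCloseable {
    private final BooleanSupplier slowLookup;
    private final ScheduledExecutorService refresher;
    private volatile boolean enabled;

    // refreshPeriodMillis must be greater than zero.
    CachedMetricsFlag(BooleanSupplier slowLookup, long refreshPeriodMillis) {
        this.slowLookup = slowLookup;
        // Load once up front so the first request already sees a value.
        this.enabled = slowLookup.getAsBoolean();
        this.refresher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "metrics-config-refresh");
            t.setDaemon(true); // do not keep the JVM alive just for refreshes
            return t;
        });
        refresher.scheduleWithFixedDelay(
            this::refresh, refreshPeriodMillis, refreshPeriodMillis, TimeUnit.MILLISECONDS);
    }

    // Called on the request path: reads the cached value only, never the slow source.
    boolean enabled() {
        return enabled;
    }

    // Runs on the background thread; keeps the last known value if the lookup fails.
    void refresh() {
        try {
            enabled = slowLookup.getAsBoolean();
        } catch (RuntimeException e) {
            // Swallowed deliberately so the scheduled task keeps running.
        }
    }

    @Override
    public void close() {
        refresher.shutdownNow();
    }
}
```

The trade-off in this sketch is staleness: a change in the backing store only becomes visible after the next refresh, in exchange for a per-request cost of a single volatile read.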
### Metrics Provider Chain

* Customers might want to have different ways of enabling the metrics feature. For example: use system properties by
  default and, if they are not set, fall back to an implementation based on Amazon DynamoDB.
* To support multiple providers, SDK allows setting a chain of providers (similar to the CredentialsProviderChain used
  to resolve credentials). As a provider has multiple configuration options, a single provider is resolved at chain
  construction time and it is used throughout the lifecycle of the application to keep the behavior intuitive.
* If no custom chain is provided, SDK will use a default chain which looks for the system properties defined in the
  above section. SDK can add more providers to the default chain in the future without breaking customers.

```java
MetricConfigurationProvider chain = new MetricConfigurationProviderChain(
    new SystemSettingsMetricConfigurationProvider(),
    // example custom implementation (not provided by the SDK)
    DynamoDBMetricConfigurationProvider.builder()
                                       .tableName(TABLE_NAME)
                                       .enabledKey(ENABLE_KEY_NAME)
                                       ...
                                       .build()
);

ClientOverrideConfiguration config = ClientOverrideConfiguration
    .builder()
    // If this is not set, SDK uses the default chain with system property
    .metricConfigurationProvider(chain)
    .build();

// Set the ClientOverrideConfiguration instance on the client builder
CodePipelineAsyncClient asyncClient =
    CodePipelineAsyncClient
        .builder()
        .overrideConfiguration(config)
        .build();
```

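One possible reading of "a single provider is resolved at chain construction time" is sketched below. The design does not say how a provider signals that it is usable, so the `isConfigured()` method is purely an assumption made for this sketch, as are the type names `SketchConfigProvider`, `SketchProviderChain`, and `SystemPropertyProvider`.

```java
// Hypothetical provider contract (assumption); isConfigured() is not part of the design doc.
interface SketchConfigProvider {
    boolean isConfigured();
    boolean enabled();
}

// Picks the first configured provider once, in the constructor, and delegates to it afterwards.
final class SketchProviderChain implements SketchConfigProvider {
    private final SketchConfigProvider resolved; // null if no provider was configured

    SketchProviderChain(SketchConfigProvider... providers) {
        SketchConfigProvider first = null;
        for (SketchConfigProvider p : providers) {
            if (p.isConfigured()) {
                first = p;
                break;
            }
        }
        this.resolved = first;
    }

    @Override
    public boolean isConfigured() {
        return resolved != null;
    }

    @Override
    public boolean enabled() {
        return resolved != null && resolved.enabled();
    }
}

// Mirrors the documented rule: metrics are enabled if the system property is set.
final class SystemPropertyProvider implements SketchConfigProvider {
    static final String KEY = "aws.javasdk2x.metrics.enabled";

    @Override
    public boolean isConfigured() {
        return System.getProperty(KEY) != null;
    }

    @Override
    public boolean enabled() {
        return isConfigured();
    }
}
```

Because resolution happens once, setting the system property after the chain is built does not switch providers in this sketch; only a newly constructed chain would see it.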
### Metric Publishers Configuration

* If metrics are enabled, SDK by default uses a single publisher that uploads metrics to CloudWatch using the default
  credentials and region.
* Customers might want to use a different configuration for the CloudWatch publisher or even use a different publisher
  to publish to a different destination. To provide this flexibility, SDK exposes an option to set
  [MetricPublisherConfiguration](prototype/MetricPublisherConfiguration.java), which can be used to configure custom
  publishers.
* SDK publishes the collected metrics to each of the publishers configured in the MetricPublisherConfiguration.

```java
ClientOverrideConfiguration config = ClientOverrideConfiguration
    .builder()
    .metricPublisherConfiguration(MetricPublisherConfiguration
        .builder()
        .addPublisher(
            CloudWatchPublisher.builder()
                .credentialsProvider(...)
                .region(Region.AP_SOUTH_1)
                .publishFrequency(5, TimeUnit.MINUTES)
                .build(),
            CsmPublisher.create())
        .build())
    .build();

// Set the ClientOverrideConfiguration instance on the client builder
CodePipelineAsyncClient asyncClient =
    CodePipelineAsyncClient
        .builder()
        .overrideConfiguration(config)
        .build();
```


## Modules
New modules are created to support the metrics feature.

### metrics-spi
* Contains the metrics interfaces and default implementations that don't require other dependencies
* This is a sub-module under `core`
* `sdk-core` has a dependency on `metrics-spi`, so customers will automatically get a dependency on this module.

### metrics-publishers
* This is a new module that contains implementations of all SDK-supported publishers
* Under this module, a new sub-module is created for each publisher (`cloudwatch-publisher`, `csm-publisher`)
* Customers have to **explicitly add a dependency** on these modules to use the SDK-provided publishers


## Sequence Diagram

<b>Metrics Collection</b>

<div style="text-align: center;">

![Metrics Collection](images/MetricCollection.jpg)

</div>

<b>MetricPublisher</b>

<div style="text-align: center;">

![MetricPublisher fig.align="left"](images/MetricPublisher.jpg)

</div>

1. The client enables the metrics feature through MetricConfigurationProvider and configures publishers through
   MetricPublisherConfiguration.
2. For each API call, a new MetricRegistry object is created and stored in the ExecutionAttributes. If metrics are not
   enabled, a NoOpMetricRegistry is used.
3. At each metric collection point, the metric is registered in the MetricRegistry object if its category is enabled in
   MetricConfigurationProvider.
4. The metrics that are collected once per API call execution are stored in the METRIC_REGISTRY ExecutionAttribute.
5. The metrics that are collected per API call attempt are stored in new MetricRegistry instances which are part of the
   API call's MetricRegistry. The MetricRegistry instance for the current attempt is also accessible through the
   ATTEMPT_METRIC_REGISTRY ExecutionAttribute.
6. At the end of the API call, the MetricRegistry object is reported to the MetricPublishers by calling the
   registerMetrics(MetricRegistry) method. This is done in an ExecutionInterceptor.
7. Steps 2 to 6 are repeated for each API call.
8. The MetricPublisher calls the publish() method to report metrics to external sources. The frequency of publish()
   calls is specific to each Publisher implementation.
9. The client has access to all registered publishers and can call the publish() method explicitly if desired.


<b>CloudWatch MetricPublisher</b>

<div style="text-align: center;">

![CloudWatch MetricPublisher](images/CWMetricPublisher.jpg)

</div>

## Implementation Details
A few important implementation details are discussed in this section.

SDK modules can be organized as shown in this image.

<div style="text-align: center;">

![Module Hierarchy](images/MetricsModulesHierarchy.png)

</div>

* Core modules - Modules in the core directory which have access to ExecutionContext and ExecutionAttributes
* Downstream modules - Modules where execution occurs after core modules. For example, http-clients is a downstream
  module, as the request is transferred from core to the HTTP client for further execution.
* Upstream modules - Modules that live in layers above core. Examples are High Level Libraries (HLL) or applications
  that use the SDK. Execution goes from upstream modules to core modules.

### Core Modules
* SDK will use ExecutionAttributes to pass the MetricConfigurationProvider information throughout the core module where
  core request-response metrics are collected.
* Instead of checking whether metrics are enabled at each metric collection point, SDK will use an instance of
  NoOpMetricRegistry (if metrics are disabled) or DefaultMetricRegistry (if metrics are enabled).
* The NoOpMetricRegistry class does not collect or store any metric data. Instead of creating a new NoOpMetricRegistry
  instance for each request, the same instance is used for every request to avoid additional object creation.
* The DefaultMetricRegistry class will only collect metrics if they belong to the MetricCategory list provided in the
  MetricConfigurationProvider. To support this, DefaultMetricRegistry is decorated by another class that filters out
  metric categories that are not set in MetricConfigurationProvider.

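The combination of a shared no-op registry and a category-filtering decorator can be sketched as below. All type names (`SketchCategory`, `CategoryRegistry`, `NoOpCategoryRegistry`, `StoringRegistry`, `FilteringRegistry`, `SketchRegistries`) are hypothetical stand-ins chosen for this illustration and do not reflect the actual prototype classes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// All names below are hypothetical stand-ins (assumption), not the SDK's prototype types.
enum SketchCategory { DEFAULT, HTTP_CLIENT, STREAMING }

interface CategoryRegistry {
    void record(SketchCategory category, String name, long value);
}

// One shared instance for every request when metrics are disabled.
enum NoOpCategoryRegistry implements CategoryRegistry {
    INSTANCE;

    @Override
    public void record(SketchCategory category, String name, long value) {
        // Intentionally empty: nothing is collected or stored.
    }
}

// Stores every metric it is given (not thread-safe; kept minimal).
final class StoringRegistry implements CategoryRegistry {
    final Map<String, Long> values = new HashMap<>();

    @Override
    public void record(SketchCategory category, String name, long value) {
        values.put(name, value);
    }
}

// Decorator that drops metrics whose category is not enabled.
final class FilteringRegistry implements CategoryRegistry {
    private final CategoryRegistry delegate;
    private final Set<SketchCategory> enabled;

    FilteringRegistry(CategoryRegistry delegate, Set<SketchCategory> enabled) {
        this.delegate = delegate;
        this.enabled = new HashSet<>(enabled);
    }

    @Override
    public void record(SketchCategory category, String name, long value) {
        if (enabled.contains(category)) {
            delegate.record(category, name, value);
        }
    }
}

final class SketchRegistries {
    // Chosen once per request, so collection points never check an "enabled" flag themselves.
    static CategoryRegistry forRequest(boolean metricsEnabled, Set<SketchCategory> categories) {
        return metricsEnabled
            ? new FilteringRegistry(new StoringRegistry(), categories)
            : NoOpCategoryRegistry.INSTANCE;
    }
}
```

With this shape, a collection point always calls `record(...)` unconditionally; whether the value is kept is decided by which registry was selected for the request.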
### Downstream Modules
* The MetricRegistry object and other required metric configuration details will be passed to the classes in downstream
  modules.
* For example, HttpExecuteRequest for the sync HTTP client and AsyncExecuteRequest for the async HTTP client.
* Downstream modules record the metric data directly into the given MetricRegistry object.
* As the same MetricRegistry object is used for core and downstream modules, both sets of metrics will be reported to
  the Publisher together.

### Upstream Modules
* As the MetricRegistry object is created after execution is passed on from upstream modules, these modules won't be
  able to modify or add to the core metrics.
* If upstream modules want to report additional metrics using the registered publishers, they would need to create
  MetricRegistry instances and explicitly call the methods on the Publishers.
* It would be useful to get the low-level API metrics in these modules, so SDK will expose APIs to get an immutable
  version of the MetricRegistry object so that upstream classes can use that information in their metric calculations.

### Reporting
* Collected metrics are reported to the configured publishers at the end of each API call by calling the
  `registerMetrics(MetricRegistry)` method on MetricPublisher.
* The MetricRegistry argument in the registerMetrics method will have data on the entire API call, including retries.
* This reporting is done in `MetricsExecutionInterceptor` via the `afterExecution()` and `onExecutionFailure()` methods.
* `MetricsExecutionInterceptor` will always be the last configured ExecutionInterceptor in the interceptor chain.


## Performance
One of the main tenets for metrics is "Enabling default metrics should have minimal impact on the application
performance". The following design choices are made to ensure enabling metrics does not affect performance
significantly.
* When collecting metrics, a NoOpMetricRegistry is used if metrics are disabled. All methods in this registry are no-ops
  and return immediately. This also has the additional benefit of avoiding a metricsEnabled check at each metric
  collection point.
* Metric publisher implementations can involve network calls and impact latency if done in a blocking way. So all SDK
  publisher implementations will process the metrics asynchronously and will not block the actual request.


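The non-blocking behavior described above can be sketched by handing each report to a background worker thread. This is only an illustration under assumptions: `AsyncSketchPublisher` is a hypothetical class, a `String` stands in for a MetricRegistry, and the `upload` callback stands in for a network call to an external service.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch (assumption): the request thread only enqueues work; a worker thread uploads.
final class AsyncSketchPublisher implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "metrics-publisher");
        t.setDaemon(true); // do not keep the JVM alive just for metrics
        return t;
    });
    private final Consumer<String> upload; // stand-in for a network call
    private final AtomicInteger uploaded = new AtomicInteger();

    AsyncSketchPublisher(Consumer<String> upload) {
        this.upload = upload;
    }

    // Returns immediately; the upload itself runs on the worker thread.
    void registerMetrics(String apiCallSummary) {
        worker.execute(() -> {
            upload.accept(apiCallSummary);
            uploaded.incrementAndGet();
        });
    }

    int uploadedCount() {
        return uploaded.get();
    }

    // Intended for application exit: stops accepting new work and waits for queued uploads.
    // Returns false if the timeout elapsed or the waiting thread was interrupted.
    boolean flush(long timeout, TimeUnit unit) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    @Override
    public void close() {
        worker.shutdownNow();
    }
}
```

A single worker thread keeps uploads in submission order in this sketch; a real publisher would likely also batch entries and bound the queue so a slow backend cannot exhaust memory.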
## Testing

To ensure performance is not impacted by metrics, tests should be written for various scenarios and a baseline for
overhead should be created. These tests should be run regularly to catch regressions.

### Test Cases

SDK will be tested under load for each of these test cases using the load testing framework we already have. Each
test case should be run with the metrics feature disabled and enabled, and the results compared.

1. Enable each metrics publisher (CloudWatch, CSM) individually.
2. Enable all metrics publishers.
3. Individually enable each metric category to find the overhead for each MetricCategory.