Metrics not appearing at 1 minute resolution as expected #406
Comments
Hey @bml1g12 - thanks for raising this. As the metric timestamp is in the logs, there seems to be an issue with the CloudWatch EMF backend processing them, or 1-second resolution isn't supported (I can't find it in the docs). If the latter is correct, you can try switching to 1 minute to confirm whether data points appear correctly in the console, as the Console will fall back to aggregate metrics that don't support lower resolutions. Could you please open a support case with these metric objects in the logs and a console screenshot? @jaredcnance - does this ring a bell, given the metrics are being logged correctly? I'll try to reproduce this week. |
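To check at 1-minute granularity outside the console, a quick boto3 sketch can help; the namespace is taken from the EMF blobs quoted later in the thread, while the metric name and dimension value are placeholders rather than the reporter's actual configuration:

```python
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

# Namespace comes from the EMF blob quoted later in this thread;
# metric name and dimension value are placeholders.
response = cloudwatch.get_metric_statistics(
    Namespace="ThriEntranceCounterLambdaGenesis",
    MetricName="SuccessfulEntrance",
    Dimensions=[{"Name": "service", "Value": "service_undefined"}],
    StartTime=datetime(2021, 4, 21, 0, 0, tzinfo=timezone.utc),
    EndTime=datetime(2021, 4, 21, 1, 0, tzinfo=timezone.utc),
    Period=60,                      # 1-minute resolution
    Statistics=["Sum", "SampleCount"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"], point["SampleCount"])
```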
The AWS account in which I experienced this issue does not have a support plan sufficient for me to create a technical support case, but I could make one using a different account I have access to that does have this support, if you think that would be wise? |
If you create it in another account, you'd need to reproduce this issue there too, as Support cannot provide cross-account support (unless it's Enterprise Support). I pinged the CloudWatch team in the meantime :) |
That's a shame; I tried to reproduce this on a dummy Lambda function but could not manage to reproduce the same issue, so I am not confident I will be able to reproduce it in a separate account (as it would require running the whole end-to-end system), yet it persists in our production system, so I think there is a genuine issue somewhere. Thanks for pinging the CloudWatch team; I appreciate any help on this one as it's puzzling. From your message above I gather Powertools just logs a message to CloudWatch in the correct format for metrics, and it is then the responsibility of the CloudWatch EMF backend to deal with it, and as such it cannot be a bug in the aws-lambda-powertools library itself? I am logging the metric data already, so I can create a manual log filter to get the metric if needed, but I was doing this before I discovered aws-lambda-powertools and thought that the metric decorator of aws-lambda-powertools is more elegant, because if I delete a Lambda stack, the associated manual log filter needs re-configuring to point to the correct log again. |
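For reference, the manual alternative mentioned above is typically a CloudWatch Logs metric filter; a rough boto3 sketch, with every name below being a placeholder rather than the reporter's actual configuration:

```python
import boto3

logs = boto3.client("logs")

# All names below are placeholders, not the reporter's actual configuration.
logs.put_metric_filter(
    logGroupName="/aws/lambda/entrance-counter",
    filterName="successful-entrance-count",
    # Match a structured log line such as {"event": "successful_entrance", ...}
    filterPattern='{ $.event = "successful_entrance" }',
    metricTransformations=[
        {
            "metricName": "SuccessfulEntrance",
            "metricNamespace": "EntranceCounter",
            "metricValue": "1",
        }
    ],
)
```

This is the piece that has to be re-created whenever the log group is deleted along with the stack, which is exactly the maintenance burden the decorator-based EMF approach avoids.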
EMF does not support 1-second time resolution. A couple of follow-up questions on this: What is the reported sample count for the metric in question? Also, this appears to be a one-off issue that you haven't been able to reproduce, is that correct? |
I'm trying to figure out whether that is still the case or if it was a transient issue, since I too can't reproduce it on my end. If it is consistent, could you shoot me an email with the following info? I'll create a ticket with CloudWatch directly as an exception here, as this seems like a transient issue on the CloudWatch side (the EMF blob is being generated correctly by the library).
Thanks a lot |
It is indeed consistent; I will send an email, thank you. |
Thanks a lot for sending that info over - we now know this is happening because of that Metric Dimension. @pcolazurdo and I looked over all metric data points and we can clearly see it creating two metrics. Here's what you can do to fix it now, and what we'll do to make it less error prone:

Before

```python
METRICS = Metrics()
METRICS.add_dimension(name="environment", value=os.environ["NODE_ENV"])  # Don't, unless it's for cold start only

@METRICS.log_metrics(capture_cold_start_metric=
                     os.environ.get("POWERTOOLS_METRICS_CAPTURE_COLD_START",
                                    "false").lower() == "true")
@TRACER.capture_lambda_handler(capture_error=True)
def main(event, context):
    ...
```

After

```python
METRICS = Metrics()

@METRICS.log_metrics(capture_cold_start_metric=
                     os.environ.get("POWERTOOLS_METRICS_CAPTURE_COLD_START",
                                    "false").lower() == "true")
@TRACER.capture_lambda_handler(capture_error=True)
def main(event, context):
    # This will always add this metric dimension across all metrics you add
    METRICS.add_dimension(name="environment", value=os.environ["NODE_ENV"])
```

What happens is that CloudWatch understands a metric to be unique as a combination of metric name + dimension name(s). By looking at your logs you'll see this key here:

```json
{
  "_aws": {
    "Timestamp": 1618965868129,
    "CloudWatchMetrics": [
      {
        "Namespace": "ThriEntranceCounterLambdaGenesis",
        "Dimensions": [
          [
            "service"
          ]
        ],
```

This would be different during cold start, as you'd have both dimensions, hence why there is a single data point, why it's consistent, and also why we couldn't reproduce it in our accounts:

```json
{
  "_aws": {
    "Timestamp": 1618965868129,
    "CloudWatchMetrics": [
      {
        "Namespace": "ThriEntranceCounterLambdaGenesis",
        "Dimensions": [
          [
            "service",
            "environment"
          ]
        ],
```

Moving forward, @pcolazurdo and I talked about making it easier to define dimensions that should always be available for all metrics, plus some documentation updates. This is what we're going to do for next week's release:

- Add a new method to define dimensions that should always be available for all metrics
- Update the docs to clarify the examples with dimensions

Please let us know how that goes in the meantime while we make these fixes, and our sincere apologies for the inconvenience caused. |
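One way to see the two metric identities this produces is to list the metrics in the namespace; a sketch along these lines (the namespace is taken from the blobs above, everything else is illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Namespace comes from the EMF blobs above; everything else is illustrative.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="ThriEntranceCounterLambdaGenesis"):
    for metric in page["Metrics"]:
        dims = sorted(d["Name"] for d in metric["Dimensions"])
        print(metric["MetricName"], dims)

# With add_dimension() at module level you'd see the same metric name twice:
#   once with ['service'] (warm invocations) and
#   once with ['environment', 'service'] (cold starts only)
```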
Oh wow, thanks for the clear explanation and for helping me debug this one. It is quite surprising behaviour to me: in the non-cold-start runs, I would have thought this data would persist between invocations, in the same way that the Metrics() instance itself persists between cold starts.
I presume this should work too? |
Yes @bml1g12, it works too. Anything inside the handler runs on every invocation. What happens here is that the Lambda web service reuses the execution environment between invocations, so code outside the handler (like a module-level add_dimension call) only runs at cold start; that is why your environment dimension only showed up on cold start data points. I'm finishing the Logger refactor PR and will add a new method that you can safely call anywhere you want to prevent this from happening: |
What happens from a data persisting point of view is that when using the log_metrics decorator we 1/ serialize all metrics you added into the EMF blob and flush it to the logs when the handler returns, and 2/ clear the metrics and dimensions held in memory afterwards. If we don't do step 2, you'll have metrics and dimensions that you might not want. This could lead to many side effects too. In any case, moving it to the handler works, and we will address these points as I mentioned earlier so they won't confuse anyone anymore. |
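As a minimal sketch of that lifecycle, assuming placeholder namespace, service, and metric names rather than the reporter's real ones:

```python
import os
from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

# Namespace, service, and metric names are illustrative placeholders.
metrics = Metrics(namespace="ExampleApp", service="entrance-counter")

@metrics.log_metrics  # serializes the EMF blob and clears metrics/dimensions on return
def handler(event, context):
    # Runs on every invocation, so the dimension is present on every data point
    metrics.add_dimension(name="environment", value=os.environ.get("NODE_ENV", "dev"))
    metrics.add_metric(name="SuccessfulEntrance", unit=MetricUnit.Count, value=1)
    return {"statusCode": 200}
```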
I've added the new method and updated the docs examples. I'll add one extra banner to make it even more explicit, but I would be grateful if you could confirm whether this resolves the confusion. PR: #410. Thank you @bml1g12 |
Yup LGTM, I'm happy to close this topic, but haven't had a chance to test the fix yet |
Great. Let's leave it open until we hear from you. No rush. |
Moving to inside the handler resolved this issue; thanks for the support! |
Awesome to hear! Here are the new docs with the new feature to make this process less error prone; all examples were updated to be inside the handler to reflect this discussion, and I added a warning banner just above this section, as promised.
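For readers arriving later: current releases of aws-lambda-powertools expose Metrics.set_default_dimensions, which keeps dimensions across invocations without re-adding them in the handler. Whether this is the exact feature referenced above isn't confirmed here, so treat the following as a sketch with placeholder values:

```python
import os
from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

metrics = Metrics(namespace="ExampleApp", service="entrance-counter")  # placeholder names

# Default dimensions persist across invocations and are re-applied after
# log_metrics clears the in-memory metric set.
metrics.set_default_dimensions(environment=os.environ.get("NODE_ENV", "dev"))

@metrics.log_metrics
def handler(event, context):
    metrics.add_metric(name="SuccessfulEntrance", unit=MetricUnit.Count, value=1)
    return {"statusCode": 200}
```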
I am publishing metrics like this:
where my Lambda handler's top-level function has been decorated like this:
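The original snippets weren't captured on this page; based on the decorator chain and dimension quoted earlier in the thread, the setup was presumably along these lines (the metric name, unit, and helper function are guesses, not taken from the report):

```python
import os
from aws_lambda_powertools import Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

TRACER = Tracer()
METRICS = Metrics()
METRICS.add_dimension(name="environment", value=os.environ["NODE_ENV"])  # at module level, as discussed above

def record_entrance():
    # Metric name and unit here are guesses, not taken from the original report.
    METRICS.add_metric(name="SuccessfulEntrance", unit=MetricUnit.Count, value=1)

@METRICS.log_metrics(capture_cold_start_metric=
                     os.environ.get("POWERTOOLS_METRICS_CAPTURE_COLD_START",
                                    "false").lower() == "true")
@TRACER.capture_lambda_handler(capture_error=True)
def main(event, context):
    record_entrance()
```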
I expect to see CloudWatch metrics here appearing every few minutes, but instead I see them every 40 minutes or so, or at least at erratic times with a much larger interval than the rate at which they are being logged.
e.g.
I have checked my CloudWatch logs and found the following two entries, suggesting that on some level things are working as expected:
and
So I would expect to see a metric at 1618965743152 and 1618965868129,
or Wednesday, 21 April 2021 09:42:23.152 GMT+09:00 and Wednesday, 21 April 2021 09:44:28.129 GMT+09:00 respectively.
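For reference, those epoch-millisecond timestamps do convert to the stated GMT+09:00 times; a quick sketch:

```python
from datetime import datetime, timezone, timedelta

JST = timezone(timedelta(hours=9))  # GMT+09:00
for ts_ms in (1618965743152, 1618965868129):
    print(datetime.fromtimestamp(ts_ms / 1000, tz=JST).isoformat())
# 2021-04-21T09:42:23.152000+09:00
# 2021-04-21T09:44:28.129000+09:00
```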
But instead, I see the following when aggregating at 1-second sum:
Am I using this functionality wrong? Or is there some sort of built-in default aggregation over a large time period somewhere?
Environment
aws-lambda-powertools==1.12.0
A layer created with:
Python 3.7