
Feature request: Calculate Log Sampling Per Request #3278


Closed
jasonwadsworth opened this issue Nov 1, 2024 · 25 comments · Fixed by #3644 or #3672
Labels
completed: This item is complete and has been merged/shipped
feature-request: This item refers to a feature request for an existing or new utility
logger: This item relates to the Logger Utility


@jasonwadsworth

Use case

The log sampling concept is great, but in practice it's not very useful if you have provisioned concurrency turned on, because the sampling decision happens at initialization.

Solution/User Experience

Per-request sampling would be ideal, as opposed to per-Logger sampling. That would let the user see the entirety of a sampled request rather than a mix of logs from different requests.

Alternative solutions

No response


Future readers

Please react with 👍 and your use case to help us understand customer demand.

@jasonwadsworth jasonwadsworth added feature-request This item refers to a feature request for an existing or new utility triage This item has not been triaged by a maintainer, please wait labels Nov 1, 2024
@dreamorosi
Contributor

dreamorosi commented Nov 4, 2024

Hi @jasonwadsworth, thanks for opening this issue.

This is an intended limitation that is called out in the documentation; we designed the feature this way to avoid adding overhead for customers who don't use it.

Up until now we have never had any demand for changing this behavior, but it's something we could consider provided there's enough demand from multiple customers.

In the meantime, you can call the logger.refreshSampleRateCalculation() method manually at the beginning or end of your function handler to recalculate the sampling.
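For example, a minimal sketch of that workaround might look like this (assuming a sampleRateValue set in the constructor, as in the documented API):

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  sampleRateValue: 0.5, // the initial decision still happens here, during INIT
});

export const handler = async () => {
  // ... your business logic

  // re-evaluate the sampling decision so subsequent invocations
  // aren't stuck with the one made at INIT
  logger.refreshSampleRateCalculation();
};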

@dreamorosi dreamorosi added logger This item relates to the Logger Utility need-customer-feedback Requires more customers feedback before making or revisiting a decision revisit-in-3-months Blocked issues/PRs that need to be revisited and removed triage This item has not been triaged by a maintainer, please wait labels Nov 4, 2024
@dreamorosi
Contributor

Hi @jasonwadsworth, are you by chance using the Middy.js middleware or the class method decorator, like:

import { Logger } from '@aws-lambda-powertools/logger';
import { injectLambdaContext } from '@aws-lambda-powertools/logger/middleware';
import middy from '@middy/core';

const logger = new Logger();

export const handler = middy(async () => {
  // your code
}).use(injectLambdaContext(logger));

or

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger();

class Lambda {
  @logger.injectLambdaContext()
  async handler (event, context) {
    // your code
  }
}

const lambda = new Lambda();
export const handler = lambda.handler.bind(lambda);

If yes, we could add an option to both of them to allow you to configure this behavior and opt-in into recalculating the rate at every request, for example:

import { Logger } from '@aws-lambda-powertools/logger';
import { injectLambdaContext } from '@aws-lambda-powertools/logger/middleware';
import middy from '@middy/core';

const logger = new Logger();

export const handler = middy(async () => {
  // your code
}).use(injectLambdaContext(logger, { refreshSampleRateCalculation: true })); // same for the decorator

This would be an unobtrusive change that we could roll out easily at any time.

@jasonwadsworth
Author

I'm not, but I was going to try your first suggestion for now.

@dreamorosi
Contributor

Thank you for clarifying.

As of now, unless you're using one of these two methods, the Logger has no way of knowing when a request starts or ends, so it'd be tricky to implement.

@dreamorosi dreamorosi moved this from Ideas to Backlog in Powertools for AWS Lambda (TypeScript) Feb 14, 2025
@dreamorosi dreamorosi added confirmed The scope is clear, ready for implementation and removed need-customer-feedback Requires more customers feedback before making or revisiting a decision revisit-in-3-months Blocked issues/PRs that need to be revisited labels Feb 14, 2025
@dreamorosi dreamorosi moved this from Backlog to Ideas in Powertools for AWS Lambda (TypeScript) Feb 14, 2025
@dreamorosi dreamorosi added discussing The issue needs to be discussed, elaborated, or refined and removed confirmed The scope is clear, ready for implementation labels Feb 14, 2025
@dreamorosi
Contributor

Had a discussion about this with @am29d yesterday and I thought it'd be useful to bring the outcomes of that discussion here, primarily for future implementation & consideration.

The log sampling feature changes the log level of a Logger to debug for a percentage of requests. Customers can set a rate in the Logger constructor and this rate is used to determine whether or not to change the log level to debug. This is useful for customers who want to keep a less verbose log level most of the time, but have more logs emitted for a percentage of their requests.

As it stands, the feature doesn't exactly behave as described above. This is because the percentage, or ratio, is not calculated at the request level, but rather when the Logger class is instantiated, which usually happens during the INIT phase of the execution environment, i.e.

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  sampleRateValue: 0.5, // whether or not the log level is switched to `debug` is decided here only
});

export const handler = async () => {
  // ... your logic
}

This means that all the requests served by the same environment inherit the sampling decision that was made when the environment was initialized, which in turn results in a sampling rate different from the desired one. The degree of this difference depends on how many environments are spun up and on the distribution of requests among them.

To explain what I mean, let's consider an example with 3 environments/sandboxes, a number of requests distributed across them, and - for the sake of simplicity - a log sampling rate of 0.5 (aka 50%):

[Image: diagram of requests distributed across 3 environments, each environment inheriting the single sampling decision made at INIT]

Assuming a truly random chance of 50%, one could end up in a situation like the above, which would result in a sample rate of ~85% rather than the expected 50%.
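To make the skew concrete, here's a hedged simulation sketch (plain Math.random(), not the Logger's actual internals) contrasting a per-environment decision with a per-request one:

// Sketch: sampling decided once per environment vs. once per request.
const RATE = 0.5;
const ENVIRONMENTS = 3;
const REQUESTS_PER_ENV = 1000;
const TOTAL = ENVIRONMENTS * REQUESTS_PER_ENV;

// Per-environment: one coin flip at INIT, inherited by every request.
// With 3 environments and even traffic the effective rate can only be
// 0, ~0.33, ~0.67, or 1; uneven traffic can land anywhere (e.g. ~0.85).
let sampledPerEnv = 0;
for (let env = 0; env < ENVIRONMENTS; env++) {
  if (Math.random() < RATE) sampledPerEnv += REQUESTS_PER_ENV;
}

// Per-request: a fresh coin flip for every request.
let sampledPerRequest = 0;
for (let i = 0; i < TOTAL; i++) {
  if (Math.random() < RATE) sampledPerRequest++;
}

console.log(`per-environment rate: ${(sampledPerEnv / TOTAL).toFixed(2)}`);
console.log(`per-request rate:     ${(sampledPerRequest / TOTAL).toFixed(2)}`); // ~0.50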

To get around this, and instead get a rate closer to the desired 50%, customers can use the logger.refreshSampleRateCalculation() method mentioned above at the start/end of each request.

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  sampleRateValue: 0.5,
});

export const handler = async () => {
  // ... your logic

  logger.refreshSampleRateCalculation(); // make a new sampling decision
}

When called, this method essentially flips the coin again and decides whether the log level should be switched to debug or not. Because this is done at the request level, statistically speaking, the ratio of sampled requests should be much closer to the desired one:

[Image: the same 3 environments with a fresh sampling decision per request, yielding a rate close to 50%]

With this in mind, we should consider easing this situation for customers by adding an optional flag to our class method decorator and Middy.js middleware so that when this flag is enabled, we'll call the logger.refreshSampleRateCalculation() method for them at the end of each request, as proposed here.

The flag would be false by default to maintain backward compatibility, although in a future major version we could consider enabling it by default, since this would be a much more accurate behavior than the current one.

Obviously, as mentioned above, this would work only if we're able to wrap the handler, so customers who are not using either of the two mechanisms just mentioned would have to continue calling logger.refreshSampleRateCalculation() manually.

@leandrodamascena
Contributor

Thanks for diving deep into the investigation @dreamorosi! I created a small PoC in Python with our Logger, and we get much more accurate debug lines than with the current implementation. To be honest, I think our current implementation is useless in most cases and doesn't add anything for customers, especially since the decision happens at the Logger constructor level and can sample 100% of the requests or 0% of them.

from aws_lambda_powertools import Logger

sampling_rate = 0.5
total_runs = 1000
sampled_count = 0
actual_rate = 0

logger = Logger(sampling_rate=sampling_rate)

for i in range(0, total_runs):
    # logger.debug("sample", tentative=i)  # omitted to avoid printing too much on the screen
    logger.reset_sampling()

sampled_count = logger.total_sampled  # internal variable just to get the number of times the decision was True

actual_rate = sampled_count / total_runs

print(f"Sampling rate: {sampling_rate}")
print(f"Total runs: {total_runs}")
print(f"Times sampled: {sampled_count}")
print(f"Actual sampling rate: {actual_rate:.2f}")

Output:

(venv) ➜  leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 508
Actual sampling rate: 0.51
(venv) ➜  leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 504
Actual sampling rate: 0.50
(venv) ➜  leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 470
Actual sampling rate: 0.47
(venv) ➜  leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 491
Actual sampling rate: 0.49
(venv) ➜  leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 503
Actual sampling rate: 0.50
(venv) ➜  leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 493
Actual sampling rate: 0.49

With this in mind, we should consider easing this situation for customers by adding an optional flag to our class method decorator and Middy.js middleware so that when this flag is enabled, we'll call the logger.refreshSampleRateCalculation() method for them at the end of each request, as proposed #3278 (comment).

I'm not sure I agree with this! Our current implementation is basically saying: try your luck! I think it makes sense to make this the default for anyone using the decorator (in Python too) and to change our documentation to say: this is the expected behavior, we fixed it for those who are using the decorator. If you're not using the decorator, use this function.

I'm more than happy to use "it's better to ask for forgiveness than permission" here. But of course I agree with any decision.

@am29d
Contributor

am29d commented Feb 20, 2025

After checking the Java and .NET runtimes with Henrique and Philipp, we determined that the other runtimes don't have this problem, only Python and NodeJS.

For TypeScript and Python my proposal is:

  1. Add a new option to the logger decorator: .injectLambdaContext({ samplingRate: 0.5 })
  2. Mark the samplingRate property in the constructor as deprecated
  3. Set the default behaviour of the decorator to call refreshSampleRateCalculation() if samplingRate is set either in the constructor or in the decorator options
  4. Give the samplingRate option in the decorator the highest priority

At this point I'd be happy to explore adding a new decorator, because injectLambdaContext has become a container for injecting dynamic behaviour that does far more than injecting context, i.e. logEvent, resetKeys.
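Under that proposal, usage might look like the following sketch (the samplingRate decorator option is hypothetical, per point 1 above, and not a released API):

import { Logger } from '@aws-lambda-powertools/logger';
import type { Context } from 'aws-lambda';

const logger = new Logger(); // no sampling in the constructor under this proposal

class Lambda {
  // hypothetical option from point 1; the decorator would refresh
  // the sampling decision on every request
  @logger.injectLambdaContext({ samplingRate: 0.5 })
  async handler(_event: unknown, _context: Context) {
    // your code
  }
}

const lambda = new Lambda();
export const handler = lambda.handler.bind(lambda);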

@dreamorosi
Contributor

So basically we're saying that you can't apply log sampling if you're not using a decorator or middleware?

@am29d
Contributor

am29d commented Feb 20, 2025

So basically we're saying that you can't apply log sampling if you're not using a decorator or middleware?

Hmm, yeah, I just noticed as well that we would rule out this option. I am on the fence.

I want to avoid too many options. Setting sampling in the constructor forces you to use an additional method to make it work properly. Alternatively, we can keep the sampleRate configuration and call the refresh for the decorator and Middy by default, and customers who don't use them need to refresh manually.

But then again, we would need to specifically say that if you use sampleRate you need a decorator. It'd be ideal if you could set the option in one place and not have to think about any additional configuration, thus moving it to the decorator and Middy.

@dreamorosi
Contributor

I want to avoid too many options. Setting sampling in the constructor forces you to use an additional method to make it work properly. Alternatively, we can keep the sampleRate configuration and call the refresh for the decorator and Middy by default, and customers who don't use them need to refresh manually.

I'd be inclined to do this - so the feature is additive

But I agree with Leandro's proposal above that we should improve the documentation and basically say that if you're not using the decorator/middleware you must call refreshSampleRateCalculation(), or it's not a true percentage of requests.

With that said, I don't have strong feelings. With the addition of log buffering I'd prefer to use that rather than having log sampling of debug logs anyway.

@dreamorosi
Contributor

Can you also elaborate on how it is that Java / .NET don't have this issue?

@am29d
Contributor

am29d commented Feb 20, 2025

Ok, let's keep all the options. I agree that it does not make sense to not support a feature only because customers don't use decorator or middy.

I will run some tests on Java and .NET to confirm.

@hjgraca
Contributor

hjgraca commented Feb 20, 2025

@am29d let me know if you need help testing .NET

@dreamorosi
Contributor

Ok then - let's move on to implementation next week (for TS).

Alex, I changed the status of the issue - hope everything is clear on your side.

github-actions bot

⚠️ COMMENT VISIBILITY WARNING ⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

@github-actions github-actions bot added pending-release This item has been merged and will be released soon and removed confirmed The scope is clear, ready for implementation labels Feb 24, 2025
@dreamorosi dreamorosi reopened this Feb 24, 2025
@github-project-automation github-project-automation bot moved this from Coming soon to Pending review in Powertools for AWS Lambda (TypeScript) Feb 24, 2025
@dreamorosi dreamorosi added discussing The issue needs to be discussed, elaborated, or refined and removed pending-release This item has been merged and will be released soon labels Feb 24, 2025
@dreamorosi dreamorosi moved this from Pending review to Working on it in Powertools for AWS Lambda (TypeScript) Feb 24, 2025
@dreamorosi
Contributor

dreamorosi commented Feb 24, 2025

I am reverting the PR we merged because there's an edge case we should discuss.

In the current implementation (the broken one), we emit a log like this when the sampling decision is successful:

{"level":"DEBUG","message":"Setting log level to DEBUG due to sampling rate","timestamp":"2025-02-24T17:24:10.141Z","service":"service_undefined","sampling_rate":0.5}

This log was emitted outside of the handler during the INIT phase because the sampling decision happened when initializing the Logger class.

With the new implementation, depending on when we refresh the sampling rate decision, a couple of things can happen.

For the sake of simplicity, let's assume a sample rate of 0.5 (aka 50%).

Case 1

If we refresh the sample rate before calling the handler when decorating it, then on a cold start the sampling decision will be made twice:

  • when instantiating the Logger
  • before calling the handler

With this in mind, you could end up in a situation where the first decision yields sampled and the second NOT sampled.

As an operator, you look at the logs and see the Setting log level to DEBUG due to sampling rate log, but in practice the invocation was NOT sampled and your debug logs are not there.

i.e.

Function code:
import { Logger } from "@aws-lambda-powertools/logger";
import type { Context } from "aws-lambda";

const logger = new Logger({
	sampleRateValue: 0.5,
});

class Lambda {
	@logger.injectLambdaContext()
	async handler(_event: unknown, _context: Context) {
		logger.debug("test debug");
		logger.info("test info");
	}
}

const lambda = new Lambda();
export const handler = lambda.handler.bind(lambda);
{"level":"DEBUG","message":"Setting log level to DEBUG due to sampling rate","timestamp":"2025-02-24T17:24:10.141Z","service":"service_undefined","sampling_rate":0.5}
{"level":"INFO","message":"test info","timestamp":"2025-02-24T17:24:10.606Z","service":"service_undefined","cold_start":true,"function_arn":"arn:aws:lambda:eu-west-1:123456789012:function:testing-aamorosi-metricsFunction-rkvztfzb","function_name":"testing-aamorosi-metricsFunction-rkvztfzb","function_request_id":"bf5780a1-2bc1-44e2-b953-6948bbc19a56","sampling_rate":0.5}

Case 2

If we refresh the sample rate after calling the handler when decorating it, the problem above doesn't exist, but now we have a different problem at potentially every other request.

Since we are calling the refreshSampleRateCalculation() method at the end of each request, the Setting log level to DEBUG due to sampling rate debug log will appear in the wrong request.

This is because refreshing the calculation at the end of a request means essentially deciding whether the next request will be sampled.

As an operator, I'd both see the log in a request that was potentially not sampled, and also not see it in a request that was sampled.
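A short timeline sketch of this flow (hypothetical, just to illustrate the misattribution):

// Case 2: refresh runs AFTER the handler returns.
//
// request N:
//   handler executes under the decision made at the end of request N-1
//   refreshSampleRateCalculation()   // actually decides request N+1
//     -> may emit "Setting log level to DEBUG due to sampling rate",
//        which lands in request N's logs even though it governs N+1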


With this in mind, my proposal is to completely remove the sampling decision from the Logger class initialization and refresh the sample rate calculation before every request.

This ensures that the Setting log level to DEBUG due to sampling rate debug log appears in the correct request's logs, but it also means it's now very important for customers who are doing this manually (aka not using the decorator) to call the method in the right place.

The problem with this is that customers who were using the feature without calling refreshSampleRateCalculation() at all will now never have sampling, which is a breaking change. Any ideas?

@leandrodamascena
Contributor

Thank you for bringing these edge cases to light @dreamorosi.

Case 1

If we refresh the sample rate before calling the handler when decorating it, then on a cold start the sampling decision will be made twice:

  • when instantiating the Logger
  • before calling the handler

With this in mind, you could end up in a situation where the first decision yields sampled and the second NOT sampled.

I don't know about TypeScript, but in Python I can make the sampling decision only once on a cold start, so it will never be made twice for the first invocation. Can you do the same?

As an operator, you look at the logs and see the Setting log level to DEBUG due to sampling rate log, but in practice the invocation was NOT sampled and your debug logs are not there.

In Python, I can refresh the sampling decision before each invocation, so the message "Setting log level to DEBUG due to sampling rate" will appear on the invocation that was sampled.

from collections import namedtuple
import uuid
from aws_lambda_powertools import Logger

logger = Logger(sampling_rate=0.5, level="INFO")

@logger.inject_lambda_context
def lambda_handler(event, context):
    logger.debug("my debug")
    logger.debug("my debug2")
    logger.debug("my debug3")

def lambda_context():
    lambda_context = {
        "function_name": "test",
        "memory_limit_in_mb": 128,
        "invoked_function_arn": "arn:aws:lambda:eu-west-1:809313241:function:test",
        "aws_request_id": str(uuid.uuid4()),
    }

    return namedtuple("LambdaContext", lambda_context.keys())(*lambda_context.values())


for i in range(0,4):
    lambda_handler(None, lambda_context())

Look at function_request_id:

{"level":"DEBUG","location":"_configure_sampling:373","message":"Setting log level to DEBUG due to sampling rate","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"lambda_handler:10","message":"my debug","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"lambda_handler:12","message":"my debug2","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"lambda_handler:14","message":"my debug3","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"_configure_sampling:373","message":"Setting log level to DEBUG due to sampling rate","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
{"level":"DEBUG","location":"lambda_handler:10","message":"my debug","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
{"level":"DEBUG","location":"lambda_handler:12","message":"my debug2","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
{"level":"DEBUG","location":"lambda_handler:14","message":"my debug3","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}

Case 2

If we refresh the sample rate after calling the handler when decorating it, the problem above doesn't exist, but now we have a different problem at potentially every other request.

Since we are calling the refreshSampleRateCalculation() method at the end of each request, the Setting log level to DEBUG due to sampling rate debug log will appear in the wrong request.

This is because refreshing the calculation at the end of a request means essentially deciding whether the next request will be sampled.

It's the same as I explained before, so I'll skip it.

As an operator, I'd both see the log in a request that was potentially not sampled, and also not see it in a request that was sampled.

With this in mind, my proposal is to completely remove the sampling decision from the Logger class initialization and refresh the sample rate calculation before every request.

The problem with this is that customers who were using the feature without calling refreshSampleRateCalculation() at all will now never have sampling, which is a breaking change. Any ideas?

Can you check whether it's a cold start, so the decorator doesn't make the sampling decision twice? In that case we wouldn't change the behavior of performing sampling at the constructor level.

In Python it is something like this:

[Image: Python snippet that checks for a cold start before refreshing the sampling decision]

Please let me know if you need any additional clarification from my side.
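In TypeScript terms, the same guard could look roughly like this sketch; firstInvocation is a hypothetical stand-in for a side-effect-free cold start check (the next comments discuss why reading the real flag is currently tricky):

import { Logger } from '@aws-lambda-powertools/logger';
import type { Context } from 'aws-lambda';

const logger = new Logger({ sampleRateValue: 0.5 });

// hypothetical module-scoped flag; see the discussion below about
// reading the Utility cold start flag without flipping it
let firstInvocation = true;

export const handler = async (event: unknown, context: Context) => {
  if (firstInvocation) {
    // the Logger constructor already made the decision for this cold start
    firstInvocation = false;
  } else {
    logger.refreshSampleRateCalculation(); // decide for this warm invocation
  }
  // ... your business logic
};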

@dreamorosi
Contributor

Hey, yes, we can do that; we have a method as well.

I didn't think of that - let's do it.

@am29d - whenever you're ready maybe we can reopen the PR with this change.

@am29d
Contributor

am29d commented Feb 26, 2025

I was looking into our implementation and there is a blocker at the moment that we need to resolve first.

In our current implementation we have two methods, getColdStart() and isColdStart(), with the latter as an alias for the former. The way getColdStart is implemented is that it changes the flag of the coldStart private field:

  public getColdStart(): boolean {
    if (this.coldStart) {
      this.coldStart = false;

      return true;
    }

    return false;
  }

Setting aside the rather odd design of adding a side effect to a get* function, it has worked well for our use case so far. Until... you only want to read the coldStart field and not flip the flag again. Thus, we need to make changes to the Utility class in a separate PR before we can proceed.

We can either change the coldStart field to protected, add a new function that reads the field (without changing it), or change isColdStart to return what its name implies. Let's move the discussion to a separate issue.
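For illustration, the third option could be as small as this sketch (not a merged API; names follow the comment above):

class Utility {
  private coldStart = true;

  // existing behavior: reads AND flips the flag
  public getColdStart(): boolean {
    if (this.coldStart) {
      this.coldStart = false;
      return true;
    }
    return false;
  }

  // proposed: a pure read, so callers can check the phase
  // without consuming the cold start
  public isColdStart(): boolean {
    return this.coldStart;
  }
}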

github-actions bot

⚠️ COMMENT VISIBILITY WARNING ⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

@github-actions github-actions bot added pending-release This item has been merged and will be released soon and removed discussing The issue needs to be discussed, elaborated, or refined labels Feb 27, 2025

github-actions bot commented Mar 7, 2025

This is now released under v2.16.0 version!

@github-actions github-actions bot added completed This item is complete and has been merged/shipped and removed pending-release This item has been merged and will be released soon labels Mar 7, 2025
@dreamorosi dreamorosi moved this from Coming soon to Shipped in Powertools for AWS Lambda (TypeScript) Mar 7, 2025