Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Propagate Step Function Trace Context through Managed Services #573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propagate Step Function Trace Context through Managed Services #573
Changes from 6 commits
a38c77b
0a51475
5700d22
f524509
55ff075
d7daf0d
e490610
e31c8f8
dc4a283
f59de19
6222016
45ba945
dc43c9e
eec4743
1b350b9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about how the concatenation of two queues (e.g., SFN → EventBridge → SQS → Lambda) is handled. Is it achieved by extracting two different contexts in the Python tracer? Does this mean that it also supports SFN → EventBridge → SQS → SNS → Lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SFN → EventBridge → SQS → Lambda is handled the following way
SFN → SNS → SQS → Lambda is handled very similarly with another explicitly check in the SQS extractor looking for SNS events nested
We don't handle SFN → SQS → SNS → Lambda AFAIK but we wouldn't be able to handle SFN → EventBridge → SQS → SNS → Lambda out of the box either
But this is only because it's not explicitly handled. The current python layer implementation is messy because it relies on explicit handling. I think a perfect solution would be one where it's all handled recursively and customers can nest an arbitrary number of supported services without explicit handling
I think AWS team would like to do something like this in bottlecap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avedmala Thanks for the explanation. Very informative. I am guessing that a recursive solution should not be that complicated? @purple4reina @joeyzhao2018
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the "recursive solution", is it written down in any RFC? it sounds interesting and might be able to solve some other problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without (1) and with (2) this logic comparison
By default, the timestamp provided by eventbridge is only down to the second and this can cause a case where the span starts before the step function (1). If we use the state entered time, this can look less confusing to customers (2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
ps. I was confused about what
(1)
and(2)
mean when I first glanced the PR. Originally, I thought that's some logic before this section of code that got hidden on the github.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added numbers next to the pictures to make it clearer for other reviewers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important because we have the following code in
tracing.create_function_execution_span()
where
parent_span
is the generated inferred span so the Lambda's root span'sparent_id
will be set to the inferred span'sspan_id
If there is an upstream Step Function and we saved its trace context in
dd_trace_context
, we want to preserve the parenting relationship and not let the inferred span completely erase itThis line solves the issue by making the inferred span be a child of the upstream service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can memoize this function? It looks like it can potentially be called several times in the course of a single invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, or it looks like the function can be called multiple times per invocation, but with different "events" each time? If that's true, then we can probably leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea!
Correct me if I'm wrong but does the layer only handle one
event
per invocation? Or if it's a busy Lambda does it stay alive and potentially handle hundreds of events?Just wondering to get an idea of how large to make the cache. I guess it can be pretty small anyway since each event is new and we don't repeat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each runtime instance will only ever handle one event at a time. It never handles two events concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah just realized we can't memoize it because
event
is adict
and mutable types are unhashableWe could serialize the dict and use that but I'm thinking that'd be much slower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like these comments. For someone who hasn't work on step functions for a while, these comments help me recollect these historical context. It'll help future maintenance of the code as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this unwrapping to happen inside of
tracing.extract_context_from_step_functions()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for putting the tests here which also make the code easier to understand for the future.