Skip to content

Speed up OTLP proto gRPC exporter tests #4014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 2, 2024

Conversation

ocelotl
Copy link
Contributor

@ocelotl ocelotl commented Jul 1, 2024

Fixes #4013

These tests had unnecessarily large delays, this PR reduces them.

For some unscientific proof this helps, I ran these tests in my laptop, with main it took it ~34s, with these enhancements, it took ~12s to run them.

@ocelotl ocelotl added tests Skip Changelog PRs that do not require a CHANGELOG.md entry labels Jul 1, 2024
@ocelotl ocelotl self-assigned this Jul 1, 2024
@ocelotl ocelotl requested a review from a team July 1, 2024 23:24
@ocelotl ocelotl force-pushed the issue_4013 branch 2 times, most recently from e68908a to ff7617e Compare July 1, 2024 23:32
@lzchen lzchen merged commit 24095f9 into open-telemetry:main Jul 2, 2024
269 checks passed
self.exporter.shutdown()
now = time.time()
now = time_ns()
self.assertGreaterEqual(now, (start_time + 30 / 1000))
Copy link

@LarsMichelsen LarsMichelsen Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either I don't understand the change or there is a new bug here. How can it be correct to change start_time and now from seconds to nanoseconds without changing the 30 / 1000?

Secondly, does anyone understand what this assert is supposed to do? I understand it somehow like: "Ensure that the shutdown took at least 30 seconds.". If this is the case, shouldn't it be 30 * 1000 then originally and now `30 * 1e9?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. Would you be interested in submitting a PR with a fix?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather ping @ocelotl and ask to clean up after this recent change? True, the test seemed to be broken before, but the PR made it even worse.

Maybe I'd still do it, not sure yet. I actually came across this by trying to wrap my head around hanging shutdown issues as e.g. discussed in #3309, #3387 and #3398. If I find some time in the next days I might actually dig deeper into that topic and try to come up with an acceptable solution to that. In that process this test would also have to be touched anyways, I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reporting, @LarsMichelsen let me see what can I do here... ✌️

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this pull request Jul 10, 2024
TL;DR The time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was 4 seconds fixed with 6 iterations. So the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explantion would be that it was intended to do
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: Shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.

My biggest issue here is, that it tests the happy path where the retry
is over before the timeout while there is no mechanism to ensure that the
exporter thread is stopped in a sensible time in case it's current
exponential backoff is in a longer sleep already while the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a 32
second sleep, which could mean that a process needs a) 30 seconds to
signal the shutdown and additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
been changed recently to be way shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assert is still wrong, with some additional problem from open-telemetry#4014.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this pull request Aug 19, 2024
TL;DR The time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was 4 seconds fixed with 6 iterations. So the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explantion would be that it was intended to do
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: Shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.

My biggest issue here is, that it tests the happy path where the retry
is over before the timeout while there is no mechanism to ensure that the
exporter thread is stopped in a sensible time in case it's current
exponential backoff is in a longer sleep already while the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a 32
second sleep, which could mean that a process needs a) 30 seconds to
signal the shutdown and additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
been changed recently to be way shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assert is still wrong, with some additional problem from open-telemetry#4014.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this pull request Aug 26, 2024
TL;DR The time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was 4 seconds fixed with 6 iterations. So the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explantion would be that it was intended to do
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: Shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.

My biggest issue here is, that it tests the happy path where the retry
is over before the timeout while there is no mechanism to ensure that the
exporter thread is stopped in a sensible time in case it's current
exponential backoff is in a longer sleep already while the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a 32
second sleep, which could mean that a process needs a) 30 seconds to
signal the shutdown and additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
been changed recently to be way shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assert is still wrong, with some additional problem from open-telemetry#4014.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this pull request Aug 29, 2024
TL;DR The time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was 4 seconds fixed with 6 iterations. So the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explantion would be that it was intended to do
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: Shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.

My biggest issue here is, that it tests the happy path where the retry
is over before the timeout while there is no mechanism to ensure that the
exporter thread is stopped in a sensible time in case it's current
exponential backoff is in a longer sleep already while the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a 32
second sleep, which could mean that a process needs a) 30 seconds to
signal the shutdown and additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
been changed recently to be way shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assert is still wrong, with some additional problem from open-telemetry#4014.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this pull request Sep 7, 2024
TL;DR The time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was 4 seconds fixed with 6 iterations. So the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explantion would be that it was intended to do
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: Shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.

My biggest issue here is, that it tests the happy path where the retry
is over before the timeout while there is no mechanism to ensure that the
exporter thread is stopped in a sensible time in case it's current
exponential backoff is in a longer sleep already while the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a 32
second sleep, which could mean that a process needs a) 30 seconds to
signal the shutdown and additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
been changed recently to be way shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assert is still wrong, with some additional problem from open-telemetry#4014.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this pull request Sep 14, 2024
TL;DR The time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was 4 seconds fixed with 6 iterations. So the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explantion would be that it was intended to do
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: Shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.

My biggest issue here is, that it tests the happy path where the retry
is over before the timeout while there is no mechanism to ensure that the
exporter thread is stopped in a sensible time in case it's current
exponential backoff is in a longer sleep already while the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a 32
second sleep, which could mean that a process needs a) 30 seconds to
signal the shutdown and additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
been changed recently to be way shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assert is still wrong, with some additional problem from open-telemetry#4014.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Skip Changelog PRs that do not require a CHANGELOG.md entry tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opentelemetry-exporter-otlp-proto-grpc are super slow
5 participants