Speed up OTLP proto gRPC exporter tests #4014

ocelotl · 2024-07-01T23:24:11Z

These tests had unnecessarily large delays, this PR reduces them.

For some unscientific proof this helps, I ran these tests in my laptop, with main it took it ~34s, with these enhancements, it took ~12s to run them.

Fixes open-telemetry#4013

LarsMichelsen · 2024-07-07T19:05:57Z

exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_metrics_exporter.py

            self.exporter.shutdown()
-            now = time.time()
+            now = time_ns()
            self.assertGreaterEqual(now, (start_time + 30 / 1000))


Either I don't understand the change or there is a new bug here. How can it be correct to change start_time and now from seconds to nanoseconds without changing the 30 / 1000?

Secondly, does anyone understand what this assert is supposed to do? I understand it somehow like: "Ensure that the shutdown took at least 30 seconds.". If this is the case, shouldn't it be 30 * 1000 then originally and now `30 * 1e9?

I think you're right. Would you be interested in submitting a PR with a fix?

I'd rather ping @ocelotl and ask to clean up after this recent change? True, the test seemed to be broken before, but the PR made it even worse.

Maybe I'd still do it, not sure yet. I actually came across this by trying to wrap my head around hanging shutdown issues as e.g. discussed in #3309, #3387 and #3398. If I find some time in the next days I might actually dig deeper into that topic and try to come up with an acceptable solution to that. In that process this test would also have to be touched anyways, I guess.

Thanks for reporting, @LarsMichelsen let me see what can I do here... ✌️

TL;DR The time assert in the test was wrong. Looking at the original commit (c84ba9) of this test, the delay of the mocked backoff was 4 seconds fixed with 6 iterations. So the time until shutdown returned was around 24 seconds. The default timeout of the lock wait in shutdown did not hit before the retry of the exporter was over. The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed to be wrong from the beginning. It just ensured that the shutdown call took at least 30ms. I assume it was supposed to do something else. But to me it's not clear what exactly it should do. One explantion would be that it was intended to do `self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But that would mean: Shutdown should need at least 30 seconds to pass, which does not make sense imho. Another option would be that it was intended to do `self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That would mean: Shutdown should need at maximum 30 seconds to pass, which actually could make more sense. It would ensure that the call of `self.exporter.shutdown` returns after 30 seconds. My biggest issue here is, that it tests the happy path where the retry is over before the timeout while there is no mechanism to ensure that the exporter thread is stopped in a sensible time in case it's current exponential backoff is in a longer sleep already while the shutdown flag is set. If I understand it correctly, it can hit the beginning of a 32 second sleep, which could mean that a process needs a) 30 seconds to signal the shutdown and additional 32 seconds to finish the exporter thread. But that's something for another change. After all these words, looking at the current HEAD, the situation of the test case is basically the same as originally. The retry interval has been changed recently to be way shorter. The default lock timeout stayed the same at 30 seconds, but that does not matter too much since the retry mechanism is finished long before that. And the aforementioned assert is still wrong, with some additional problem from open-telemetry#4014.

ocelotl added tests Skip Changelog PRs that do not require a CHANGELOG.md entry labels Jul 1, 2024

ocelotl self-assigned this Jul 1, 2024

ocelotl requested a review from a team July 1, 2024 23:24

ocelotl force-pushed the issue_4013 branch 2 times, most recently from e68908a to ff7617e Compare July 1, 2024 23:32

Speed up OTLP proto gRPC exporter tests

4a58fab

Fixes open-telemetry#4013

ocelotl force-pushed the issue_4013 branch from ff7617e to 4a58fab Compare July 1, 2024 23:40

xrmx approved these changes Jul 2, 2024

View reviewed changes

lzchen merged commit 24095f9 into open-telemetry:main Jul 2, 2024
269 checks passed

LarsMichelsen reviewed Jul 7, 2024

View reviewed changes

LarsMichelsen mentioned this pull request Jul 10, 2024

Try to understand and fix test_shutdown_wait_last_export #4042

Open

11 tasks

ocelotl mentioned this pull request Jul 15, 2024

Wrong test value for default timeout in OTLP Exporter Mixin #4057

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up OTLP proto gRPC exporter tests #4014

Speed up OTLP proto gRPC exporter tests #4014

Uh oh!

ocelotl commented Jul 1, 2024 •

edited

Loading

Uh oh!

Uh oh!

LarsMichelsen Jul 7, 2024 •

edited

Loading

Uh oh!

pmcollins Jul 8, 2024

Uh oh!

LarsMichelsen Jul 8, 2024

Uh oh!

ocelotl Jul 8, 2024

Uh oh!

Uh oh!

Speed up OTLP proto gRPC exporter tests #4014

Speed up OTLP proto gRPC exporter tests #4014

Uh oh!

Conversation

ocelotl commented Jul 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

LarsMichelsen Jul 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pmcollins Jul 8, 2024

Choose a reason for hiding this comment

Uh oh!

LarsMichelsen Jul 8, 2024

Choose a reason for hiding this comment

Uh oh!

ocelotl Jul 8, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ocelotl commented Jul 1, 2024 •

edited

Loading

LarsMichelsen Jul 7, 2024 •

edited

Loading