
No Issue #4338


Closed
iamjosephmj opened this issue Jan 23, 2025 · 2 comments

Comments

@iamjosephmj

iamjosephmj commented Jan 23, 2025

-Issue Deleted-

@dkhalanskyjb
Collaborator

Thanks for the suggestion! I don't fully understand the use cases this is supposed to help with. Could you clarify them?

Track when coroutines start, suspend, resume, and complete.

Why is this useful?

Log or capture metrics without manually instrumenting each coroutine builder.

Which specific metrics do you want to capture and why?

Diagnose concurrency or performance issues in environments where taking snapshots (e.g., DebugProbes.dumpCoroutines()) is too heavy.

In which cases is making a snapshot too heavy, but adding extra code to each coroutine resumption isn't? With snapshots, you only pay for the coroutines that exist at the moment the snapshot is made, but with the instrumentation you propose, every single coroutine will have to pay the price all the time. With DebugProbes, there is also some extra overhead for each suspension/resumption, but it's negligible compared to println (which you provided as an example).
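For context, "manually instrumenting each coroutine builder" today could look like the following minimal sketch. The `traced` helper is hypothetical, not a kotlinx.coroutines API; it only assumes kotlinx.coroutines on the classpath:

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper: logs around a suspending block, illustrating
// the manual instrumentation the proposal wants to avoid. Every call
// pays this cost, whether or not anyone is debugging.
suspend fun <T> traced(name: String, block: suspend () -> T): T {
    println("[$name] started")
    try {
        return block()
    } finally {
        println("[$name] completed")
    }
}

fun main() = runBlocking {
    val result = traced("fetch") {
        delay(10) // stands in for real work
        42
    }
    println(result) // prints 42 after the start/complete lines
}
```

The maintainer's point is that a snapshot-based tool only charges the coroutines alive at dump time, whereas a wrapper like this charges every coroutine on every transition.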

@dkhalanskyjb
Collaborator

I'm very sorry if I'm wrong, but I get the impression your text is written by AI. If so, I must ask you to either stop using AI or close the issue, as discussing this problem with ChatGPT has no chance of leading to any useful insights.

Let's assume that you didn't use AI, have an actual business need, and can explain it.

The examples of metrics that you propose are still unclear to me. Why should the average suspension be an indicator of some issue? If you make many network requests, for example, the average suspension will be long. If you replace an inefficient spinlock on data with a suspension, the average suspension in your program will be longer, but the resource utilization will be improved. Likewise, average completion times will mostly depend on what kind of work coroutines do, not on how efficiently they do it. Cancellations are also not a sign of anything going wrong.

Please provide a concrete example of which insights about your program you're hoping to gain from coroutine metrics.

An ongoing record of coroutine transitions lets us reconstruct what changed if incidents occur

This idea is especially suspicious. If your records are so robust that you can retrace the execution using them, it means that this logging is not fire-and-forget but something that must ensure the log entry actually gets stored, but this means the coroutine will have to stop all useful work to ensure robust logging. This will lead to a tremendous performance degradation.

Additionally, live tracking helps catch anomalies right as they happen

None of the things you list seem like anomalies to me.

sudden spikes in suspension duration may hint at lock contention or I/O slowdowns

I/O slowdowns can be measured more directly by looking at the network/filesystem utilization using the operating system tools.

large-scale timeouts or parent scope cancellations

Ideally, timeouts should propagate to a global exception handler where, yes, it makes sense to write something to the log. It's a long-standing issue that our timeouts don't work like that: #1374. This is unrelated to the current discussion, though.
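As a sketch of where such logging would live, an application-level `CoroutineExceptionHandler` (an existing kotlinx.coroutines API) lets you log uncaught failures in one place instead of instrumenting each coroutine; the scope setup here is illustrative:

```kotlin
import kotlinx.coroutines.*

// One central place to log failures that escape coroutines,
// rather than per-transition instrumentation.
val handler = CoroutineExceptionHandler { _, e ->
    println("uncaught: ${e.message}") // write to your real log here
}

fun main() = runBlocking {
    // SupervisorJob keeps one child's failure from cancelling siblings.
    val scope = CoroutineScope(SupervisorJob() + handler)
    scope.launch { error("boom") }.join() // handler prints "uncaught: boom"
}
```

Note that, per issue #1374, timeouts from `withTimeout` do not currently reach such a handler, which is the long-standing problem the comment refers to.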

gradual increases in suspension time after releases

As I mentioned, these can indicate a change in how coroutines are used. End-to-end performance metrics seem like a better fit for this.

insufficient threads for CPU-bound tasks

This will be reflected in end-to-end performance and in the CPU utilization, both of which are metrics that are easier to measure directly.

mismatches between created vs. completed

This makes a lot of sense for threads, but with coroutines, structured concurrency is heavily encouraged, and when used correctly, solves the issue of leaking computations by construction.
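To illustrate the "by construction" point: `coroutineScope` (a standard kotlinx.coroutines builder) returns only after every child has completed, so a created-vs-completed mismatch cannot accumulate silently. This is a minimal sketch:

```kotlin
import kotlinx.coroutines.*

// coroutineScope does not return until both async children finish,
// so every coroutine created here is also completed here.
suspend fun loadBoth(): Pair<Int, Int> = coroutineScope {
    val a = async { 1 }
    val b = async { 2 }
    a.await() to b.await()
}

fun main() = runBlocking {
    println(loadBoth()) // prints (1, 2)
}
```

If a child fails or is cancelled, the scope cancels the rest and rethrows, so nothing leaks either way.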

@iamjosephmj iamjosephmj changed the title Lightweight Tracing / Debugging Hooks for Coroutines No Issue Jan 23, 2025