upserting a pipeline kills caching #2736
Comments
This is a result of Processors in the SDK appending timestamps to their output paths. This changes the input arguments to the ProcessingStep and results in a cache miss when running the pipeline. You can override the automatically generated output path by setting the `destination` of each `ProcessingOutput` explicitly.
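For example, a minimal sketch (the role, bucket, and script path are placeholders, and `SKLearnProcessor` just stands in for whatever processor you use):

```python
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step_process = ProcessingStep(
    name="PreprocessAbaloneData",
    processor=sklearn_processor,
    code="abalone/preprocessing.py",
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            # Pin the destination so the SDK does not generate a timestamped
            # S3 path, which would change the step arguments and break caching.
            destination="s3://my-bucket/abalone/processing/train",
        ),
    ],
)
```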
Note that this will fix your caching issue, but a new problem is introduced: All executions of this pipeline will write to the same S3 path. A new execution will overwrite the output of an old execution (generally OK), but concurrent pipeline executions will both be writing to the same S3 path at the same time (can break things if your processing jobs are outputting statically named paths and files). To address that, you could parameterize the S3 path like this:
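A sketch of that parameterization (the parameter name and default value are made up for illustration):

```python
from sagemaker.workflow.parameters import ParameterString

# The default value keeps the path stable (cache-friendly); override it at
# execution time when concurrent runs need to write to separate paths.
processing_output_path = ParameterString(
    name="ProcessingOutputPath",
    default_value="s3://my-bucket/abalone/processing/train",
)

outputs = [
    ProcessingOutput(
        output_name="train",
        source="/opt/ml/processing/train",
        destination=processing_output_path,
    ),
]
```

Remember to add the parameter to the pipeline's `parameters=[...]` list so it can be overridden when starting an execution.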
At least this way you have some control over the behavior when starting an execution. If you want caching to work, you can start executions without overriding the parameter, so every run uses the same default path and the step arguments stay identical.
This still results in a cache miss. I believe the reason is that even the code script ("abalone/preprocessing.py") will be uploaded to a new, timestamped S3 location on each upsert, which changes the step arguments.
You're right, it turns out we are not uploading the code to a deterministic S3 location. This should be a simple fix which we'll prioritize. The only workaround I can offer now is providing an S3 URI for your `code` argument:
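A sketch of that workaround, uploading the script once to a fixed location (bucket and prefix are placeholders):

```python
from sagemaker.s3 import S3Uploader

# Upload the script once to a stable location so the SDK does not re-upload
# it under a new timestamped prefix on every upsert.
code_uri = S3Uploader.upload(
    local_path="abalone/preprocessing.py",
    desired_s3_uri="s3://my-bucket/abalone/code",
)

step_process = ProcessingStep(
    name="PreprocessAbaloneData",
    processor=sklearn_processor,
    code=code_uri,  # static S3 URI keeps the step arguments identical
    outputs=outputs,
)
```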
@staubhp what about for the training step? I am trying the Abalone example and still getting cache misses.
@giovannirescia-upwork Training has a different problem; it appends a timestamp to the name of the default profiler rule, which makes the step arguments differ between upserts.
If you're not using profiling, you can disable this in your estimator to get rid of that dynamic element:
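For instance, a minimal sketch with a generic `Estimator` (image URI and role are placeholders):

```python
from sagemaker.estimator import Estimator

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder
image_uri = "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest"  # placeholder

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/abalone/model",
    # With the profiler disabled, no default ProfilerReport rule (and no
    # timestamped rule name) is attached to the training job.
    disable_profiler=True,
)
```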
If you need profiling, you'll need to configure the profiler rule yourself with static values so the timestamped default isn't regenerated on every upsert.
@neilmcguigan Looks like the fix was merged, closing now. Feel free to re-open if you have any other concerns.
I'm wondering if there's going to be any better solution than this. Sometimes we don't know whether an execution will need to execute concurrently with another execution or not.

My preferred behavior, which would solve this problem, would be something like this: the output paths of a step shouldn't be part of the cache key; rather, they should be copied from the cache hit when applicable. So, if the step is a cache miss and it executes, its output path will be what was assigned, for example a path with a timestamp or unique identifier, so we're not conflicting with any other path. If, however, the step is a cache hit, it should inherit the output paths from the step that was cached - after all, its outputs are obviously the same, since the step isn't being executed. This seems more in line with how a cache should work: a sort of key-value store. In the current implementation, the cache isn't used to retrieve any outputs, which I'd argue is counter-intuitive.

Consider now how this would handle 2 concurrent executions of a step. If both executions are a cache miss, by this logic, both executions complete and both write to their own unique location. Both are then valid for any downstream pipeline steps. Future executions of the step will cache hit and retrieve one of these executions, the one that finished last. That execution's outputs will now be the (skipped due to caching) step's outputs. For example, if we need this step's output as the input to another step, the information will be propagated.

Right now, this kind of logic doesn't work because setting a unique output location (or omitting the explicit output location) for a step causes it to always cache miss. Is there any way to exclude the outputs from the cache key so that caching can work like this?
@staubhp For the training step, the suggested fix does not work and the step is not cached when the pipeline is upserted.
Describe the bug
Upserting a pipeline always causes cache misses
To reproduce
Create a pipeline with caching enabled. Run the pipeline. Upsert the pipeline. Run the pipeline again. You'll get a cache miss every time
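For reference, a minimal sketch of that sequence (it reuses `step_process` from the sketches above, with a `CacheConfig` attached; the role ARN is a placeholder):

```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")  # ISO 8601 duration

# step_process = ProcessingStep(..., cache_config=cache_config)  # as defined above

pipeline = Pipeline(name="AbalonePipeline", steps=[step_process])

pipeline.upsert(role_arn=role)
pipeline.start()                 # first run: steps execute

pipeline.upsert(role_arn=role)   # upsert again without changing anything
pipeline.start()                 # expected: cache hit; observed: cache miss
```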
Expected behavior
I expect a cache hit, even if I upsert a pipeline