Build: upload artifacts atomically (support instant rollbacks) #9947
Comments
It doesn't look like rclone has that "feature", and neither does AWS: https://stackoverflow.com/questions/5877190/does-amazon-s3-support-multi-file-atomic-uploads (they do support it per-file). I'd say we should just focus on retries; rclone has a default of 3 retries: https://rclone.org/docs/#retries-int |
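As a rough sketch (assuming a plain `subprocess` call rather than whatever wrapper the storage backend actually uses), passing an explicit retry count to `rclone sync` could look like this:

```python
# A minimal sketch, not the actual Read the Docs storage code: sync build
# artifacts to the remote and let rclone retry the whole sync on failure.
import subprocess

def sync_artifacts(local_dir: str, remote: str, retries: int = 3) -> None:
    subprocess.run(
        [
            "rclone", "sync", local_dir, remote,
            "--retries", str(retries),      # retry the full sync (rclone defaults to 3)
            "--low-level-retries", "10",    # per-operation retries (rclone defaults to 10)
        ],
        check=True,  # raise on a non-zero exit code so the build can be failed
    )
```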
Sounds good as a first step to me 👍🏼. We can also add |
Yea, I think the final version of this would be something like:
Alternatively we could keep the build at … Definitely some interesting possible ideas here. |
* Build: rclone retries when uploading artifacts. Related #9947
* Update readthedocs/storage/rclone.py
Besides the artifacts sync being atomic, I'd like to add to that "atomicity" the update of all the objects in the database. This includes, for example, updating the objects touched in readthedocs.org/readthedocs/projects/tasks/builds.py, lines 582 to 601 (at 362f367).
I'd consider the following architecture:
This is probably something we should keep in mind. It seems not too hard to implement (it requires a new …). Actually, if we implement this HASH idea, the "web task" I proposed before is not required, since the latest API call to update the … |
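For illustration, a minimal sketch of this idea, assuming hypothetical `served_prefix`/`built_commit` fields that don't exist today: artifacts are uploaded under a per-commit prefix, and the database is updated in a single transaction once the upload succeeds.

```python
# Hypothetical sketch only: these fields and this helper are not part of the
# current models. The point is that the DB switch is a single atomic write.
from django.db import transaction

def publish_build(version, build, commit):
    prefix = f"{version.project.slug}/{version.slug}/{commit}"  # artifacts already uploaded here
    with transaction.atomic():
        version.served_prefix = prefix   # hypothetical field: what the serving layer reads
        version.built_commit = commit    # hypothetical field: enables rollbacks later
        version.save(update_fields=["served_prefix", "built_commit"])
        build.success = True
        build.save(update_fields=["success"])
```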
Yea, I think the DB updates should certainly be atomic as well. Your architecture makes sense. 👍 |
Cool! I put it in Q1, but it could be Q2 as well. We will have more data about uploading errors once we enable … |
Noted in chat, but unfortunately S3 doesn't support move operations on directories. It's quite annoying in this regard. The directory in S3 is more like a filename prefix than directory structure. There is a "move" operation on files, but then we're in the same problem with non-atomic moves. Further, I'm mostly certain "move" isn't even a rename operation, but instead copy and delete operations -- so there will be a further penalty here in addition. I may be wrong here too, but this particular issue goes back a ways. Here's an SO thread with a lot of the same answers: https://stackoverflow.com/questions/21184720/how-to-rename-files-and-folder-in-amazon-s3 |
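To make the copy-and-delete point concrete, a hedged sketch using plain boto3 (not the code we run): there is no single "move" call, so a failure between the two requests leaves the object under both keys.

```python
# Sketch of an S3 "move": boto3 only exposes copy + delete, so this pair of
# calls is not atomic, and a failure in between leaves the object in both places.
import boto3

s3 = boto3.client("s3")

def move_object(bucket: str, src_key: str, dst_key: str) -> None:
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    s3.delete_object(Bucket=bucket, Key=src_key)  # second, separate request
```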
This would likely be the best option. If S3 supported directory move operations, atomic bulk file operations, or even just symlinks, we'd have much less work here. Manuel noted a good point, though: we'd lose the benefit of rclone in this case. |
Yea, I think storing it by commit is likely going to be a benefit, but there are certainly some tradeoffs around bandwidth savings. There are some smarter things we could do here by comparing MD5s and the like across versions, but that's something to think more about when we go to actually implement this.
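One possible shape for the MD5 comparison idea (a sketch only; multipart uploads would need different handling because their ETags are not plain MD5 hashes):

```python
# Sketch: skip re-uploading a file whose content already matches the stored
# object, comparing a local MD5 against the S3 ETag (single-part uploads only).
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def needs_upload(bucket: str, key: str, local_path: str) -> bool:
    with open(local_path, "rb") as fh:
        local_md5 = hashlib.md5(fh.read()).hexdigest()
    try:
        remote_etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    except ClientError:
        return True  # object doesn't exist yet (or we can't check): upload it
    return remote_etag != local_md5
```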
What's the current feeling about this issue? IMO, I feel pretty happy with our current implementation of … It's worth mentioning that HTML pages that include the commit hash in them won't have this benefit and will need to be re-uploaded each time. See readthedocs/readthedocs-sphinx-ext#119 and https://github.com/readthedocs/sphinx_rtd_theme/blob/master/sphinx_rtd_theme/footer.html#L36 |
Same, I haven't seen errors at upload in a while, and I suspect that if we did, they would be intermittent or part of a larger failure, like S3 being down, etc. I feel we should probably just put our efforts into ensuring we're catching errors at the upload step of the build process and adjust as we need. Atomic updates to hosted files will take a lot of engineering, and we probably have a number of easier solutions to try if we do start noticing more failures here. |
Yea, I think the downsides around upload speed have been addressed. The main benefits of this work would be some product features we could enable (e.g. instant version rollbacks: #9947 (comment)), but they are quite large changes, and probably downstream of a larger architectural change. So it's probably not worth prioritizing in the near future unless we get strong user demand for these features. |
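To show what "instant version rollbacks" could look like on top of the per-commit prefixes sketched earlier (same hypothetical fields, not current code):

```python
# Hypothetical: rolling back means pointing the version at an older, still
# uploaded prefix instead of re-uploading or deleting anything.
def rollback(version, previous_commit):
    version.served_prefix = f"{version.project.slug}/{version.slug}/{previous_commit}"
    version.built_commit = previous_commit
    version.save(update_fields=["served_prefix", "built_commit"])
```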
We fail the build when this happens, so we clearly communicate to users that there was a problem and that the "build output could be corrupted". We could add more docs around it to tell them what to do (basically, trigger another build to overwrite all the files again 😄). Also, these errors are logged to Sentry and New Relic as well, so we can check how often they happen. We have had only 1 error since February: https://onenr.io/08wpYB5BWRO |
This is a really good enterprise feature 👍🏼. As you said, the work required for this is not trivial, and I wouldn't start doing anything unless we get more traction from users and can monetize this feature somehow. I'm removing it from the roadmap for now and de-prioritizing it, but we could probably close it as well. |
I think it's still a useful feature to build eventually, so I think we can keep it open, but 👍 on not working on it now. |
When uploading `_readthedocs/` paths, if we have a problem uploading any file, we fail the build. This means that some artifact files may have been uploaded while others were not, for the same build. This results in inconsistent output in production (half of the HTML files may be updated, and the other half not).

Is there a way to make this upload atomic with the rclone/boto3 API?

Reference: #9941 (review)
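A minimal illustration of the failure mode (hypothetical helper name; the real upload goes through the storage backend):

```python
from pathlib import Path

def upload_all(output_dir: str) -> None:
    # Files go up one at a time, so an exception partway through leaves a mix
    # of new and old objects under the same prefix in production.
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file():
            upload_file(path)  # hypothetical per-file upload helper; may raise mid-loop
```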