Build: upload artifacts atomically (support instant rollbacks) #9947

Open
humitos opened this issue Jan 26, 2023 · 15 comments
Labels
Feature New feature Needed: design decision A core team decision is required Priority: low Low priority

Comments

@humitos
Member

humitos commented Jan 26, 2023

When uploading _readthedocs/ paths, if we have a problem uploading any file, we fail the build. This means it can happen that, for the same build, some artifact files were uploaded while others were not. This results in inconsistent output in production (half of the HTML files may be updated while the other half are not).

  • Is it possible to do atomic uploads using rclone / boto3 API?
  • What's the best way to communicate this non-atomic behavior to our users?
  • In case we detect that the upload failed at some point in the process, should we retry after some minutes?

Reference: #9941 (review)

@stsewd
Member

stsewd commented Jan 26, 2023

Doesn't look like rclone has that "feature", and AWS doesn't either: https://stackoverflow.com/questions/5877190/does-amazon-s3-support-multi-file-atomic-uploads (they do have this per-file).

I'd say we should just focus on the retries; rclone has a default of 3 retries: https://rclone.org/docs/#retries-int

@humitos
Member Author

humitos commented Jan 26, 2023

Sounds good as a first step to me 👍🏼 . We can also add --retries-sleep=1, so we don't retry immediately after a failure but instead wait at least a little.
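For reference, the two flags discussed above could be wired up along these lines. This is only a sketch: `build_rclone_command` and `sync_directory` are hypothetical names, not the actual API in `readthedocs/storage/rclone.py`.

```python
import subprocess


def build_rclone_command(source, destination, retries=3, retries_sleep="1s"):
    # --retries re-runs the whole sync if it fails; --retries-sleep waits
    # between attempts instead of retrying immediately.
    return [
        "rclone", "sync",
        f"--retries={retries}",
        f"--retries-sleep={retries_sleep}",
        source,
        destination,
    ]


def sync_directory(source, destination):
    # check=True raises CalledProcessError if rclone still fails after all
    # retries, so the build can be marked as failed rather than silently
    # serving a half-uploaded set of files.
    subprocess.run(build_rclone_command(source, destination), check=True)
```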

@ericholscher
Member

ericholscher commented Jan 26, 2023

Yea, I think the final version of this would be something like:

  • Upload each format into a commit-hash or build-id directory in S3 (<project>/<hash>/pdf/)
  • At the end of the build, rename that directory to the "latest" one, or use some other way of handling an atomic rename with zero downtime
  • Only mark the build as passed once this "swap" has been done

Alternatively we could keep the build at <project>/<hash>/<format>, and swap some DB record that tells proxito where to point. There are a couple of other cool features this could give us, like atomic rollbacks for X number of builds, if we wanted. 🤔

Definitely some interesting ideas here.

humitos added a commit that referenced this issue Jan 27, 2023
* Build: rclone retries when uploading artifacts

Related #9947

* Update readthedocs/storage/rclone.py
@humitos humitos added the Improvement Minor improvement to code label Jan 30, 2023
@humitos
Member Author

humitos commented Jan 30, 2023

Besides making the artifacts sync atomic, I'd like to include in that atomicity the update of all the related objects in the database. This includes, for example, updating the Version object under on_success at:

# NOTE: we are updating the db version instance *only* when
# TODO: remove this condition and *always* update the DB Version instance
if "html" in valid_artifacts:
    try:
        api_v2.version(self.data.version.pk).patch(
            {
                "built": True,
                "documentation_type": self.data.version.documentation_type,
                "has_pdf": "pdf" in valid_artifacts,
                "has_epub": "epub" in valid_artifacts,
                "has_htmlzip": "htmlzip" in valid_artifacts,
            }
        )
    except HttpClientError:
        # NOTE: I think we should fail the build if we cannot update
        # the version at this point. Otherwise, we will have inconsistent data
        log.exception(
            "Updating version db object failed. "
            'Files are synced in the storage, but "Version" object is not updated',
        )

I'd consider the following architecture:

  1. the build instance
  • performs all the build steps
  • uploads the artifacts to <hash> (as Eric suggested)
  • triggers a Celery task to finalize the build
  2. the web instance
  • executes this task
  • moves the <hash> directory to its final place
  • updates the required db objects (hitting the db directly, instead of going through the API)

Alternatively we could keep the build at <project>/<hash>/<format>, and swap some DB record that tells proxito where to point. There's a couple other cool features this could give us like atomic rollbacks for X number of builds, if we wanted. 🤔

This is probably something we should keep in mind. It doesn't seem too hard to implement (it requires a new Version.hash field that would be kept updated with the hash of the latest build performed for that version). Definitely an extra feature, though.

Actually, if we implement this hash idea, the "web task" I proposed above isn't required, since the last API call to update the Version object will dictate whether the build succeeded or failed 👍🏼 . It reduces the complexity 💯
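The pointer-swap idea could look something like the following minimal sketch. These are simplified stand-ins, not the real Read the Docs models: the `hash` field, `storage_path`, and `finalize_build` are all hypothetical here.

```python
class Version:
    """Simplified stand-in for the DB Version model, with the proposed
    (hypothetical) ``hash`` field pointing at the latest successful build."""

    def __init__(self, slug):
        self.slug = slug
        self.hash = None


def storage_path(project_slug, version):
    # Proxito would resolve requests using the hash recorded in the DB,
    # e.g. <project>/<hash>/html/, instead of a fixed per-version path.
    return f"{project_slug}/{version.hash}/html/"


def finalize_build(version, build_hash):
    # The artifacts were already uploaded under ``build_hash``; flipping
    # this single DB field is the atomic "swap". An instant rollback is
    # just setting ``hash`` back to a previous build's value.
    version.hash = build_hash
```

Because the files for each build stay under their own hash directory, rolling back N builds is a single field update rather than a re-upload.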

@ericholscher
Member

Yea, I think the DB updates should certainly be atomic as well. Your architecture makes sense. 👍

@humitos humitos added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Jan 31, 2023
@humitos
Member Author

humitos commented Jan 31, 2023

Cool! I put it in Q1, but it could be Q2 as well. We will have more data about upload errors once we enable rclone for all projects. I'm assuming that rclone + retries will reduce the errors a lot, so I'm not considering this issue a huge priority.

@agjohnson
Contributor

Noted in chat, but unfortunately S3 doesn't support move operations on directories. It's quite annoying in this regard. A directory in S3 is more like a filename prefix than a real directory structure.

There is a "move" operation on files, but then we're back to the same problem of non-atomic moves. Further, I'm fairly certain "move" isn't even a rename operation but a copy followed by a delete, so there will be an additional performance penalty on top.

I may be wrong here too, but this particular issue goes back a ways. Here's an SO thread with a lot of the same answers:

https://stackoverflow.com/questions/21184720/how-to-rename-files-and-folder-in-amazon-s3
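To illustrate the point above, here's a hedged sketch of what a directory "rename" on S3 actually entails. The function names are hypothetical, and `client` is assumed to be a boto3 S3 client; the key point is that every object is copied and deleted individually, so the operation is neither atomic nor free.

```python
def replace_prefix(key, src_prefix, dst_prefix):
    # S3 keys are plain strings; the "directory" is just a shared prefix.
    assert key.startswith(src_prefix)
    return dst_prefix + key[len(src_prefix):]


def rename_prefix(client, bucket, src_prefix, dst_prefix):
    """Emulate a directory rename on S3 (client: a boto3 S3 client).

    S3 has no rename operation: each object under the prefix must be
    copied to its new key and then deleted, one by one. A failure
    partway through leaves the "directory" half-moved.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            client.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": src_key},
                Key=replace_prefix(src_key, src_prefix, dst_prefix),
            )
            client.delete_object(Bucket=bucket, Key=src_key)
```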

@agjohnson
Contributor

agjohnson commented Feb 16, 2023

Alternatively we could keep the build at <project>/<hash>/<format>, and swap some DB record that tells proxito where to point.

This would likely be the best option. If S3 supported directory move operations, atomic bulk file operations, or even just symlinks, we'd have much less work here. However, Manuel noted a good point: we lose the benefit of rclone in this case.

@ericholscher
Member

Yea, I think storing it by commit is likely going to be a benefit, but there are certainly some tradeoffs around bandwidth savings. There are some smarter things we could do here, like comparing md5s across versions, but that's something to think more about when we actually go to implement this. (/rtd-builds/<project>/_static/ could exist outside of the version, for example, but it gets weird/complicated pretty quickly.)
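The md5-comparison idea mentioned above could look something like this sketch. The function name and data shapes are hypothetical; this is essentially the checksum comparison rclone already does, recast as a minimal example.

```python
import hashlib


def files_to_upload(local_files, remote_checksums):
    """Return the paths whose content differs from what's already stored.

    local_files: mapping of path -> file content (bytes).
    remote_checksums: mapping of path -> md5 hex digest already in storage.
    Only files whose md5 differs (or that are missing remotely) need to
    be re-uploaded, saving bandwidth on unchanged assets.
    """
    changed = []
    for path, content in local_files.items():
        digest = hashlib.md5(content).hexdigest()
        if remote_checksums.get(path) != digest:
            changed.append(path)
    return changed
```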

@humitos
Member Author

humitos commented Jul 3, 2023

What's the current feeling about this issue?

IMO, I'm pretty happy with our current implementation of rclone. I haven't seen errors around it, or non-synced documentation after a build; rclone has been working pretty reliably. Thanks to the bandwidth savings, we are giving a better UX to our users, in particular to projects that have a lot of files and include a bunch of images.

It's worth mentioning that HTML pages that include the commit hash in them won't get this benefit and will need to be re-uploaded each time. See readthedocs/readthedocs-sphinx-ext#119 and https://github.com/readthedocs/sphinx_rtd_theme/blob/master/sphinx_rtd_theme/footer.html#L36

@agjohnson
Contributor

Same, I haven't seen errors at upload in a while, and I suspect that if we did it would be intermittent or a larger failure, like S3 being down/etc.

I feel we should probably just put our efforts into ensuring we're catching errors in the build process at the upload step, and adjust as needed. Atomic updates to hosted files will take a lot of engineering, and we probably have a number of easier solutions to try if we do start noticing more failures here.

@ericholscher
Member

ericholscher commented Jul 5, 2023

Yea, I think the downsides of the upload speed have been addressed. The main benefits from this work would be some product features we could enable (eg. instant version rollbacks: #9947 (comment)), but they are quite large changes, and probably downstream of a large architectural change. So probably not worth prioritizing in the near future unless we get strong user demand for these features.

@humitos
Member Author

humitos commented Jul 6, 2023

@agjohnson

I feel that we should probably just be putting our efforts into ensuring we're catching errors with the build process at the upload step and adjusting as we need

We are failing the build when this happens, so we are clearly communicating to users that there was a problem and that the "build output could be corrupted". We could add more docs around this to tell them what to do (basically, trigger another build to overwrite all the files again 😄 ).

Also, these errors are being logged to Sentry and New Relic as well, so we can check how often they are happening. We had only 1 error since February: https://onenr.io/08wpYB5BWRO

@humitos
Member Author

humitos commented Jul 6, 2023

@ericholscher

The main benefits from this work would be some product features we could enable (eg. instant version rollbacks)

This is a really good enterprise feature 👍🏼 . As you said, the work required for this is not trivial, and I wouldn't start doing anything unless we see more traction from users and can monetize this feature somehow. I'm removing it from the roadmap for now and de-prioritizing it, but we could probably close it as well.

@humitos humitos added Feature New feature Needed: design decision A core team decision is required and removed Improvement Minor improvement to code Accepted Accepted issue on our roadmap labels Jul 6, 2023
@humitos humitos added the Priority: low Low priority label Jul 6, 2023
@humitos humitos removed this from 📍Roadmap Jul 6, 2023
@humitos humitos changed the title Build: upload artifacts atomically Build: upload artifacts atomically (support instant rollbacks) Jul 6, 2023
@ericholscher
Member

I think it's still a useful feature to build eventually, so I think we can keep it open, but 👍 on not working on it now.
