Build: upload artifacts atomically (support instant rollbacks) #9947

Open
humitos opened this issue Jan 26, 2023 · 15 comments
Labels
Feature New feature Needed: design decision A core team decision is required Priority: low Low priority

Comments

@humitos
Member

humitos commented Jan 26, 2023

When uploading _readthedocs/ paths, if we have a problem uploading any file, we fail the build. This means it can happen that, for the same build, some artifact files were uploaded while others were not. This results in inconsistent output in production (half of the HTML files may be updated while the other half are not).

  • Is it possible to do atomic uploads using rclone / boto3 API?
  • What's the best way to communicate this non-atomic behavior to our users?
  • In case we detect that the upload failed at some point in the process, should we retry after some minutes?

Reference: #9941 (review)

@stsewd
Member

stsewd commented Jan 26, 2023

Doesn't look like rclone has that "feature", and AWS doesn't either: https://stackoverflow.com/questions/5877190/does-amazon-s3-support-multi-file-atomic-uploads (they do have this per-file).

I'd say we should just focus on the retries; rclone has a default of 3 retries: https://rclone.org/docs/#retries-int

@humitos
Member Author

humitos commented Jan 26, 2023

Sounds good as a first step to me 👍🏼 . We can also add --retries-sleep=1, so we don't retry immediately after a failure but instead wait at least a little.
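For reference, the two flags discussed above could be wired up along these lines. This is only a sketch: `build_rclone_command` and `sync_directory` are hypothetical names, not the actual API in `readthedocs/storage/rclone.py`.

```python
import subprocess


def build_rclone_command(source, destination, retries=3, retries_sleep="1s"):
    # --retries re-runs the whole sync if it fails; --retries-sleep waits
    # between attempts instead of retrying immediately.
    return [
        "rclone", "sync",
        f"--retries={retries}",
        f"--retries-sleep={retries_sleep}",
        source,
        destination,
    ]


def sync_directory(source, destination):
    # check=True raises CalledProcessError if rclone still fails after all
    # retries, so the build can be marked as failed rather than silently
    # serving a half-uploaded set of files.
    subprocess.run(build_rclone_command(source, destination), check=True)
```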

@ericholscher
Member

ericholscher commented Jan 26, 2023

Yea, I think the final version of this would be something like:

  • Upload each format into a commit-hash or build-id directory in S3 (<project>/<hash>/pdf/)
  • At the end of the build, rename that directory to the "latest" one, or use some other way of handling an atomic rename with zero downtime
  • Only mark the build as passed once this "swap" has been done

Alternatively we could keep the build at <project>/<hash>/<format>, and swap some DB record that tells proxito where to point. There are a couple of other cool features this could give us, like atomic rollbacks for X number of builds, if we wanted. 🤔

Definitely some interesting ideas here.

humitos added a commit that referenced this issue Jan 27, 2023
* Build: rclone retries when uploading artifacts

Related #9947

* Update readthedocs/storage/rclone.py
@humitos humitos added the Improvement Minor improvement to code label Jan 30, 2023
@humitos
Member Author

humitos commented Jan 30, 2023

Besides making the artifacts sync atomic, I'd like to include in that atomicity the update of all the related objects in the database. This includes, for example, updating the Version object under on_success at:

# NOTE: we are updating the db version instance *only* when
# TODO: remove this condition and *always* update the DB Version instance
if "html" in valid_artifacts:
    try:
        api_v2.version(self.data.version.pk).patch(
            {
                "built": True,
                "documentation_type": self.data.version.documentation_type,
                "has_pdf": "pdf" in valid_artifacts,
                "has_epub": "epub" in valid_artifacts,
                "has_htmlzip": "htmlzip" in valid_artifacts,
            }
        )
    except HttpClientError:
        # NOTE: I think we should fail the build if we cannot update
        # the version at this point. Otherwise, we will have inconsistent data
        log.exception(
            "Updating version db object failed. "
            'Files are synced in the storage, but "Version" object is not updated',
        )

I'd consider the following architecture:

  1. the build instance
  • performs all the build steps
  • uploads the artifacts to <hash> (as Eric suggested)
  • triggers a Celery task to finalize the build
  2. the web instance
  • executes this task
  • moves the <hash> directory to its final place
  • updates the required db objects (hitting the db directly, instead of going through the API)

Alternatively we could keep the build at <project>/<hash>/<format>, and swap some DB record that tells proxito where to point. There's a couple other cool features this could give us like atomic rollbacks for X number of builds, if we wanted. 🤔

This is probably something we should keep in mind. It doesn't seem too hard to implement (it requires a new Version.hash field that would be kept updated with the hash of the latest build performed for that version). Definitely an extra feature, though.

Actually, if we implement this hash idea, the "web task" I proposed above isn't required, since the last API call to update the Version object will dictate whether the build succeeded or failed 👍🏼 . It reduces the complexity 💯
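The pointer-swap idea could look something like the following minimal sketch. These are simplified stand-ins, not the real Read the Docs models: the `hash` field, `storage_path`, and `finalize_build` are all hypothetical here.

```python
class Version:
    """Simplified stand-in for the DB Version model, with the proposed
    (hypothetical) ``hash`` field pointing at the latest successful build."""

    def __init__(self, slug):
        self.slug = slug
        self.hash = None


def storage_path(project_slug, version):
    # Proxito would resolve requests using the hash recorded in the DB,
    # e.g. <project>/<hash>/html/, instead of a fixed per-version path.
    return f"{project_slug}/{version.hash}/html/"


def finalize_build(version, build_hash):
    # The artifacts were already uploaded under ``build_hash``; flipping
    # this single DB field is the atomic "swap". An instant rollback is
    # just setting ``hash`` back to a previous build's value.
    version.hash = build_hash
```

Because the files for each build stay under their own hash directory, rolling back N builds is a single field update rather than a re-upload.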

@ericholscher
Member

Yea, I think the DB updates should certainly be atomic as well. Your architecture makes sense. 👍

@humitos humitos added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Jan 31, 2023
@humitos
Member Author

humitos commented Jan 31, 2023

Cool! I put it in Q1, but it could be Q2 as well. We will have more data about upload errors once we enable rclone for all projects. I'm assuming that rclone + retries will reduce the errors a lot, so I'm not considering this issue a huge priority.

@agjohnson
Contributor

Noted in chat, but unfortunately S3 doesn't support move operations on directories. It's quite annoying in this regard. A directory in S3 is more like a filename prefix than a real directory structure.

There is a "move" operation on files, but then we're back to the same problem of non-atomic moves. Further, I'm fairly certain "move" isn't even a rename operation but a copy followed by a delete, so there will be an additional performance penalty on top.

I may be wrong here too, but this particular issue goes back a ways. Here's an SO thread with a lot of the same answers:

https://stackoverflow.com/questions/21184720/how-to-rename-files-and-folder-in-amazon-s3
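To illustrate the point above, here's a hedged sketch of what a directory "rename" on S3 actually entails. The function names are hypothetical, and `client` is assumed to be a boto3 S3 client; the key point is that every object is copied and deleted individually, so the operation is neither atomic nor free.

```python
def replace_prefix(key, src_prefix, dst_prefix):
    # S3 keys are plain strings; the "directory" is just a shared prefix.
    assert key.startswith(src_prefix)
    return dst_prefix + key[len(src_prefix):]


def rename_prefix(client, bucket, src_prefix, dst_prefix):
    """Emulate a directory rename on S3 (client: a boto3 S3 client).

    S3 has no rename operation: each object under the prefix must be
    copied to its new key and then deleted, one by one. A failure
    partway through leaves the "directory" half-moved.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            client.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": src_key},
                Key=replace_prefix(src_key, src_prefix, dst_prefix),
            )
            client.delete_object(Bucket=bucket, Key=src_key)
```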

@agjohnson
Contributor

agjohnson commented Feb 16, 2023

Alternatively we could keep the build at <project>/<hash>/<format>, and swap some DB record that tells proxito where to point.

This would likely be the best option. If S3 supported directory move operations, atomic bulk file operations, or even just symlinks, we'd have much less work here. However, Manuel noted a good point: we lose the benefit of rclone in this case.

@ericholscher
Member

Yea, I think storing it by commit is likely going to be a benefit, but there are certainly some tradeoffs around bandwidth savings. There are some smarter things we could do here, like comparing md5s across versions, but that's something to think more about when we actually go to implement this. (/rtd-builds/<project>/_static/ could exist outside of the version, for example, but it gets weird/complicated pretty quickly.)
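The md5-comparison idea mentioned above could look something like this sketch. The function name and data shapes are hypothetical; this is essentially the checksum comparison rclone already does, recast as a minimal example.

```python
import hashlib


def files_to_upload(local_files, remote_checksums):
    """Return the paths whose content differs from what's already stored.

    local_files: mapping of path -> file content (bytes).
    remote_checksums: mapping of path -> md5 hex digest already in storage.
    Only files whose md5 differs (or that are missing remotely) need to
    be re-uploaded, saving bandwidth on unchanged assets.
    """
    changed = []
    for path, content in local_files.items():
        digest = hashlib.md5(content).hexdigest()
        if remote_checksums.get(path) != digest:
            changed.append(path)
    return changed
```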

@humitos
Member Author

humitos commented Jul 3, 2023

What's the current feeling about this issue?

IMO, I'm pretty happy with our current implementation of rclone. I haven't seen errors around it, or non-synced documentation after a build; rclone has been working pretty reliably. Thanks to the bandwidth savings, we are giving a better UX to our users, in particular to projects that have a lot of files and include a bunch of images.

It's worth mentioning that HTML pages that include the commit hash in them won't get this benefit and will need to be re-uploaded each time. See readthedocs/readthedocs-sphinx-ext#119 and https://github.com/readthedocs/sphinx_rtd_theme/blob/master/sphinx_rtd_theme/footer.html#L36

@agjohnson
Contributor

Same, I haven't seen errors at upload in a while, and I suspect that if we did it would be intermittent or a larger failure, like S3 being down/etc.

I feel we should probably just put our efforts into ensuring we're catching errors in the build process at the upload step, and adjust as needed. Atomic updates to hosted files will take a lot of engineering, and we probably have a number of easier solutions to try if we do start noticing more failures here.

@ericholscher
Member

ericholscher commented Jul 5, 2023

Yea, I think the downsides of the upload speed have been addressed. The main benefits from this work would be some product features we could enable (eg. instant version rollbacks: #9947 (comment)), but they are quite large changes, and probably downstream of a large architectural change. So probably not worth prioritizing in the near future unless we get strong user demand for these features.

@humitos
Member Author

humitos commented Jul 6, 2023

@agjohnson

I feel that we should probably just be putting our efforts into ensuring we're catching errors with the build process at the upload step and adjusting as we need

We are failing the build when this happens, so we are clearly communicating to users that there was a problem and that the "build output could be corrupted". We could add more docs around this to tell them what to do (basically, trigger another build to overwrite all the files again 😄 ).

Also, these errors are being logged to Sentry and New Relic as well, so we can check how often they are happening. We had only 1 error since February: https://onenr.io/08wpYB5BWRO

@humitos
Member Author

humitos commented Jul 6, 2023

@ericholscher

The main benefits from this work would be some product features we could enable (eg. instant version rollbacks)

This is a really good enterprise feature 👍🏼 . As you said, the work required for this is not trivial, and I wouldn't start doing anything unless we see more traction from users and can monetize this feature somehow. I'm removing it from the roadmap for now and de-prioritizing it, but we could probably close it as well.

@humitos humitos added Feature New feature Needed: design decision A core team decision is required and removed Improvement Minor improvement to code Accepted Accepted issue on our roadmap labels Jul 6, 2023
@humitos humitos added the Priority: low Low priority label Jul 6, 2023
@humitos humitos removed this from 📍Roadmap Jul 6, 2023
@humitos humitos changed the title Build: upload artifacts atomically Build: upload artifacts atomically (support instant rollbacks) Jul 6, 2023
@ericholscher
Member

I think it's still a useful feature to build eventually, so I think we can keep it open, but 👍 on not working on it now.
