Store ePubs and PDFs in media storage #4947

davidfischer · 2018-12-03T21:16:31Z

Adds a syncer that stores PDFs, ePubs and zip files in media storage (Azure storage, S3, etc.). Media storage is set by DEFAULT_FILE_STORAGE although the syncer ignores this if the setting is backed by the local filesystem (which is the Django default).

This should allow us to move ~1-2TB of PDFs/ePubs/zips off to blob storage rather than being on the local webs. The download links now check if the item is in storage and fall back to checking disk. This will allow us to deploy the change without moving all the files. When a build for a pre-existing version happens, the result PDF/ePub/zip will be copied to media storage and then deleted off the local web.
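The "check storage, then fall back to disk" behaviour described above can be sketched roughly as follows. This is a framework-free illustration, not the code in this PR: `InMemoryStorage`, `resolve_download`, and the example URLs are hypothetical stand-ins, and only the `exists()`/`url()` method names mirror the Django storage API.

```python
import os


class InMemoryStorage:
    """Hypothetical stand-in exposing exists()/url() like a Django storage."""

    def __init__(self, names, base_url="https://media.example.com/"):
        self._names = set(names)
        self._base_url = base_url

    def exists(self, name):
        return name in self._names

    def url(self, name):
        return self._base_url + name


def resolve_download(storage, storage_path, local_path, local_url):
    """Return the URL to redirect to, or None if the artifact is missing.

    Storage is checked first; the local disk is only a fallback for
    artifacts that have not been copied to storage yet.
    """
    if storage.exists(storage_path):
        return storage.url(storage_path)
    if os.path.isfile(local_path):
        return local_url
    return None
```

Because the disk check stays in place, the change can be deployed before any files are moved.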

Overall, this concept where some build artifacts are stored in storage and others aren't is a bit weird (rather than being all on storage or all on web) but given that's what we want I'm not sure what the alternative is.

Testing this is somewhat challenging and I'm open to suggestions on making that part of this better.

codecov · 2018-12-03T21:29:32Z

Codecov Report

Merging #4947 into master will decrease coverage by 0.12%.
The diff coverage is 48.21%.

@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
- Coverage   76.93%   76.81%   -0.13%     
==========================================
  Files         158      158              
  Lines       10039    10084      +45     
  Branches     1259     1265       +6     
==========================================
+ Hits         7724     7746      +22     
- Misses       1981     2003      +22     
- Partials      334      335       +1

Impacted Files	Coverage Δ
readthedocs/projects/tasks.py	`71.2% <0%> (ø)`	⬆️
readthedocs/projects/models.py	`85.77% <100%> (+0.32%)`	⬆️
readthedocs/builds/syncers.py	`34.86% <27.58%> (-2.64%)`	⬇️
readthedocs/projects/views/public.py	`69.18% <71.42%> (-0.28%)`	⬇️

codecov · 2018-12-03T21:29:32Z

Codecov Report

Merging #4947 into master will increase coverage by 0.06%.
The diff coverage is 48.21%.

@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
+ Coverage   76.74%   76.81%   +0.06%     
==========================================
  Files         158      158              
  Lines        9951    10084     +133     
  Branches     1244     1265      +21     
==========================================
+ Hits         7637     7746     +109     
- Misses       1980     2003      +23     
- Partials      334      335       +1

Impacted Files	Coverage Δ
readthedocs/projects/tasks.py	`71.2% <0%> (ø)`	⬆️
readthedocs/projects/models.py	`85.77% <100%> (+0.21%)`	⬆️
readthedocs/builds/syncers.py	`34.86% <27.58%> (-2.64%)`	⬇️
readthedocs/projects/views/public.py	`69.18% <71.42%> (-0.28%)`	⬇️
readthedocs/restapi/views/model_views.py	`94.18% <0%> (-0.94%)`	⬇️
readthedocs/doc_builder/python_environments.py	`82.97% <0%> (-0.47%)`	⬇️
readthedocs/builds/models.py	`78.41% <0%> (-0.42%)`	⬇️
readthedocs/gold/views.py	`66.17% <0%> (ø)`	⬆️
readthedocs/projects/constants.py	`100% <0%> (ø)`	⬆️
... and 7 more

ericholscher · 2018-12-04T13:41:23Z

Sending this sync command to only a single web as opposed to all of them. I don't really know how best to do this.

Putting things on the web queue will make them happen on one random web, putting them on web0x will make them happen on a specific web, and using the broadcast function will make them happen on all webs (e.g. queue on web0[1-3]).
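To make the three routing options concrete, here is a small sketch. The function name, the mode names, and the queue names are all hypothetical; it only models which queues a task would land on, not how the task is actually dispatched.

```python
def target_queues(mode, webs, specific=None):
    """Return the queue names a task would be sent to.

    mode is one of:
      "any" - the shared web queue, picked up by one random web
      "one" - the queue of a specific web (given by `specific`)
      "all" - broadcast to the queue of every web
    """
    if mode == "any":
        return ["web"]
    if mode == "one":
        if specific not in webs:
            raise ValueError("unknown web: %r" % (specific,))
        return [specific]
    if mode == "all":
        return list(webs)
    raise ValueError("unknown mode: %r" % (mode,))
```

Uploading a PDF/ePub to blob storage corresponds to "any" or "one", while copying HTML to every web corresponds to "all".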

ericholscher · 2018-12-04T14:00:59Z

Testing this is somewhat challenging and I'm open to suggestions on making that part of this better.

I think we should be able to at least test the logic around redirecting and checking for storage along with local files. The actual build process stuff is definitely a bit trickier.

davidfischer · 2018-12-04T18:44:33Z

readthedocs/projects/views/public.py

-            '%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')))
-        return HttpResponseRedirect(path)
+        storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug)
+        if storage.exists(storage_path):


In the case where the item is not in storage but is on disk, this will result in an extra network call. storage.exists results in an API call to Azure (or wherever storage is backed). However, it should be reasonably fast since it's in the same data center. In the case where media is backed by the local file system, this is just an os.path.isfile so not a big deal.

That seems fine, and hopefully we will get files into the storage quickly. I'd be 👍 on syncing all the media files after we ship the code, so that we rarely hit this case, as well.

I did run into an issue locally, where I have a project with ~100 active versions. It required 100 calls to Azure in order to load the project downloads page. This feels less than ideal -- I think the only way to get around it is to denormalize the "has_pdf" state onto the Version model.

That seems suboptimal. Perhaps we could check whether the last build for a version built PDFs and was successful?

Would it also make sense to cache this lookup then?

Would it also make sense to cache this lookup then?

My guess is no. Checking whether any specific epub or pdf exists is not very common.
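For reference, the denormalization idea raised earlier in this thread could look roughly like the sketch below. `VersionRecord`, `record_build_outputs`, and `downloads_for` are hypothetical names, not part of this PR; the point is only that the flags are written once per build, so listing downloads for ~100 versions needs no storage calls.

```python
from dataclasses import dataclass


@dataclass
class VersionRecord:
    """Hypothetical stand-in for a Version row with denormalized flags."""
    slug: str
    has_pdf: bool = False
    has_epub: bool = False


def record_build_outputs(version, built_formats):
    """Update the flags once, at build time, from the formats a build produced."""
    version.has_pdf = "pdf" in built_formats
    version.has_epub = "epub" in built_formats


def downloads_for(versions):
    """Build the downloads listing from stored flags, with no storage calls."""
    listing = {}
    for v in versions:
        formats = [f for f, ok in (("pdf", v.has_pdf), ("epub", v.has_epub)) if ok]
        if formats:
            listing[v.slug] = formats
    return listing
```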

davidfischer · 2018-12-17T22:37:16Z

Putting things on the web queue will make them happen on one random web, putting them on web0x will make them happen on a specific web, and using the broadcast function will make them happen on all webs (e.g. queue on web0[1-3]).

Part of the issue is that some of the copying needs to happen on all webs (HTML, etc.) while some should only happen on a single web. I hate to handle things with special cases but maybe PDFs/ePubs need to be a separate task.

stale · 2019-02-01T10:59:19Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ericholscher · 2019-02-05T14:22:20Z

readthedocs/builds/syncers.py

+    def copy(cls, path, target, host, is_file=False, **kwargs):  # pylint: disable=arguments-differ
+        RemotePuller.copy(path, target, host, is_file, **kwargs)
+
+        if isinstance(storage, FileSystemStorage):


Is this the best check for whether or not we should execute this logic?

I used the same logic in 8a602c8 to test for this.

It is a bit of weird logic but I'm not sure what the alternative is. If the storage is local, we don't want to copy/delete the file. If it is remote, we do.

If we're running into this in multiple places, perhaps it would be better to subclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:

    class LocalFileSystemStorage(FileSystemStorage):
        can_sync = True

    if getattr(storage, 'can_sync', False):
        pass

I like that idea!

Actually thinking a bit more on this, I don't think it'll solve the problem. The issue is that things are very different if storage is backed by cloud blob storage vs. the local filesystem.

If storage is backed by cloud storage, only 1 web needs to sync any given PDF/ePub/zip to the cloud. Then, once the copy is done, the original local file can be deleted.

If storage is backed by the filesystem, then the files have already been synced to the right place by the superclass. We don't want to delete the local file as that's the file we're going to serve. We also don't want to do a copy to "storage" as that's a copy from/to the same file.

This is all regardless of what the default syncer/puller is.

Ok, thinking a bit more, maybe this is possible. We don't want to sync PDFs/ePubs for local filesystem storage at all (they're already synced by the puller/syncer). However, we could add a flag to the storage (AzureMediaStorage for us) specifically for this. That's probably better than doing an isinstance check.
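A minimal sketch of the opt-in flag idea, assuming a hypothetical attribute name `sync_build_media` and stand-in classes (neither is taken from this PR). The cloud-backed storage opts in explicitly, and any storage without the flag is left alone.

```python
class LocalStorage:
    """Stand-in for filesystem-backed storage: artifacts are already in place."""


class BlobStorage:
    """Stand-in for cloud-backed storage that opts in to receiving build media."""

    sync_build_media = True  # hypothetical flag name


def should_sync_build_media(storage):
    """Storages that do not declare the flag are treated as "do not sync"."""
    return getattr(storage, "sync_build_media", False)
```

Compared with an `isinstance` check, a subclass of the local storage cannot accidentally change the behaviour, and the intent is spelled out at the point of use.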

ericholscher · 2019-02-05T14:45:56Z

readthedocs/builds/syncers.py

+                storage.save(storage_path, fd)
+
+            # remove the original after copying
+            os.remove(target)


If we remove this for now, we should be able to deploy this, and keep a copy on the webs as well as in Azure? Seems like that might be a good idea for transitioning.
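The transitional behaviour suggested here (upload, but keep the local copy for now) could be sketched with an explicit switch. `DictStorage`, `upload_artifact`, and the `remove_local` parameter are hypothetical; only the `storage.save(path, fd)` call shape comes from the diff above.

```python
import os


class DictStorage:
    """Hypothetical stand-in for a storage backend with a save() method."""

    def __init__(self):
        self.files = {}

    def save(self, name, fd):
        self.files[name] = fd.read()
        return name


def upload_artifact(storage, target, storage_path, remove_local=False):
    """Copy a built artifact to storage, optionally deleting the local file.

    With remove_local=False the artifact lives both on the web and in
    storage, which is the safer setting while transitioning.
    """
    with open(target, "rb") as fd:
        storage.save(storage_path, fd)
    if remove_local:
        os.remove(target)
```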

ericholscher · 2019-02-05T14:52:26Z

readthedocs/projects/views/public.py

+        if storage.exists(storage_path):
+            return HttpResponseRedirect(storage.url(storage_path))
+
+        media_path = os.path.join(


Believe this logic is actually what the DefaultFileStorage will do, so I don't think this will ever get hit.

Yes it will. In the case where the storage is Azure storage but the PDF isn't yet on Azure (it's only locally on the webs) then this logic is used.

Good point. 👍

ericholscher · 2019-02-05T15:00:01Z

Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

https://github.com/rtfd/readthedocs.org/blob/b500528541ccd0cead927c03547487edcfa2c1da/readthedocs/projects/tasks.py#L910-L924

davidfischer · 2019-02-05T16:49:56Z

Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

I can try to take a look at that. I don't think it will be that big of a deal.

ericholscher · 2019-02-05T16:51:24Z

I can try to take a look at that. I don't think it will be that big of a deal.

The tricky thing is the split logic between doing it across all webs or just a single one. I think perhaps moving it into the move_files function with a delete argument is the easiest way to make it work.

ericholscher · 2019-02-12T16:50:00Z

I believe this is mostly ready to go. I think it would be 💯 to get it shipped this week.

agjohnson

I believe this approach looks good! Just to clarify, the RemotePuller subclass here will execute on a single web server, correct?

agjohnson · 2019-02-12T17:23:34Z

readthedocs/builds/syncers.py

+    def copy(cls, path, target, host, is_file=False, **kwargs):  # pylint: disable=arguments-differ
+        RemotePuller.copy(path, target, host, is_file, **kwargs)
+
+        if isinstance(storage, FileSystemStorage):


If we're running into this in multiple places, perhaps it would be better to subclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:

    class LocalFileSystemStorage(FileSystemStorage):
        can_sync = True

    if getattr(storage, 'can_sync', False):
        pass

agjohnson · 2019-02-12T17:26:42Z

readthedocs/projects/models.py

+            version_slug,
+            self.slug,
+            extension,
+        )


Are we strictly talking filesystem path here, or URL path, or both? If filesystem, this should use os.path.join()

This is a storage check, not a filesystem check.
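To illustrate why `os.path.join()` is not the right tool for a storage key, here is a sketch that always joins with forward slashes. The directory layout shown is an assumption for illustration only (the full format string is not visible in the diff above); the `htmlzip` to `zip` extension mapping does appear in the diff.

```python
import posixpath


def get_storage_path(project_slug, version_slug, type_):
    """Build a storage key; the layout here is hypothetical.

    posixpath.join always uses "/" regardless of the host OS, which is
    what blob storage keys and URLs expect.
    """
    extension = type_.replace("htmlzip", "zip")
    filename = "%s.%s" % (project_slug, extension)
    return posixpath.join(type_, project_slug, version_slug, filename)
```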

agjohnson · 2019-02-12T17:32:15Z

readthedocs/projects/views/public.py

-            '%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')))
-        return HttpResponseRedirect(path)
+        storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug)
+        if storage.exists(storage_path):


Would it also make sense to cache this lookup then?

davidfischer · 2019-02-13T00:09:39Z

Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

Just so I'm 100% clear on this, this code is taken when a new build occurs and the user has disabled PDF/ePub building? So we want to delete the PDF/ePub so it isn't old, correct?

ericholscher · 2019-02-13T12:50:26Z

Just so I'm 100% clear on this, this code is taken when a new build occurs and the user has disabled PDF/ePub building? So we want to delete the PDF/ePub so it isn't old, correct?

Yep. When users turn off epub/pdf on their builds, we should stop serving the old ones, otherwise they get out of date really quickly and confuse users.
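The cleanup rule described here can be sketched as follows. `SetStorage` and `remove_disabled_artifacts` are hypothetical names and the paths are made up; only the `exists()`/`delete()` method names follow the Django storage API.

```python
class SetStorage:
    """Hypothetical stand-in for a storage backend holding a set of names."""

    def __init__(self, names):
        self.names = set(names)

    def exists(self, name):
        return name in self.names

    def delete(self, name):
        self.names.discard(name)


def remove_disabled_artifacts(storage, paths_by_format, enabled_formats):
    """Delete stored artifacts whose format the project no longer builds.

    Returns the list of paths that were actually removed.
    """
    removed = []
    for fmt, path in sorted(paths_by_format.items()):
        if fmt not in enabled_formats and storage.exists(path):
            storage.delete(path)
            removed.append(path)
    return removed
```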

davidfischer · 2019-02-13T20:11:08Z

readthedocs/projects/tasks.py

-                version_slug=version.slug,
-            ),
-        ])
+    if delete:


I think there's an issue around this. While I was testing the removal feature, this flag was always False. I'm actually not 100% sure why we need this flag at all. Wouldn't we always want to remove old PDFs/ePubs?

It should only be true when being called w/ the broadcast. I guess we could delete the old ones before replacing them with the new ones, but that seems like wasted effort, but arguably could be done.

I guess we could delete the old ones before replacing them with the new ones, but that seems like wasted effort, but arguably could be done.

For storage, you have to do this. There isn't an "overwrite" option.

It should only be true when being called w/ the broadcast.

Is the implication here that this won't happen when run locally? I wasn't able to successfully test deleting artifacts when PDF/Epubs are disabled locally.

Edit: I tested it, but it didn't work correctly.
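The "no overwrite" point can be illustrated with a stand-in. Django's base `Storage.save()` picks an alternative name when the target already exists rather than replacing it, although individual backends may behave differently; `NoOverwriteStorage` below only imitates that behaviour with a made-up suffix scheme, and `replace_file` is a hypothetical helper.

```python
class NoOverwriteStorage:
    """Stand-in that never overwrites: saving an existing name stores the
    content under an alternative name instead (suffix scheme is made up)."""

    def __init__(self):
        self.files = {}

    def exists(self, name):
        return name in self.files

    def delete(self, name):
        self.files.pop(name, None)

    def save(self, name, content):
        final = name
        counter = 1
        while final in self.files:
            final = "%s_%d" % (name, counter)
            counter += 1
        self.files[final] = content
        return final


def replace_file(storage, name, content):
    """Delete any existing file first so the new one keeps the same name."""
    if storage.exists(name):
        storage.delete(name)
    return storage.save(name, content)
```

Without the delete step, a rebuild would leave the old artifact under the original name and the download link would keep serving it.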

ericholscher

Tested this locally and it works 👍

My only thought is whether we want to commingle the build output container, which holds internal cold storage JSON, with the media files. I wonder if we should create another class in ext and use that in prod, so that we can give it its own container?

Did that here: https://github.com/rtfd/readthedocs-ext/pull/226

davidfischer · 2019-02-26T17:42:03Z

My only thought is whether we want to commingle the build output container, which holds internal cold storage JSON, with the media files. I wonder if we should create another class in ext and use that in prod, so that we can give it its own container?

I think media having its own container is good. I merged it.

davidfischer · 2019-02-26T17:45:19Z

As the author, I can't approve my own PR but I tested your changes and they work for me as well. I did a build with PDFs/ePubs, saw them in blob storage, turned off PDFs/ePubs, ran a build and saw that those files were deleted.

One thing about this PR is that there is a lot of extra logic to selectively write to blob storage (PDFs, ePubs, but not HTML) and some extra logic to delete files that should be deleted (we should run the delete from 1 web, not all, to avoid unnecessary duplication). I suspect that this whole process would be simpler if we just wrote all build output to blob storage and didn't have these special cases. I'm looking forward to those enhancements.

ericholscher · 2019-02-26T17:58:03Z

One thing about this PR is that there is a lot of extra logic to selectively write to blob storage (PDFs, ePubs, but not HTML) and some extra logic to delete files that should be deleted (we should run the delete from 1 web, not all, to avoid unnecessary duplication). I suspect that this whole process would be simpler if we just wrote all build output to blob storage and didn't have these special cases. I'm looking forward to those enhancements.

Agreed. It would be pretty trivial to just sync everything for now, but keep all of it syncing to the webs, if we wanted. That way the HTML & JSON is there when we're ready for it.

ericholscher · 2019-02-26T18:14:21Z

Merging this, and we can iterate on it once we have more experience w/ it in prod.

Store ePubs and PDFs in media storage

8f9576d

davidfischer added the PR: work in progress Pull request is not ready for full review label Dec 3, 2018

davidfischer requested a review from a team December 3, 2018 21:16

davidfischer commented Dec 4, 2018

Merge branch 'master' into davidfischer/storage-epubs-pdfs-zips

130bc2e

Add basic tests

7a2216e

stale bot added the Status: stale Issue will be considered inactive soon label Feb 1, 2019

ericholscher added 3 commits February 5, 2019 10:30

Merge remote-tracking branch 'origin/master' into davidfischer/storag…

302b3fd

…e-epubs-pdfs-zips

Fix merge error

c3fa90e

Fix another merge error

b500528

ericholscher added the Accepted Accepted issue on our roadmap label Feb 5, 2019

stale bot removed the Status: stale Issue will be considered inactive soon label Feb 5, 2019

ericholscher reviewed Feb 5, 2019

ericholscher added 2 commits February 5, 2019 12:13

Send move_files task to only one web when uploading to Azure

8a602c8

Handle deleting on the webs in the normal case

6edacad

davidfischer removed the PR: work in progress Pull request is not ready for full review label Feb 12, 2019

agjohnson reviewed Feb 12, 2019

Improved filesystem storage check

03787f4

- Basically we need to explicitly tell the storage to sync build media

davidfischer added 2 commits February 13, 2019 11:21

Merge branch 'master' into davidfischer/storage-epubs-pdfs-zips

a7db138

Remove old PDFs/ePubs when they are no longer built

4fab169

davidfischer commented Feb 13, 2019

ericholscher added 4 commits February 25, 2019 17:14

Merge remote-tracking branch 'origin/master' into davidfischer/storag…

9ac1e53

…e-epubs-pdfs-zips

Remove delete logic

5e9597a

Fix delete logic with a comment

440567a

Fix local file delete logic with include_file=False

9ca62db

ericholscher previously approved these changes Feb 25, 2019

Lint

f97833b

ericholscher dismissed their stale review via f97833b February 25, 2019 21:26

ericholscher added 2 commits February 25, 2019 18:36

Fix up lint error on docstring

8c14af7

Don’t delete to deleting unsynced media.

34306ad

Only do it when we really mean it.

ericholscher requested a review from a team February 26, 2019 13:42

ericholscher previously approved these changes Feb 26, 2019

Fix tests for fix around deleting htmlzip

76456ab

ericholscher dismissed their stale review via 76456ab February 26, 2019 16:34

ericholscher approved these changes Feb 26, 2019

ericholscher merged commit cca41b7 into master Feb 26, 2019

delete-merged-branch bot deleted the davidfischer/storage-epubs-pdfs-zips branch February 26, 2019 18:14

stsewd mentioned this pull request Mar 28, 2019

Artifacts aren't cleaned when changing from Sphinx to Mkdocs #4764

Closed

davidfischer mentioned this pull request Mar 29, 2019

Write build artifacts to (cloud) storage from build servers #5549

Merged
