Skip to content

Store ePubs and PDFs in media storage #4947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Feb 26, 2019

Conversation

davidfischer
Copy link
Contributor

@davidfischer davidfischer commented Dec 3, 2018

Adds a syncer that stores PDFs, ePubs and zip files in media storage (Azure storage, S3, etc.). Media storage is set by DEFAULT_FILE_STORAGE although the syncer ignores this if the setting is backed by the local filesystem (which is the Django default).

This should allow us to move ~1-2TB of PDFs/ePubs/zips off to blob storage rather than being on the local webs. The download links now check if the item is in storage and fall back to checking disk. This will allow us to deploy the change without moving all the files. When a build for a pre-existing version happens, the result PDF/ePub/zip will be copied to media storage and then deleted off the local web.

Overall, this concept where some build artifacts are stored in storage and others aren't is a bit weird (rather than being all on storage or all on web) but given that's what we want I'm not sure what the alternative is.

Testing this is somewhat challenging and I'm open to suggestions on making that part of this better.

@davidfischer davidfischer added the PR: work in progress Pull request is not ready for full review label Dec 3, 2018
@davidfischer davidfischer requested a review from a team December 3, 2018 21:16
@codecov
Copy link

codecov bot commented Dec 3, 2018

Codecov Report

Merging #4947 into master will decrease coverage by 0.12%.
The diff coverage is 48.21%.

@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
- Coverage   76.93%   76.81%   -0.13%     
==========================================
  Files         158      158              
  Lines       10039    10084      +45     
  Branches     1259     1265       +6     
==========================================
+ Hits         7724     7746      +22     
- Misses       1981     2003      +22     
- Partials      334      335       +1
Impacted Files Coverage Δ
readthedocs/projects/tasks.py 71.2% <0%> (ø) ⬆️
readthedocs/projects/models.py 85.77% <100%> (+0.32%) ⬆️
readthedocs/builds/syncers.py 34.86% <27.58%> (-2.64%) ⬇️
readthedocs/projects/views/public.py 69.18% <71.42%> (-0.28%) ⬇️

@codecov
Copy link

codecov bot commented Dec 3, 2018

Codecov Report

Merging #4947 into master will increase coverage by 0.06%.
The diff coverage is 48.21%.

@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
+ Coverage   76.74%   76.81%   +0.06%     
==========================================
  Files         158      158              
  Lines        9951    10084     +133     
  Branches     1244     1265      +21     
==========================================
+ Hits         7637     7746     +109     
- Misses       1980     2003      +23     
- Partials      334      335       +1
Impacted Files Coverage Δ
readthedocs/projects/tasks.py 71.2% <0%> (ø) ⬆️
readthedocs/projects/models.py 85.77% <100%> (+0.21%) ⬆️
readthedocs/builds/syncers.py 34.86% <27.58%> (-2.64%) ⬇️
readthedocs/projects/views/public.py 69.18% <71.42%> (-0.28%) ⬇️
readthedocs/restapi/views/model_views.py 94.18% <0%> (-0.94%) ⬇️
readthedocs/doc_builder/python_environments.py 82.97% <0%> (-0.47%) ⬇️
readthedocs/builds/models.py 78.41% <0%> (-0.42%) ⬇️
readthedocs/gold/views.py 66.17% <0%> (ø) ⬆️
readthedocs/projects/constants.py 100% <0%> (ø) ⬆️
... and 7 more

@ericholscher
Copy link
Member

Sending this sync command to only a single web as opposed to all of them. I don't really know how best to do this.

Putting things on the web queue will make them happen on one random web, putting them on web0xwill make them happen on a specific web, and using the broadcast function will make them happen on all webs (eg queue on web0[1-3])

@ericholscher
Copy link
Member

Testing this is somewhat challenging and I'm open to suggestions on making that part of this better.

I think we should be able to at least test the logic around redirecting and checking for storage along with local files. The actual build process stuff is definitely a bit trickier.

'%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')))
return HttpResponseRedirect(path)
storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug)
if storage.exists(storage_path):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the item is not in storage but is on disk, this will result in an extra network call. storage.exists results in an API call to Azure (or wherever storage is backed). However, it should be reasonably fast since it's in the same data center. In the case where media is backed by the local file system, this is just an os.path.isfile so not a big deal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine, and hopefully we will get files into the storage quickly. I'd be 👍 on syncing all the media files after we ship the code, so that we rarely hit this case, as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did run into an issue locally, where I have a project with ~100 active versions. It required 100 calls to Azure in order to load the project downloads page. This feels less than ideal -- I think the only way to get around it is to denormalize the "has_pdf" state onto the Version model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems suboptimal. Perhaps we could check whether the last build for a version built PDFs and was successful?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also make sense to cache this lookup then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also make sense to cache this lookup then?

My guess is no. Checking whether any specific epub or pdf exists is not very common.

@davidfischer
Copy link
Contributor Author

Putting things on the web queue will make them happen on one random web, putting them on web0xwill make them happen on a specific web, and using the broadcast function will make them happen on all webs (eg queue on web0[1-3])

Part of the issue is that some of the copying needs to happen on all webs (HTML, etc.) while some should only happen on a single web. I hate to handle things with special cases but maybe PDFs/ePubs need to be a separate task.

@stale
Copy link

stale bot commented Feb 1, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Feb 1, 2019
@ericholscher ericholscher added the Accepted Accepted issue on our roadmap label Feb 5, 2019
@stale stale bot removed the Status: stale Issue will be considered inactive soon label Feb 5, 2019
def copy(cls, path, target, host, is_file=False, **kwargs): # pylint: disable=arguments-differ
RemotePuller.copy(path, target, host, is_file, **kwargs)

if isinstance(storage, FileSystemStorage):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the best check for whether or not we should execute this logic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the same logic in 8a602c8 to test for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit of weird logic but I'm not sure what the alternative is. If the storage is local, we don't want to copy/delete the file. If it is, we do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're running into this in multiple places, perhaps it would be better to subsclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:

class LocalFileSystemStorage(FileSystemStorage):
    can_sync = True

if getattr(storage, 'can_sync', False):
    pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually thinking a bit more on this, I don't think it'll solve the problem. The issue is that things are very different if storage is backed by cloud blob storage vs. the local filesystem.

If storage is backed by cloud storage, only 1 web needs to sync any given PDF/Epub/zip to the cloud. Then, if once the copy is done, the original local file can be deleted.

If storage is backed by the filesystem, then the files have already been synced to the right place by the super class. We don't want to delete the local file as that's the file we're doing to serve. We also don't want to do a copy to "storage" as that's a copy from/to the same file.

This is all regardless of what the default syncer/puller is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thinking a bit more, maybe this is possible. We don't want to sync PDFs/ePubs for local filesystem storage at all (they're already synced by the puller/syncer). However, we could add a flag to the storage (AzureMediaStorage for us) specifically for this. That's probably better than doing an ininstance check.

storage.save(storage_path, fd)

# remove the original after copying
os.remove(target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove this for now, we should be able to deploy this, and keep a copy on the webs as well as in Azure? Seems like that might be a good idea for transitioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

if storage.exists(storage_path):
return HttpResponseRedirect(storage.url(storage_path))

media_path = os.path.join(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe this logic is actually what the DefaultFileStorage will do, so I don't think this will ever get hit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it will. In the case where the storage is Azure storage but the PDF isn't yet on Azure (it's only locally on the webs) then this logic is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. 👍

@ericholscher
Copy link
Member

Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

https://github.com/rtfd/readthedocs.org/blob/b500528541ccd0cead927c03547487edcfa2c1da/readthedocs/projects/tasks.py#L910-L924

@davidfischer
Copy link
Contributor Author

Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

I can try to take a look at that. I don't think it will be that big of a deal.

@ericholscher
Copy link
Member

I can try to take a look at that. I don't think it will be that big of a deal.

The tricky thing is the split logic between doing it across all webs or just a single one. I think perhaps moving it into the move_files function with a delete argument is the easiest way to make it work.

@ericholscher
Copy link
Member

I believe this is mostly ready to go. I think it would be 💯 to get it shipped this week.

@davidfischer davidfischer removed the PR: work in progress Pull request is not ready for full review label Feb 12, 2019
Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this approach looks good! Just to clarify, the RemotePuller subclass here will execute on a single web server, correct?

def copy(cls, path, target, host, is_file=False, **kwargs): # pylint: disable=arguments-differ
RemotePuller.copy(path, target, host, is_file, **kwargs)

if isinstance(storage, FileSystemStorage):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're running into this in multiple places, perhaps it would be better to subsclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:

class LocalFileSystemStorage(FileSystemStorage):
    can_sync = True

if getattr(storage, 'can_sync', False):
    pass

version_slug,
self.slug,
extension,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we strictly talking filesystem path here, or URL path, or both? If filesystem, this should use os.path.join()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a storage check, not a filesystem check.

'%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')))
return HttpResponseRedirect(path)
storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug)
if storage.exists(storage_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also make sense to cache this lookup then?

@davidfischer
Copy link
Contributor Author

Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

Just so I'm 100% clear on this, this code is taken when a new build occurs and the user has disabled PDF/ePub building? So we want to delete the PDF/ePub so it isn't old, correct?

- Basically we need to explicitly tell the storage to sync build media
@ericholscher
Copy link
Member

Just so I'm 100% clear on this, this code is taken when a new build occurs and the user has disabled PDF/ePub building? So we want to delete the PDF/ePub so it isn't old, correct?

Yep. When users turn off epub/pdf on their builds, we should stop serving the old ones, otherwise they get out of date really quickly and confuse users.

version_slug=version.slug,
),
])
if delete:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an issue around this. While I was testing the removal feature, this flag was always False. I'm actually not 100% sure why we need this flag at all. Wouldn't we always want to remove old PDFs/ePubs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should only be true when being called w/ the broadcast. I guess we could delete the old ones before replacing them with the new ones, but that seems like wasted effort, but arguably could be done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could delete the old ones before replacing them with the new ones, but that seems like wasted effort, but arguably could be done.

For storage, you have to do this. There isn't an "overwrite" option.

Copy link
Contributor Author

@davidfischer davidfischer Feb 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should only be true when being called w/ the broadcast.

Is the implication here that this won't happen when run locally? I wasn't able to successfully test deleting artifacts when PDF/Epubs are disabled locally.

Edit: I tested it, but it didn't work correctly.

ericholscher
ericholscher previously approved these changes Feb 25, 2019
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this locally and it works 👍

My only thought is if we want to comingle the build output container with internal cold storage JSON with the media files. I wonder if we should create another class in ext and use that in prod? So that we can give it it's own container?

Did that here: https://github.com/rtfd/readthedocs-ext/pull/226

@ericholscher ericholscher requested a review from a team February 26, 2019 13:42
ericholscher
ericholscher previously approved these changes Feb 26, 2019
@davidfischer
Copy link
Contributor Author

My only thought is if we want to comingle the build output container with internal cold storage JSON with the media files. I wonder if we should create another class in ext and use that in prod? So that we can give it it's own container?

I think media having its own container is good. I merged it.

@davidfischer
Copy link
Contributor Author

As the author, I can't approve my own PR but I tested your changes and they work for me as well. I did a build with PDFs/ePubs, saw them in blob storage, turned off PDFs/ePubs, ran a build and saw that those files were deleted.

One thing about this PR is that there is a lot of extra logic to selectively write to blob storage (PDFs, ePubs, but not HTML) and some extra logic to delete files that should be deleted (we should run the delete from 1 web not all to avoid unnecessary duplication). I suspect that this whole process would be simpler we just wrote all build output to blob storage and didn't have these special cases. I'm looking forward to those enhancements.

@ericholscher
Copy link
Member

One thing about this PR is that there is a lot of extra logic to selectively write to blob storage (PDFs, ePubs, but not HTML) and some extra logic to delete files that should be deleted (we should run the delete from 1 web not all to avoid unnecessary duplication). I suspect that this whole process would be simpler we just wrote all build output to blob storage and didn't have these special cases. I'm looking forward to those enhancements.

Agreed. It would be pretty trivial to just sync everything for now, but keep all of it syncing to the webs, if we wanted. That way the HTML & JSON is there when we're ready for it.

@ericholscher
Copy link
Member

Merging this, and we can iterate on it once we have more experience w/ it in prod.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants