Store ePubs and PDFs in media storage #4947
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
- Coverage   76.93%   76.81%   -0.13%
==========================================
  Files         158      158
  Lines       10039    10084      +45
  Branches     1259     1265       +6
==========================================
+ Hits         7724     7746      +22
- Misses       1981     2003      +22
- Partials      334      335       +1
```

Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
+ Coverage   76.74%   76.81%   +0.06%
==========================================
  Files         158      158
  Lines        9951    10084     +133
  Branches     1244     1265      +21
==========================================
+ Hits         7637     7746     +109
- Misses       1980     2003      +23
- Partials      334      335       +1
```
Putting things on the …
I think we should be able to at least test the logic around redirecting and checking for storage along with local files. The actual build process stuff is definitely a bit trickier.
```python
    '%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')))
return HttpResponseRedirect(path)
storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug)
if storage.exists(storage_path):
```
In the case where the item is not in storage but is on disk, this will result in an extra network call: `storage.exists` results in an API call to Azure (or wherever storage is backed). However, it should be reasonably fast since it's in the same data center. In the case where media is backed by the local filesystem, this is just an `os.path.isfile`, so it's not a big deal.
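For reference, a minimal sketch of the lookup order being discussed, based on the diff above (`get_production_media_path` and `get_production_media_url` are assumed helper names for the on-disk fallback):

```python
import os

from django.core.files.storage import get_storage_class
from django.http import Http404, HttpResponseRedirect

storage = get_storage_class()()


def serve_download(version, type_, version_slug):
    # One storage.exists() call: an Azure API request when storage is
    # cloud-backed, a plain os.path.isfile() for FileSystemStorage.
    storage_path = version.project.get_storage_path(
        type_=type_, version_slug=version_slug,
    )
    if storage.exists(storage_path):
        return HttpResponseRedirect(storage.url(storage_path))

    # Fall back to artifacts still sitting on the local web servers
    # (helper names below are assumptions, not the actual API).
    local_path = version.project.get_production_media_path(
        type_=type_, version_slug=version_slug,
    )
    if os.path.exists(local_path):
        return HttpResponseRedirect(
            version.project.get_production_media_url(type_, version_slug)
        )
    raise Http404()
```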
That seems fine, and hopefully we will get files into storage quickly. I'd also be 👍 on syncing all the media files after we ship the code, so that we rarely hit this case.
I did run into an issue locally, where I have a project with ~100 active versions. It required 100 calls to Azure in order to load the project downloads page. This feels less than ideal -- I think the only way to get around it is to denormalize the "has_pdf" state onto the Version model.
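A rough sketch of what the denormalization might look like (the field and helper names here are hypothetical): set the flags once per build, rather than making one storage call per version on every downloads-page load.

```python
from django.db import models


class Version(models.Model):
    # ... existing fields ...
    # Hypothetical denormalized flags, updated at build time.
    has_pdf = models.BooleanField(default=False)
    has_epub = models.BooleanField(default=False)


def record_media_availability(version, pdf_built, epub_built):
    # Called once when a build finishes, so the downloads page can
    # render ~100 versions with zero storage.exists() calls.
    version.has_pdf = pdf_built
    version.has_epub = epub_built
    version.save(update_fields=['has_pdf', 'has_epub'])
```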
That seems suboptimal. Perhaps we could check whether the last build for a version built PDFs and was successful?
Would it also make sense to cache this lookup then?
> Would it also make sense to cache this lookup then?
My guess is no. Checking whether any specific epub or pdf exists is not very common.
Part of the issue is that some of the copying needs to happen on all webs (HTML, etc.) while some should only happen on a single web. I hate to handle things with special cases, but maybe PDFs/ePubs need to be a separate task.
readthedocs/builds/syncers.py (outdated)

```python
def copy(cls, path, target, host, is_file=False, **kwargs):  # pylint: disable=arguments-differ
    RemotePuller.copy(path, target, host, is_file, **kwargs)

    if isinstance(storage, FileSystemStorage):
```
Is this the best check for whether or not we should execute this logic?
I used the same logic in 8a602c8 to test for this.
It is a bit of weird logic, but I'm not sure what the alternative is. If the storage is local, we don't want to copy/delete the file; if it's remote, we do.
If we're running into this in multiple places, perhaps it would be better to subclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:

```python
class LocalFileSystemStorage(FileSystemStorage):
    can_sync = True

if getattr(storage, 'can_sync', False):
    pass
```
I like that idea!
Actually thinking a bit more on this, I don't think it'll solve the problem. The issue is that things are very different if storage is backed by cloud blob storage vs. the local filesystem.
If storage is backed by cloud storage, only one web needs to sync any given PDF/ePub/zip to the cloud. Then, once the copy is done, the original local file can be deleted.
If storage is backed by the filesystem, then the files have already been synced to the right place by the superclass. We don't want to delete the local file, as that's the file we're going to serve. We also don't want to do a copy to "storage", as that's a copy from/to the same file.
This is all regardless of what the default syncer/puller is.
Ok, thinking a bit more, maybe this is possible. We don't want to sync PDFs/ePubs for local filesystem storage at all (they're already synced by the puller/syncer). However, we could add a flag to the storage (AzureMediaStorage for us) specifically for this. That's probably better than doing an `isinstance` check.
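Something like the following is what I have in mind (the flag name is just an example): the cloud-backed class opts in explicitly, and the syncer checks the flag instead of the concrete type.

```python
from storages.backends.azure_storage import AzureStorage


class AzureMediaStorage(AzureStorage):
    # Explicit opt-in: only cloud-backed storage should have build
    # media synced into it (and the local copy removed afterwards).
    write_build_media = True


# In the syncer, replacing the isinstance() check:
if getattr(storage, 'write_build_media', False):
    # copy the PDF/ePub/zip into storage, then delete the local file
    ...
```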
readthedocs/builds/syncers.py (outdated)

```python
storage.save(storage_path, fd)

# remove the original after copying
os.remove(target)
```
If we remove this for now, we should be able to deploy this, and keep a copy on the webs as well as in Azure? Seems like that might be a good idea for transitioning.
Agreed.
```python
if storage.exists(storage_path):
    return HttpResponseRedirect(storage.url(storage_path))

media_path = os.path.join(
```
I believe this logic is actually what the DefaultFileStorage will do, so I don't think this will ever get hit.
Yes it will. In the case where the storage is Azure storage but the PDF isn't yet on Azure (it's only locally on the webs) then this logic is used.
Good point. 👍
Looks like we likely have an issue deleting the existing PDF/epub files. I don't believe we updated this logic:

I can try to take a look at that. I don't think it will be that big of a deal.

The tricky thing is the split logic between doing it across all webs or just a single one. I think perhaps moving it into the …

I believe this is mostly ready to go. I think it would be 💯 to get it shipped this week.
I believe this approach looks good! Just to clarify, the RemotePuller subclass here will execute on a single web server, correct?
```python
    version_slug,
    self.slug,
    extension,
)
```
Are we strictly talking filesystem path here, or URL path, or both? If filesystem, this should use os.path.join()
This is a storage check, not a filesystem check.
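To illustrate the distinction: storage names are '/'-delimited keys rather than OS paths, so a plain string join is intentional here. A sketch (the exact key layout is an assumption based on the excerpt above):

```python
def get_storage_path(self, type_, version_slug):
    # Storage keys always use '/', regardless of the host OS, so
    # os.path.join() would be wrong on Windows and misleading anywhere.
    extension = type_.replace('htmlzip', 'zip')
    return '{}/{}/{}.{}'.format(
        type_,           # e.g. 'pdf', 'epub'
        version_slug,
        self.slug,
        extension,
    )
```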
Just so I'm 100% clear on this: this code path is taken when a new build occurs and the user has disabled PDF/ePub building? So we want to delete the PDF/ePub so it isn't old, correct?
- Basically we need to explicitly tell the storage to sync build media
Yep. When users turn off epub/pdf on their builds, we should stop serving the old ones; otherwise they get out of date really quickly and confuse users.
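For the record, the deletion itself is small. A sketch of what removing stale media from storage could look like (the helper name is hypothetical; `storage` is the default storage instance):

```python
def remove_stale_media(version, types=('pdf', 'epub', 'htmlzip')):
    # When a user disables a format, delete the old artifact from
    # storage so the download links can't serve an out-of-date file.
    for type_ in types:
        storage_path = version.project.get_storage_path(
            type_=type_, version_slug=version.slug,
        )
        if storage.exists(storage_path):
            storage.delete(storage_path)
```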
readthedocs/projects/tasks.py (outdated)

```python
        version_slug=version.slug,
    ),
])
if delete:
```
I think there's an issue around this. While I was testing the removal feature, this flag was always `False`. I'm actually not 100% sure why we need this flag at all. Wouldn't we always want to remove old PDFs/ePubs?
It should only be true when being called w/ the broadcast. I guess we could delete the old ones before replacing them with the new ones; that seems like wasted effort, but arguably it could be done.
> I guess we could delete the old ones before replacing them with the new ones; that seems like wasted effort, but arguably it could be done.

For storage, you have to do this. There isn't an "overwrite" option.
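Right: Django's `Storage.save()` resolves name collisions by generating an alternative name instead of overwriting, so replacing an artifact means deleting first. A minimal sketch:

```python
# Storage.save() would write to e.g. 'project_XYZ123.pdf' on a name
# collision rather than replacing the old file, so delete explicitly.
if storage.exists(storage_path):
    storage.delete(storage_path)
with open(local_path, 'rb') as fd:
    storage.save(storage_path, fd)
```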
> It should only be true when being called w/ the broadcast.

Is the implication here that this won't happen when run locally? I wasn't able to successfully test deleting artifacts when PDFs/ePubs are disabled locally.

Edit: I tested it, but it didn't work correctly.
Tested this locally and it works 👍
My only thought is whether we want to commingle the media files with the build output container that holds internal cold storage JSON. I wonder if we should create another class in ext and use that in prod, so that we can give it its own container?
Did that here: https://github.com/rtfd/readthedocs-ext/pull/226
Only do it when we really mean it.
I think media having its own container is good. I merged it.
As the author, I can't approve my own PR, but I tested your changes and they work for me as well. I did a build with PDFs/ePubs, saw them in blob storage, turned off PDFs/ePubs, ran a build, and saw that those files were deleted. One thing about this PR is that there is a lot of extra logic to selectively write to blob storage (PDFs, ePubs, but not HTML) and some extra logic to delete files that should be deleted (we should run the delete from one web, not all, to avoid unnecessary duplication). I suspect that this whole process would be simpler if we just wrote all build output to blob storage and didn't have these special cases. I'm looking forward to those enhancements.
Agreed. It would be pretty trivial to just sync everything for now, but keep all of it syncing to the webs, if we wanted. That way the HTML & JSON is there when we're ready for it.
Merging this, and we can iterate on it once we have more experience w/ it in prod.
Adds a syncer that stores PDFs, ePubs, and zip files in media storage (Azure storage, S3, etc.). Media storage is set by DEFAULT_FILE_STORAGE, although the syncer ignores this if the setting is backed by the local filesystem (which is the Django default).

This should allow us to move ~1-2TB of PDFs/ePubs/zips off to blob storage rather than keeping them on the local webs. The download links now check whether the item is in storage and fall back to checking disk, which will allow us to deploy the change without moving all the files first. When a build for a pre-existing version happens, the resulting PDF/ePub/zip will be copied to media storage and then deleted off the local web.

Overall, this concept where some build artifacts are stored in storage and others aren't is a bit weird (rather than everything being on storage or everything on the webs), but given that's what we want, I'm not sure what the alternative is.

Testing this is somewhat challenging, and I'm open to suggestions on making that part of this better.
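One option for the redirect logic, at least, is to patch the module-level storage object so tests never hit a real backend. A sketch (the module path and URL pattern are assumptions):

```python
from unittest import mock

from django.test import TestCase


class TestDownloadRedirect(TestCase):
    @mock.patch('readthedocs.projects.views.public.storage')
    def test_redirects_to_media_storage(self, storage_mock):
        # Pretend the PDF has already been synced to blob storage.
        storage_mock.exists.return_value = True
        storage_mock.url.return_value = (
            'https://media.example.com/pdf/myproject/latest/myproject.pdf'
        )
        resp = self.client.get(
            '/projects/myproject/downloads/pdf/latest/'  # assumed URL
        )
        self.assertEqual(resp.status_code, 302)
        self.assertIn('media.example.com', resp['Location'])
```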