-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Store ePubs and PDFs in media storage #4947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
8f9576d
130bc2e
7a2216e
302b3fd
c3fa90e
b500528
8a602c8
6edacad
03787f4
a7db138
4fab169
9ac1e53
5e9597a
440567a
9ca62db
f97833b
8c14af7
34306ad
76456ab
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,12 +14,15 @@ | |
|
||
from builtins import object | ||
from django.conf import settings | ||
from django.core.exceptions import SuspiciousFileOperation | ||
from django.core.files.storage import get_storage_class, FileSystemStorage | ||
|
||
from readthedocs.core.utils.extend import SettingsOverrideObject | ||
from readthedocs.core.utils import safe_makedirs | ||
|
||
|
||
log = logging.getLogger(__name__) | ||
storage = get_storage_class()() | ||
|
||
|
||
class BaseSyncer(object): | ||
|
@@ -43,6 +46,11 @@ def copy(cls, path, target, is_file=False, **kwargs): | |
return | ||
if os.path.exists(target): | ||
os.remove(target) | ||
|
||
# Create containing directory if it doesn't exist | ||
directory = os.path.dirname(target) | ||
safe_makedirs(directory) | ||
|
||
shutil.copy2(path, target) | ||
else: | ||
if os.path.exists(target): | ||
|
@@ -138,6 +146,10 @@ def copy(cls, path, target, host, is_file=False, **kwargs): # pylint: disable=a | |
log.info("Remote Pull %s to %s", path, target) | ||
if not is_file and not os.path.exists(target): | ||
safe_makedirs(target) | ||
if is_file: | ||
# Create containing directory if it doesn't exist | ||
directory = os.path.dirname(target) | ||
safe_makedirs(directory) | ||
# Add a slash when copying directories | ||
sync_cmd = "rsync -e 'ssh -T' -av --delete {user}@{host}:{path} {target}".format( | ||
host=host, | ||
|
@@ -154,6 +166,63 @@ def copy(cls, path, target, host, is_file=False, **kwargs): # pylint: disable=a | |
) | ||
|
||
|
||
class SelectiveStorageRemotePuller(RemotePuller): | ||
|
||
""" | ||
Exactly like RemotePuller except that certain files are copied via Django's storage system | ||
|
||
If a file with extensions specified by ``extensions`` is copied, it will be copied to storage | ||
and the original is removed. | ||
|
||
See: https://docs.djangoproject.com/en/1.11/ref/settings/#std:setting-DEFAULT_FILE_STORAGE | ||
""" | ||
|
||
extensions = ('.pdf', '.epub', '.zip') | ||
|
||
@classmethod | ||
def get_storage_path(cls, path): | ||
""" | ||
Gets the path to the file within the storage engine | ||
|
||
For example, if the path was $MEDIA_ROOT/pdfs/latest.pdf | ||
the storage_path is 'pdfs/latest.pdf' | ||
|
||
:raises: SuspiciousFileOperation if the path isn't under settings.MEDIA_ROOT | ||
""" | ||
path = os.path.normpath(path) | ||
if not path.startswith(settings.MEDIA_ROOT): | ||
raise SuspiciousFileOperation | ||
|
||
path = path.replace(settings.MEDIA_ROOT, '').lstrip('/') | ||
return path | ||
|
||
@classmethod | ||
def copy(cls, path, target, host, is_file=False, **kwargs): # pylint: disable=arguments-differ | ||
RemotePuller.copy(path, target, host, is_file, **kwargs) | ||
|
||
if isinstance(storage, FileSystemStorage): | ||
# This is a sanity check for the case where | ||
# storage is backed by the local filesystem | ||
# In that case, removing the original target file locally | ||
# would remove the file from storage as well | ||
return | ||
|
||
if is_file and os.path.exists(target) and \ | ||
any([target.lower().endswith(ext) for ext in cls.extensions]): | ||
log.info("Selective Copy %s to media storage", target) | ||
|
||
storage_path = cls.get_storage_path(target) | ||
|
||
if storage.exists(storage_path): | ||
storage.delete(storage_path) | ||
|
||
with open(target, 'rb') as fd: | ||
storage.save(storage_path, fd) | ||
|
||
# remove the original after copying | ||
os.remove(target) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we remove this for now, we should be able to deploy this, and keep a copy on the webs as well as in Azure? Seems like that might be a good idea for transitioning. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. |
||
|
||
|
||
class Syncer(SettingsOverrideObject): | ||
_default_class = LocalSyncer | ||
_override_setting = 'FILE_SYNCER' |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,7 @@ | |
from django.conf import settings | ||
from django.contrib.auth.models import User | ||
from django.core.urlresolvers import NoReverseMatch, reverse | ||
from django.core.files.storage import get_storage_class | ||
from django.db import models | ||
from django.utils.encoding import python_2_unicode_compatible | ||
from django.utils.translation import ugettext_lazy as _ | ||
|
@@ -37,6 +38,7 @@ | |
from readthedocs.vcs_support.utils import Lock, NonBlockingLock | ||
|
||
log = logging.getLogger(__name__) | ||
storage = get_storage_class()() | ||
|
||
|
||
@python_2_unicode_compatible | ||
|
@@ -411,6 +413,24 @@ def get_subproject_urls(self): | |
return [(proj.child.slug, proj.child.get_docs_url()) | ||
for proj in self.subprojects.all()] | ||
|
||
def get_storage_path(self, type_, version_slug=LATEST): | ||
""" | ||
Get a path to a build artifact for use with Django's storage system | ||
|
||
:param type_: Media content type, ie - 'pdf', 'htmlzip' | ||
:param version_slug: Project version slug for lookup | ||
:return: the path to an item in storage | ||
(can be used with ``storage.url`` to get the URL) | ||
""" | ||
extension = type_.replace('htmlzip', 'zip') | ||
return '{}/{}/{}/{}.{}'.format( | ||
type_, | ||
self.slug, | ||
version_slug, | ||
self.slug, | ||
extension, | ||
) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we strictly talking filesystem path here, or URL path, or both? If filesystem, this should use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a storage check, not a filesystem check. |
||
|
||
def get_production_media_path(self, type_, version_slug, include_file=True): | ||
""" | ||
Used to see if these files exist so we can offer them for download. | ||
|
@@ -612,18 +632,23 @@ def has_aliases(self): | |
def has_pdf(self, version_slug=LATEST): | ||
if not self.enable_pdf_build: | ||
return False | ||
return os.path.exists(self.get_production_media_path( | ||
type_='pdf', version_slug=version_slug)) | ||
|
||
path = self.get_production_media_path(type_='pdf', version_slug=version_slug) | ||
storage_path = self.get_storage_path(type_='pdf', version_slug=version_slug) | ||
return os.path.exists(path) or storage.exists(storage_path) | ||
|
||
def has_epub(self, version_slug=LATEST): | ||
if not self.enable_epub_build: | ||
return False | ||
return os.path.exists(self.get_production_media_path( | ||
type_='epub', version_slug=version_slug)) | ||
|
||
path = self.get_production_media_path(type_='epub', version_slug=version_slug) | ||
storage_path = self.get_storage_path(type_='epub', version_slug=version_slug) | ||
return os.path.exists(path) or storage.exists(storage_path) | ||
|
||
def has_htmlzip(self, version_slug=LATEST): | ||
return os.path.exists(self.get_production_media_path( | ||
type_='htmlzip', version_slug=version_slug)) | ||
path = self.get_production_media_path(type_='htmlzip', version_slug=version_slug) | ||
storage_path = self.get_storage_path(type_='htmlzip', version_slug=version_slug) | ||
return os.path.exists(path) or storage.exists(storage_path) | ||
|
||
@property | ||
def sponsored(self): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,7 @@ | |
from django.contrib import messages | ||
from django.contrib.auth.models import User | ||
from django.core.cache import cache | ||
from django.core.files.storage import get_storage_class | ||
from django.core.urlresolvers import reverse | ||
from django.http import Http404, HttpResponse, HttpResponseRedirect | ||
from django.shortcuts import get_object_or_404, render | ||
|
@@ -35,6 +36,7 @@ | |
log = logging.getLogger(__name__) | ||
search_log = logging.getLogger(__name__ + '.search') | ||
mimetypes.add_type('application/epub+zip', '.epub') | ||
storage = get_storage_class()() | ||
|
||
|
||
class ProjectIndex(ListView): | ||
|
@@ -151,7 +153,7 @@ def project_badge(request, project_slug): | |
|
||
|
||
def project_downloads(request, project_slug): | ||
"""A detail view for a project with various dataz.""" | ||
"""A detail view for a project with various downloads.""" | ||
project = get_object_or_404( | ||
Project.objects.protected(request.user), slug=project_slug) | ||
versions = Version.objects.public(user=request.user, project=project) | ||
|
@@ -190,10 +192,18 @@ def project_download_media(request, project_slug, type_, version_slug): | |
) | ||
privacy_level = getattr(settings, 'DEFAULT_PRIVACY_LEVEL', 'public') | ||
if privacy_level == 'public' or settings.DEBUG: | ||
path = os.path.join( | ||
settings.MEDIA_URL, type_, project_slug, version_slug, | ||
'%s.%s' % (project_slug, type_.replace('htmlzip', 'zip'))) | ||
return HttpResponseRedirect(path) | ||
storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug) | ||
if storage.exists(storage_path): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the case where the item is not in storage but is on disk, this will result in an extra network call. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That seems fine, and hopefully we will get files into the storage quickly. I'd be 👍 on syncing all the media files after we ship the code, so that we rarely hit this case, as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did run into an issue locally, where I have a project with ~100 active versions. It required 100 calls to Azure in order to load the project downloads page. This feels less than ideal -- I think the only way to get around it is to denormalize the "has_pdf" state onto the Version model. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That seems suboptimal. Perhaps we could check whether the last build for a version built PDFs and was successful? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it also make sense to cache this lookup then? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
My guess is no. Checking whether any specific epub or pdf exists is not very common. |
||
return HttpResponseRedirect(storage.url(storage_path)) | ||
|
||
media_path = os.path.join( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Believe this logic is actually what the DefaultFileStorage will do, so I don't think this will ever get hit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes it will. In the case where the storage is Azure storage but the PDF isn't yet on Azure (it's only locally on the webs) then this logic is used. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. 👍 |
||
settings.MEDIA_URL, | ||
type_, | ||
project_slug, | ||
version_slug, | ||
'%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')), | ||
) | ||
return HttpResponseRedirect(media_path) | ||
|
||
# Get relative media path | ||
path = ( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the best check for whether or not we should execute this logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the same logic in 8a602c8 to test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit of weird logic but I'm not sure what the alternative is. If the storage is local, we don't want to copy/delete the file. If it is, we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're running into this in multiple places, perhaps it would be better to subsclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually thinking a bit more on this, I don't think it'll solve the problem. The issue is that things are very different if storage is backed by cloud blob storage vs. the local filesystem.
If storage is backed by cloud storage, only 1 web needs to sync any given PDF/Epub/zip to the cloud. Then, if once the copy is done, the original local file can be deleted.
If storage is backed by the filesystem, then the files have already been synced to the right place by the super class. We don't want to delete the local file as that's the file we're doing to serve. We also don't want to do a copy to "storage" as that's a copy from/to the same file.
This is all regardless of what the default syncer/puller is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thinking a bit more, maybe this is possible. We don't want to sync PDFs/ePubs for local filesystem storage at all (they're already synced by the puller/syncer). However, we could add a flag to the storage (AzureMediaStorage for us) specifically for this. That's probably better than doing an
ininstance
check.