Store ePubs and PDFs in media storage #4947


Merged (19 commits) on Feb 26, 2019
69 changes: 69 additions & 0 deletions readthedocs/builds/syncers.py
@@ -14,12 +14,15 @@

from builtins import object
from django.conf import settings
from django.core.exceptions import SuspiciousFileOperation
from django.core.files.storage import get_storage_class, FileSystemStorage

from readthedocs.core.utils.extend import SettingsOverrideObject
from readthedocs.core.utils import safe_makedirs


log = logging.getLogger(__name__)
storage = get_storage_class()()


class BaseSyncer(object):
@@ -43,6 +46,11 @@ def copy(cls, path, target, is_file=False, **kwargs):
return
if os.path.exists(target):
os.remove(target)

# Create containing directory if it doesn't exist
directory = os.path.dirname(target)
safe_makedirs(directory)

shutil.copy2(path, target)
else:
if os.path.exists(target):
@@ -138,6 +146,10 @@ def copy(cls, path, target, host, is_file=False, **kwargs):  # pylint: disable=arguments-differ
log.info("Remote Pull %s to %s", path, target)
if not is_file and not os.path.exists(target):
safe_makedirs(target)
if is_file:
# Create containing directory if it doesn't exist
directory = os.path.dirname(target)
safe_makedirs(directory)
# Add a slash when copying directories
sync_cmd = "rsync -e 'ssh -T' -av --delete {user}@{host}:{path} {target}".format(
host=host,
@@ -154,6 +166,63 @@ def copy(cls, path, target, host, is_file=False, **kwargs):  # pylint: disable=arguments-differ
)


class SelectiveStorageRemotePuller(RemotePuller):

"""
Exactly like RemotePuller except that certain files are copied via Django's storage system.

If a copied file has an extension listed in ``extensions``, it is uploaded to storage
and the local original is removed.

See: https://docs.djangoproject.com/en/1.11/ref/settings/#std:setting-DEFAULT_FILE_STORAGE
"""

extensions = ('.pdf', '.epub', '.zip')

@classmethod
def get_storage_path(cls, path):
"""
Gets the path to the file within the storage engine

For example, if the path was $MEDIA_ROOT/pdfs/latest.pdf
the storage_path is 'pdfs/latest.pdf'

:raises: SuspiciousFileOperation if the path isn't under settings.MEDIA_ROOT
"""
path = os.path.normpath(path)
if not path.startswith(settings.MEDIA_ROOT):
raise SuspiciousFileOperation

path = path.replace(settings.MEDIA_ROOT, '').lstrip('/')
return path
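As a standalone illustration of the prefix stripping above (a sketch only: `MEDIA_ROOT` here is a hypothetical stand-in for `settings.MEDIA_ROOT`, and `ValueError` stands in for Django's `SuspiciousFileOperation`):

```python
import os

MEDIA_ROOT = '/var/media'  # hypothetical stand-in for settings.MEDIA_ROOT

def storage_path_for(path, media_root=MEDIA_ROOT):
    # Normalize first so '..' segments can't escape the media root
    path = os.path.normpath(path)
    if not path.startswith(media_root):
        # The PR raises SuspiciousFileOperation here
        raise ValueError('path is not under MEDIA_ROOT')
    return path.replace(media_root, '').lstrip('/')

print(storage_path_for('/var/media/pdf/myproject/latest/myproject.pdf'))
# -> pdf/myproject/latest/myproject.pdf
```

Note that the containment check relies on normalization happening first; a raw `startswith` on an unnormalized path would accept `/var/media/../etc/passwd`.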

@classmethod
def copy(cls, path, target, host, is_file=False, **kwargs): # pylint: disable=arguments-differ
RemotePuller.copy(path, target, host, is_file, **kwargs)

if isinstance(storage, FileSystemStorage):
Member:

Is this the best check for whether or not we should execute this logic?

Member:

I used the same logic in 8a602c8 to test for this.

Contributor Author:

It is a bit of weird logic but I'm not sure what the alternative is. If the storage is local, we don't want to copy/delete the file. If it is, we do.

Contributor:

If we're running into this in multiple places, perhaps it would be better to subclass this and be explicit here about what we're actually trying to check with this condition. Having implicit rules based on class type can lead to maintenance headaches. So:

class LocalFileSystemStorage(FileSystemStorage):
    can_sync = True

if getattr(storage, 'can_sync', False):
    pass

Contributor Author:

I like that idea!

Contributor Author:

Actually, thinking a bit more on this, I don't think it'll solve the problem. The issue is that things are very different if storage is backed by cloud blob storage vs. the local filesystem.

If storage is backed by cloud storage, only one web needs to sync any given PDF/ePub/zip to the cloud. Then, once the copy is done, the original local file can be deleted.

If storage is backed by the filesystem, then the files have already been synced to the right place by the superclass. We don't want to delete the local file, as that's the file we're going to serve. We also don't want to do a copy to "storage", as that would copy from and to the same file.

This is all regardless of what the default syncer/puller is.

Contributor Author:

Ok, thinking a bit more, maybe this is possible. We don't want to sync PDFs/ePubs for local filesystem storage at all (they're already synced by the puller/syncer). However, we could add a flag to the storage class (AzureMediaStorage for us) specifically for this. That's probably better than doing an isinstance check.

# This is a sanity check for the case where
# storage is backed by the local filesystem
# In that case, removing the original target file locally
# would remove the file from storage as well
return

if is_file and os.path.exists(target) and \
any([target.lower().endswith(ext) for ext in cls.extensions]):
log.info("Selective Copy %s to media storage", target)

storage_path = cls.get_storage_path(target)

if storage.exists(storage_path):
storage.delete(storage_path)

with open(target, 'rb') as fd:
storage.save(storage_path, fd)

# remove the original after copying
os.remove(target)
Member:

If we remove this for now, we should be able to deploy this, and keep a copy on the webs as well as in Azure? Seems like that might be a good idea for transitioning.

Contributor Author:

Agreed.



class Syncer(SettingsOverrideObject):
_default_class = LocalSyncer
_override_setting = 'FILE_SYNCER'
37 changes: 31 additions & 6 deletions readthedocs/projects/models.py
@@ -12,6 +12,7 @@
from django.conf import settings
from django.contrib.auth.models import User
from django.core.urlresolvers import NoReverseMatch, reverse
from django.core.files.storage import get_storage_class
from django.db import models
from django.utils.encoding import python_2_unicode_compatible
from django.utils.translation import ugettext_lazy as _
@@ -37,6 +38,7 @@
from readthedocs.vcs_support.utils import Lock, NonBlockingLock

log = logging.getLogger(__name__)
storage = get_storage_class()()


@python_2_unicode_compatible
@@ -411,6 +413,24 @@ def get_subproject_urls(self):
return [(proj.child.slug, proj.child.get_docs_url())
for proj in self.subprojects.all()]

def get_storage_path(self, type_, version_slug=LATEST):
"""
Get a path to a build artifact for use with Django's storage system

:param type_: Media content type, ie - 'pdf', 'htmlzip'
:param version_slug: Project version slug for lookup
:return: the path to an item in storage
(can be used with ``storage.url`` to get the URL)
"""
extension = type_.replace('htmlzip', 'zip')
return '{}/{}/{}/{}.{}'.format(
type_,
self.slug,
version_slug,
self.slug,
extension,
)
Contributor:

Are we strictly talking filesystem path here, or URL path, or both? If filesystem, this should use os.path.join()

Contributor Author:

This is a storage check, not a filesystem check.
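The storage-key layout produced by `get_storage_path` can be shown with a standalone sketch (the function below replicates the format string above outside the model; `myproject` and the version slugs are illustrative values):

```python
LATEST = 'latest'  # mirrors the default version slug

def storage_path(type_, project_slug, version_slug=LATEST):
    # 'htmlzip' artifacts are stored with a plain '.zip' file
    # extension, but keep 'htmlzip' as their top-level prefix
    extension = type_.replace('htmlzip', 'zip')
    return '{}/{}/{}/{}.{}'.format(
        type_, project_slug, version_slug, project_slug, extension,
    )

print(storage_path('pdf', 'myproject'))
# -> pdf/myproject/latest/myproject.pdf
print(storage_path('htmlzip', 'myproject', 'stable'))
# -> htmlzip/myproject/stable/myproject.zip
```

Since these are storage keys rather than filesystem paths, the forward-slash format string is deliberate; `os.path.join` would produce backslashes on Windows.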


def get_production_media_path(self, type_, version_slug, include_file=True):
"""
Used to see if these files exist so we can offer them for download.
@@ -612,18 +632,23 @@ def has_aliases(self):
def has_pdf(self, version_slug=LATEST):
if not self.enable_pdf_build:
return False
return os.path.exists(self.get_production_media_path(
type_='pdf', version_slug=version_slug))

path = self.get_production_media_path(type_='pdf', version_slug=version_slug)
storage_path = self.get_storage_path(type_='pdf', version_slug=version_slug)
return os.path.exists(path) or storage.exists(storage_path)

def has_epub(self, version_slug=LATEST):
if not self.enable_epub_build:
return False
return os.path.exists(self.get_production_media_path(
type_='epub', version_slug=version_slug))

path = self.get_production_media_path(type_='epub', version_slug=version_slug)
storage_path = self.get_storage_path(type_='epub', version_slug=version_slug)
return os.path.exists(path) or storage.exists(storage_path)

def has_htmlzip(self, version_slug=LATEST):
return os.path.exists(self.get_production_media_path(
type_='htmlzip', version_slug=version_slug))
path = self.get_production_media_path(type_='htmlzip', version_slug=version_slug)
storage_path = self.get_storage_path(type_='htmlzip', version_slug=version_slug)
return os.path.exists(path) or storage.exists(storage_path)

@property
def sponsored(self):
39 changes: 24 additions & 15 deletions readthedocs/projects/tasks.py
@@ -957,40 +957,49 @@ def move_files(version_pk, hostname, html=False, localmedia=False,
Syncer.copy(from_path, to_path, host=hostname)

if localmedia:
from_path = version.project.artifact_path(
version=version.slug,
type_='sphinx_localmedia',
from_path = os.path.join(
version.project.artifact_path(
version=version.slug,
type_='sphinx_localmedia',
),
'{}.zip'.format(version.project.slug),
)
to_path = version.project.get_production_media_path(
type_='htmlzip',
version_slug=version.slug,
include_file=False,
include_file=True,
)
Syncer.copy(from_path, to_path, host=hostname)
Syncer.copy(from_path, to_path, host=hostname, is_file=True)

# Always move PDF's because the return code lies.
if pdf:
from_path = version.project.artifact_path(
version=version.slug,
type_='sphinx_pdf',
from_path = os.path.join(
version.project.artifact_path(
version=version.slug,
type_='sphinx_pdf',
),
'{}.pdf'.format(version.project.slug),
)
to_path = version.project.get_production_media_path(
type_='pdf',
version_slug=version.slug,
include_file=False,
include_file=True,
)
Syncer.copy(from_path, to_path, host=hostname)
Syncer.copy(from_path, to_path, host=hostname, is_file=True)
if epub:
from_path = version.project.artifact_path(
version=version.slug,
type_='sphinx_epub',
from_path = os.path.join(
version.project.artifact_path(
version=version.slug,
type_='sphinx_epub',
),
'{}.epub'.format(version.project.slug),
)
to_path = version.project.get_production_media_path(
type_='epub',
version_slug=version.slug,
include_file=False,
include_file=True,
)
Syncer.copy(from_path, to_path, host=hostname)
Syncer.copy(from_path, to_path, host=hostname, is_file=True)
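The three branches above make the same change: point `Syncer.copy` at the single built file (with `is_file=True`) instead of its containing directory. A minimal sketch of that path construction, where `ARTIFACT_ROOT` and the directory layout are hypothetical stand-ins for what `Version.project.artifact_path` actually returns:

```python
import os

ARTIFACT_ROOT = '/home/docs/artifacts'  # hypothetical root

def built_file_path(version_slug, artifact_type, project_slug, extension):
    # e.g. a sphinx_pdf build produces <artifact_dir>/<project_slug>.pdf
    artifact_dir = os.path.join(ARTIFACT_ROOT, version_slug, artifact_type)
    return os.path.join(artifact_dir, '{}.{}'.format(project_slug, extension))

print(built_file_path('latest', 'sphinx_pdf', 'myproject', 'pdf'))
# -> /home/docs/artifacts/latest/sphinx_pdf/myproject.pdf
```

Passing the exact file (plus `include_file=True` on the target) is what lets `SelectiveStorageRemotePuller` later test the file's extension against its allowlist.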


@app.task(queue='web')
20 changes: 15 additions & 5 deletions readthedocs/projects/views/public.py
@@ -16,6 +16,7 @@
from django.contrib import messages
from django.contrib.auth.models import User
from django.core.cache import cache
from django.core.files.storage import get_storage_class
from django.core.urlresolvers import reverse
from django.http import Http404, HttpResponse, HttpResponseRedirect
from django.shortcuts import get_object_or_404, render
@@ -35,6 +36,7 @@
log = logging.getLogger(__name__)
search_log = logging.getLogger(__name__ + '.search')
mimetypes.add_type('application/epub+zip', '.epub')
storage = get_storage_class()()


class ProjectIndex(ListView):
@@ -151,7 +153,7 @@ def project_badge(request, project_slug):


def project_downloads(request, project_slug):
"""A detail view for a project with various dataz."""
"""A detail view for a project with various downloads."""
project = get_object_or_404(
Project.objects.protected(request.user), slug=project_slug)
versions = Version.objects.public(user=request.user, project=project)
@@ -190,10 +192,18 @@ def project_download_media(request, project_slug, type_, version_slug):
)
privacy_level = getattr(settings, 'DEFAULT_PRIVACY_LEVEL', 'public')
if privacy_level == 'public' or settings.DEBUG:
path = os.path.join(
settings.MEDIA_URL, type_, project_slug, version_slug,
'%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')))
return HttpResponseRedirect(path)
storage_path = version.project.get_storage_path(type_=type_, version_slug=version_slug)
if storage.exists(storage_path):
Contributor Author:

In the case where the item is not in storage but is on disk, this will result in an extra network call. storage.exists results in an API call to Azure (or wherever storage is backed). However, it should be reasonably fast since it's in the same data center. In the case where media is backed by the local file system, this is just an os.path.isfile so not a big deal.

Member:

That seems fine, and hopefully we will get files into the storage quickly. I'd be 👍 on syncing all the media files after we ship the code, so that we rarely hit this case, as well.

Member:

I did run into an issue locally, where I have a project with ~100 active versions. It required 100 calls to Azure in order to load the project downloads page. This feels less than ideal -- I think the only way to get around it is to denormalize the "has_pdf" state onto the Version model.

Contributor Author:

That seems suboptimal. Perhaps we could check whether the last build for a version built PDFs and was successful?

Contributor:

Would it also make sense to cache this lookup then?

Contributor Author:

> Would it also make sense to cache this lookup then?

My guess is no. Checking whether any specific epub or pdf exists is not very common.

return HttpResponseRedirect(storage.url(storage_path))

media_path = os.path.join(
Member:

Believe this logic is actually what the DefaultFileStorage will do, so I don't think this will ever get hit.

Contributor Author:

Yes it will. In the case where the storage is Azure storage but the PDF isn't yet on Azure (it's only locally on the webs) then this logic is used.

Member:

Good point. 👍

settings.MEDIA_URL,
type_,
project_slug,
version_slug,
'%s.%s' % (project_slug, type_.replace('htmlzip', 'zip')),
)
return HttpResponseRedirect(media_path)

# Get relative media path
path = (
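The resulting lookup order in `project_download_media` can be sketched as a small pure function. This is a sketch under assumptions: the storage check and URL generation are replaced by plain arguments, and `posixpath.join` stands in for the `os.path.join` on `MEDIA_URL` used above (URL paths always use forward slashes):

```python
import posixpath

def download_redirect(storage_has_file, storage_url, media_url,
                      type_, project_slug, version_slug):
    # Prefer the media-storage URL; fall back to the legacy
    # MEDIA_URL path served from the web servers' local disk
    if storage_has_file:
        return storage_url
    filename = '{}.{}'.format(project_slug, type_.replace('htmlzip', 'zip'))
    return posixpath.join(media_url, type_, project_slug,
                          version_slug, filename)

print(download_redirect(False, None, '/media/', 'htmlzip',
                        'myproject', 'latest'))
# -> /media/htmlzip/myproject/latest/myproject.zip
```

This matches the transition plan discussed in the review: files not yet uploaded to storage keep working through the legacy path, so no downloads break during the migration.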