Search indexing with storage #5854
Conversation
Nice! I just did a quick look and it looks good.

There are going to be some performance implications of this when storage is cloud storage as opposed to local storage. If, however, we stop doing any processing on non-HTML files, the performance difference probably won't matter.
Looks like a good approach, and would benefit from a few cleanup/speedup ideas I've been meaning to implement. I need to test this locally a bit to understand it fully. Really excited to get this working.
    type_='json', version_slug=self.version.slug, include_file=False
)
try:
    for fjson_path in fjson_paths:
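For context, the loop above iterates over Sphinx `.fjson` search-index files discovered by walking the storage backend. A minimal self-contained sketch of that discovery step — the `FakeStorage` class and helper names here are illustrative, not the PR's actual code:

```python
import posixpath


class FakeStorage:
    """Minimal stand-in for a storage backend exposing a ``walk()``
    helper like the one this PR relies on; real Django storage
    backends differ, so treat this as illustrative only."""

    def __init__(self, files):
        # Mapping of directory path -> list of filenames in it
        self._files = files

    def walk(self, top):
        for directory, filenames in self._files.items():
            if directory.startswith(top):
                yield directory, [], filenames


def find_fjson_paths(storage, storage_path):
    """Collect Sphinx .fjson search-index files under storage_path."""
    paths = []
    for root, __, filenames in storage.walk(storage_path):
        for filename in filenames:
            if filename.endswith('.fjson'):
                paths.append(posixpath.join(root, filename))
    return paths
```

The key point is that the indexing code only needs a `walk()`-style listing, so the same loop works against a filesystem or a cloud backend.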
I believe we should be able to get away from this soon. We should be starting to store the proper path of a file after readthedocs/readthedocs-sphinx-ext#62 is merged.
readthedocs/projects/tasks.py
storage_path = version.project.get_storage_path(
    type_='html', version_slug=version.slug, include_file=False
)
for root, __, filenames in storage.walk(storage_path):
    for filename in filenames:
        if filename.endswith('.html'):
            model_class = HTMLFile
        else:
I'd like to add the `elif project.cdn_enabled` here with an `else: continue`. 👍
What does `project.cdn_enabled` do exactly, and what would it do differently for search?
It means we store all `ImportedFile`s, not just `HTMLFile`s, so we can purge them from the CDN properly when they change.
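To make the suggested branching concrete, here is a hedged sketch of the decision: HTML files always become `HTMLFile` records, other files only become `ImportedFile` records when the project is behind a CDN (so they can be purged on change), and everything else is skipped. Only the model names come from the diff; the function itself is illustrative.

```python
def classify_file(filename, cdn_enabled):
    """Illustrative version of the branching suggested in review.

    HTML files are always recorded; non-HTML files are recorded only
    for CDN-enabled projects (so a change can trigger a CDN purge);
    otherwise the file is skipped entirely.
    """
    if filename.endswith('.html'):
        return 'HTMLFile'
    elif cdn_enabled:
        return 'ImportedFile'
    return None  # non-HTML file on a non-CDN project: skip it
```

In the real loop the `None` case would map to a `continue`, matching the `else: continue` proposed above.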
Codecov Report
@@ Coverage Diff @@
## master #5854 +/- ##
=========================================
Coverage ? 79.26%
=========================================
Files ? 175
Lines ? 10826
Branches ? 1350
=========================================
Hits ? 8581
Misses ? 1891
Partials ? 354
Continue to review full report at Codecov.
This should be good for a full review. The test failure seems to be in master and is unrelated.
This looks good 🎉. In general, this should just work locally and in prod, right? Locally & .com it will just use the filesystem, but in prod we'll be hitting backend storage, which might cause some performance issues?
This is going to conflict with a bit of the logic we added in the search updates around GSOC, so I'm going to wait to merge this until after that is merged. It shouldn't be too much; we just had to change a bit of the logic around the ordering of creating `HTMLFile`s & `SphinxDomain`s and indexing them.
:param commit: Commit that updated path
:param build: Build id
"""

if not settings.RTD_BUILD_MEDIA_STORAGE:
    return
We should likely log something here, so we don't get confused if indexing isn't working.
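A minimal sketch of what that guard could look like with a log line, so a silently skipped index run is easy to diagnose. The function and argument names here are assumptions, not the PR's actual code:

```python
import logging

log = logging.getLogger(__name__)


def should_index(media_storage_setting, version_slug, build_id):
    """Early-return guard with the logging suggested above (sketch)."""
    if not media_storage_setting:
        log.warning(
            'RTD_BUILD_MEDIA_STORAGE is unset; skipping file indexing. '
            'version=%s build=%s',
            version_slug,
            build_id,
        )
        return False
    return True
```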
readthedocs/settings/base.py
@@ -234,7 +234,7 @@ def USE_PROMOS(self): # noqa

# Optional Django Storage subclass used to write build artifacts to cloud or local storage
# https://docs.readthedocs.io/en/stable/settings.html#build-media-storage
This link needs to be updated, and the default changed.
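For reference, the setting under discussion is a dotted import path to a Django storage subclass, along these lines — the exact class path below is an assumption on my part, not the real default:

```python
# Optional Django Storage subclass used to write build artifacts to
# cloud or local storage. The class path here is illustrative; the
# actual default lives in readthedocs/settings/base.py.
RTD_BUILD_MEDIA_STORAGE = 'readthedocs.builds.storage.BuildMediaFileSystemStorage'
```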
I think this is good for a full review. For .org and for development, this should just work without issue. For .com, I think
Excited to get this shipped and delete JSON from our disks. 🎉
Should be good to merge w/ conflicts fixed.
It looks like some of the test directories changed in the last week and that's causing the test failures. I'll get them fixed up.
- Fixed a bug involving the path on imported files
- Fixed tests to account for extra files in test dir
Ok, I believe the tests should pass. The added tests did actually uncover a bug in this implementation. The paths generated were not quite right.
Great, I'll get this shipped today 👍
Changes

- Creates `HTMLFile`s and `ImportedFile`s from storage rather than from local disk
- Controlled by `settings.RTD_BUILD_MEDIA_STORAGE`. This has implications for RTD corporate as well as development.

Note: "storage" could be a filesystem backed storage rather than a remote cloud storage, but basically this uses the Django storage abstraction.
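The point of the Django storage abstraction is that the indexing code only depends on a small interface, so filesystem and cloud backends are interchangeable. A toy demonstration of that idea — this mimics the storage API with stdlib code rather than importing Django, and all names are illustrative:

```python
import os
import posixpath
import tempfile


class LocalStorage:
    """Toy filesystem-backed storage exposing the two methods the
    indexing code needs (walk and open). A cloud backend could expose
    the same interface, which is the point of the abstraction."""

    def __init__(self, root):
        self.root = root

    def walk(self, top):
        for dirpath, dirnames, filenames in os.walk(os.path.join(self.root, top)):
            rel = os.path.relpath(dirpath, self.root)
            yield rel.replace(os.sep, '/'), dirnames, filenames

    def open(self, name):
        return open(os.path.join(self.root, name), 'rb')


# Build a tiny tree, then list the HTML files through the abstraction.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'html'))
with open(os.path.join(root, 'html', 'index.html'), 'wb') as f:
    f.write(b'<html></html>')

storage = LocalStorage(root)
html_files = [
    posixpath.join(r, fn)
    for r, __, fns in storage.walk('html')
    for fn in fns
    if fn.endswith('.html')
]
```

Swapping `LocalStorage` for a remote backend with the same `walk`/`open` methods would leave the listing code untouched, which is what lets this PR work both locally and in production.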