Build: refactor search indexing process #11643

stsewd · 2024-10-02T22:07:24Z

Currently, we walk the entire project directory to apply two operations: index files in ES, and keep track of index/404 files. These two operations are independent, but in our code they are kind of mixed together in order to avoid walking the project directory twice.

I have abstracted the processing of the files with an "Indexer" class, which is responsible for doing an operation on a file, and at the end it can collect the results.
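A minimal sketch of that abstraction, for illustration only. Only `collect(self, sync_id)` appears in the diff excerpt quoted in this conversation; the `process` signature, the two example subclasses, and the `walk_once` helper are assumptions made here, not the code in this PR.

```python
from pathlib import Path


class Indexer:
    """One operation applied to every file; results are gathered at the end."""

    def process(self, path: Path, sync_id: int):
        # Hypothetical signature: called once per file during the walk.
        raise NotImplementedError

    def collect(self, sync_id: int):
        # Called once after the walk to hand back (or persist) the results.
        raise NotImplementedError


class IndexFileIndexer(Indexer):
    """Hypothetical indexer that keeps track of index/404 files."""

    def __init__(self):
        self.found = []

    def process(self, path: Path, sync_id: int):
        if path.name in ("index.html", "404.html"):
            self.found.append(path)

    def collect(self, sync_id: int):
        return sorted(self.found)


class PageCountIndexer(Indexer):
    """Hypothetical stand-in for the search indexer: it only counts pages."""

    def __init__(self):
        self.count = 0

    def process(self, path: Path, sync_id: int):
        self.count += 1

    def collect(self, sync_id: int):
        return self.count


def walk_once(root: Path, indexers, sync_id: int):
    """Walk the directory a single time, feeding each file to every indexer."""
    for path in sorted(root.rglob("*.html")):
        for indexer in indexers:
            indexer.process(path, sync_id)
    return [indexer.collect(sync_id) for indexer in indexers]
```

Each indexer stays independent of the others, while the directory is still walked only once.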

Why this refactor now? With the file tree diff feature coming (#11507), I found it easier to just do the operation at the same time we index files for search. We can cache the file contents, so we won't need to read the file twice from storage, and also won't need an additional API to save the results from the builders.


ericholscher

This is a nice refactor 💯

ericholscher · 2024-10-07T18:04:23Z

readthedocs/projects/tasks/search.py

+    def collect(self, sync_id: int):
+        raise NotImplementedError
+


I wonder if this should be called save or store? collect isn't a super common usage here.

I chose collect because it "collects the results", whatever that means for each indexer. It could also be post_process, but I'm fine with anything.

humitos · 2024-10-10T14:26:21Z

Why this refactor now? With the file tree diff feature coming (#11507), I found it easier to just do the operation at the same time we index files for search. We can cache the file contents, so we won't need to read the file twice from storage, and also won't need an additional API to save the results from the builders.

As we originally planned the filetreediff implementation, we don't need to hit S3 to download HTML files, since the manifest is generated inside the build itself. We just need to walk over READTHEDOCS_OUTPUT/html locally and generate the hashes.

I'm not convinced by this implementation, which entangles search and filetreediff by moving the filetreediff implementation to an async Celery task. I'd prefer to keep the filetreediff process separate from search and run it inside the build process itself, as we discussed previously.
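For reference, the local approach described above (walking READTHEDOCS_OUTPUT/html and hashing each file) could look roughly like the sketch below. The function name `build_manifest`, the choice of SHA-256, and the dict-shaped manifest are assumptions for illustration; the actual manifest format is being discussed in the file tree diff PR.

```python
import hashlib
from pathlib import Path


def build_manifest(html_dir: Path) -> dict[str, str]:
    """Map each HTML file's path (relative to html_dir) to a content hash.

    Hypothetical helper: the hash algorithm and manifest layout are
    assumptions, not what Read the Docs implements.
    """
    manifest = {}
    for path in sorted(html_dir.rglob("*.html")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(html_dir).as_posix()] = digest
    return manifest


# Inside a build, this would be called on the local output directory, e.g.:
#   build_manifest(Path(os.environ["READTHEDOCS_OUTPUT"]) / "html")
```

Because everything is read from the local build output, no storage download is involved in this variant.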

stsewd · 2024-10-10T15:08:38Z

This refactor works independently of the final implementation of file tree diff, so merging. Discussion for FTD is happening at #11646.

sentry-io · 2024-10-15T15:45:31Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ ConnectionTimeout: Connection timed out during request (readthedocs.projects.tasks.search.index_build)
‼️ OperationalError: canceling statement due to statement timeout (readthedocs.projects.tasks.search.index_build)


stsewd requested a review from a team as a code owner October 2, 2024 22:07

stsewd requested a review from humitos October 2, 2024 22:07

auto-assign bot assigned stsewd Oct 2, 2024

stsewd force-pushed the refactor-search-index-process branch from e9e61c2 to a182899 Compare October 2, 2024 22:08

stsewd mentioned this pull request Oct 3, 2024

File tree diff #11646

Merged

ericholscher approved these changes Oct 7, 2024


stsewd merged commit 14e4353 into main Oct 10, 2024
6 checks passed

stsewd deleted the refactor-search-index-process branch October 10, 2024 15:08
