Build: refactor search indexing process #11643
Currently, we walk the entire project directory to apply two operations: index files in ES, and keep track of index/404 files. These two operations are independent, but in our code they are kind of mixed together in order to avoid walking the project directory twice. I have abstracted the processing of the files with an "Indexer" class, which is responsible for doing an operation on a file, and at the end it can collect the results.
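To make the idea concrete, here is a minimal sketch of that abstraction, not the code from this PR. Only `collect(self, sync_id: int)` appears in the diff; the `process` method, the two subclasses, and the `walk_once` helper are hypothetical names used for illustration, and the "search" indexer just records paths instead of talking to ES.

```python
import os


class Indexer:
    """Base class: apply one operation per file, then gather the results."""

    def process(self, path: str, sync_id: int):
        # Assumed hook name; the PR only shows `collect`.
        raise NotImplementedError

    def collect(self, sync_id: int):
        raise NotImplementedError


class SearchIndexer(Indexer):
    """Hypothetical stand-in for the ES indexing step: it only records paths."""

    def __init__(self):
        self.indexed = []

    def process(self, path, sync_id):
        if path.endswith(".html"):
            self.indexed.append(path)

    def collect(self, sync_id):
        return sorted(self.indexed)


class IndexFileIndexer(Indexer):
    """Hypothetical stand-in for keeping track of index/404 files."""

    def __init__(self):
        self.found = []

    def process(self, path, sync_id):
        if os.path.basename(path) in {"index.html", "404.html"}:
            self.found.append(path)

    def collect(self, sync_id):
        return sorted(self.found)


def walk_once(root, indexers, sync_id):
    """Walk the directory a single time, handing every file to each indexer."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            for indexer in indexers:
                indexer.process(rel_path, sync_id)
    return [indexer.collect(sync_id) for indexer in indexers]
```

With this shape, the two operations stay in separate classes while still sharing one directory walk, e.g. `walk_once(build_dir, [SearchIndexer(), IndexFileIndexer()], sync_id=1)`.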
This is a nice refactor 💯
    def collect(self, sync_id: int):
        raise NotImplementedError
I wonder if this should be called `save` or `store`? `collect` isn't a super common usage here.
I chose `collect` because this "collects the results", whatever those results may be. It could also be `post_process`, but I'm fine with anything.
In the way we originally designed filetreediff, we don't need to hit S3 to download HTML files, since the process that generates the manifest happens inside the build itself. We just need to walk over

I'm not convinced about this implementation, which entangles search and filetreediff, moving the implementation of filetreediff to an async Celery task. I'd prefer to keep the filetreediff process separate from search and run it inside the build process itself, as we discussed previously.
This refactor works independently of the final implementation of file tree diff, so merging. Discussion for FTD is happening at #11646.
Why this refactor now? With the file tree diff feature coming (#11507), I found it easier to just do the operation at the same time we index files for search. We can cache the file contents, so we won't need to read the file twice from storage, and also won't need an additional API to save the results from the builders.
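As a rough illustration of the "read once, use twice" point, the sketch below caches a file's contents so that a search step and a diff step share a single read. Everything here is an assumption for illustration: `CachedFile`, `search_text`, and `content_hash` are hypothetical names, and hashing the content is only one possible way a file tree diff could compare files, not necessarily what #11507 does.

```python
import hashlib
from functools import cached_property


class CachedFile:
    """Hypothetical wrapper that reads a file from storage at most once."""

    def __init__(self, path, read):
        self.path = path
        self._read = read  # callable(path) -> str, e.g. a storage backend read

    @cached_property
    def content(self):
        # Evaluated on first access only; later accesses reuse the stored value.
        return self._read(self.path)


def search_text(file):
    """Stand-in for the search operation: normalizes the content."""
    return file.content.lower()


def content_hash(file):
    """Stand-in for the file tree diff operation: hashes the same content."""
    return hashlib.sha256(file.content.encode("utf-8")).hexdigest()
```

Both helpers receive the same `CachedFile` instance, so the storage read happens once no matter how many operations consume the content.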