Skip to content

Search: stop relying on the DB when indexing #10623

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stsewd opened this issue Aug 10, 2023 · 3 comments · Fixed by #10696
Closed

Search: stop relying on the DB when indexing #10623

stsewd opened this issue Aug 10, 2023 · 3 comments · Fixed by #10696
Assignees
Labels
Needed: design decision A core team decision is required

Comments

@stsewd
Copy link
Member

stsewd commented Aug 10, 2023

What's the problem this feature will solve?

Currently, we are keeping track of all HTML files (this is one of our largest table). We do this mainly for re-indexing (we will be relying on these models for handling 404s, but we don't need to keep track of all files for that).

Describe the solution you'd like

Instead of relying on the DB, walk the storage. And we can get the search ignore/ranking patterns from the config of the build object attached to the version.

def config(self):
"""
Proxy to the configuration of the build.
:returns: The configuration used in the last successful build.
:rtype: dict
"""
last_build = (
self.builds(manager=INTERNAL).filter(
state=BUILD_STATE_FINISHED,
success=True,
).order_by('-date')
.only('_config')
.first()
)
if last_build:
return last_build.config
return None

We would still need to create HTMLFile/ImportedFile models, but only the ones that are needed for our 404 handler (**/index.html, 404.html, and robots.txt)

Alternative solutions

None

Additional context

ref #10512

@stsewd stsewd added the Needed: design decision A core team decision is required label Aug 10, 2023
@humitos
Copy link
Member

humitos commented Aug 21, 2023

This is good to me 👍🏼

I'd like to see some data about:

  • how many AWS S3 requests we would need to do for a normal-size project?
  • what would happen for projects with a bunch of files and directories/sub-directories?

@stsewd
Copy link
Member Author

stsewd commented Aug 30, 2023

@humitos
Copy link
Member

humitos commented Aug 31, 2023

That is 9.11s avg to index a project completely? From that graph, I see that 2.5s from those are because of boto3 calls to the storage.

stsewd added a commit that referenced this issue Aug 31, 2023
@github-project-automation github-project-automation bot moved this to Planned in 📍Roadmap Sep 11, 2023
@stsewd stsewd moved this from Planned to In progress in 📍Roadmap Sep 11, 2023
@stsewd stsewd self-assigned this Sep 11, 2023
@stsewd stsewd moved this from In progress to Planned in 📍Roadmap Sep 11, 2023
@stsewd stsewd moved this from Planned to In progress in 📍Roadmap Sep 11, 2023
@agjohnson agjohnson moved this from In progress to Needs review in 📍Roadmap Sep 13, 2023
stsewd added a commit that referenced this issue Sep 14, 2023
- Removed the "wipe" actions from the admin instead of porting them, since I'm not sure that we need an action in the admin just to delete the search index of a project. Re-index seems useful.
- `fileify` was replaced by `index_build`, and it only requires the build id to be passed, any other information can be retrieved from the build/version object.
- `fileify` isn't removed in this PR to avoid downtimes during deploy, it's safe to keep it around till next deploy.
- New code is avoiding any deep connection to the django-elasticsearch-dsl package, since it doesn't make sense anymore to have it, and I'm planning on removing it.
- We are no longer tracking all files in the DB, only the ones of interest.
- Re-indexing a version will also re-evaluate the files from the DB, useful for old projects that are out of sync.
- The reindex command now generates taks per-version rather than per-collection of files, since we no longer track all files in the DB.


- Closes #10623
- Closes #10690

We don't need to do anything special during deploy, zero downtime out of the box. We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES, so probably better do that after we are all settled with any changes to ES.
@github-project-automation github-project-automation bot moved this from Needs review to Done in 📍Roadmap Sep 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needed: design decision A core team decision is required
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants