Search: stop relying on the DB when indexing #10623

stsewd · 2023-08-10T18:49:47Z

What's the problem this feature will solve?

Currently, we are keeping track of all HTML files (this is one of our largest tables). We do this mainly for re-indexing. We will also be relying on these models for handling 404s, but we don't need to keep track of all files for that.

Describe the solution you'd like

Instead of relying on the DB, walk the storage. We can get the search ignore/ranking patterns from the config of the build object attached to the version:

readthedocs.org/readthedocs/builds/models.py

Lines 310 to 327 in 84f889a

    
    def config(self):
        """
        Proxy to the configuration of the build.

        :returns: The configuration used in the last successful build.
        :rtype: dict
        """
        last_build = (
            self.builds(manager=INTERNAL).filter(
                state=BUILD_STATE_FINISHED,
                success=True,
            ).order_by('-date')
            .only('_config')
            .first()
        )
        if last_build:
            return last_build.config
        return None
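
A minimal sketch of how the ignore/ranking patterns could be read from that config and applied to paths found while walking the storage. The layout of the `search` key, the helper names, and the use of `fnmatchcase` for matching are assumptions for illustration, not the project's actual implementation.

```python
from fnmatch import fnmatchcase


def search_settings(config):
    """Return (ignore_patterns, ranking) from a build config dict.

    Assumption: the config has an optional "search" key shaped like
    {"ignore": [pattern, ...], "ranking": {pattern: rank}}.
    A missing config (None) or missing keys fall back to empty values.
    """
    search = (config or {}).get("search") or {}
    ignore = list(search.get("ignore") or [])
    ranking = dict(search.get("ranking") or {})
    return ignore, ranking


def is_ignored(path, ignore_patterns):
    """True if the path matches any ignore pattern."""
    return any(fnmatchcase(path, pattern) for pattern in ignore_patterns)


def rank_for(path, ranking, default=0):
    """Return the rank of the first matching pattern, or the default."""
    for pattern, rank in ranking.items():
        if fnmatchcase(path, pattern):
            return rank
    return default
```

Because `version.config` can return `None` when there is no successful build, the sketch treats a missing config as "no patterns" rather than failing.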

We would still need to create HTMLFile/ImportedFile models, but only the ones that are needed for our 404 handler (**/index.html, 404.html, and robots.txt)
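
The selection rule could be sketched roughly like this. It assumes `index.html` is tracked at any depth (the `**/index.html` pattern), while `404.html` and `robots.txt` are tracked only at the root of the version; that reading of the patterns, and the `is_tracked` name, are assumptions.

```python
import posixpath

# Assumption: these two files only matter at the root of the version.
ROOT_ONLY_FILES = {"404.html", "robots.txt"}


def is_tracked(path):
    """Decide whether a storage-relative path still needs a DB record."""
    path = path.lstrip("/")
    # **/index.html: any index.html, at any depth.
    if posixpath.basename(path) == "index.html":
        return True
    return path in ROOT_ONLY_FILES
```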

Alternative solutions

None

Additional context

ref #10512


humitos · 2023-08-21T14:40:14Z

This looks good to me 👍🏼

I'd like to see some data about:

- How many AWS S3 requests would we need to do for a normal-size project?
- What would happen for projects with a bunch of files and directories/sub-directories?
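
A rough back-of-the-envelope sketch for these questions, not a measurement. It assumes the storage is walked directory by directory (at least one list call per directory), that each list call returns at most 1000 entries (the S3 `ListObjectsV2` page limit), and that reading each HTML file costs one GET. The function name and inputs are hypothetical.

```python
import math

# S3 ListObjectsV2 returns at most 1000 entries per call.
S3_MAX_KEYS_PER_LIST = 1000


def estimate_requests(entries_per_directory, html_files):
    """Estimate (list_requests, get_requests) for one walk of a version.

    entries_per_directory: number of entries in each directory visited.
    html_files: number of HTML files whose content is read for indexing.
    """
    list_requests = sum(
        # Even an empty directory costs one list call.
        max(1, math.ceil(entries / S3_MAX_KEYS_PER_LIST))
        for entries in entries_per_directory
    )
    get_requests = html_files
    return list_requests, get_requests
```

Under these assumptions, the number of list calls grows with the number of directories, and the number of GETs grows with the number of HTML files.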

stsewd · 2023-08-30T15:59:39Z

@humitos times would be similar to the fileify task: https://one.newrelic.com/nr1-core/apm-features/transactions/NjM3Njd8QVBNfEFQUExJQ0FUSU9OfDIzNzkwOTY?account=63767&duration=7776000000&filters=%28domain%20IN%20%28%27APM%27%2C%20%27EXT%27%29%20AND%20type%20IN%20%28%27APPLICATION%27%2C%20%27SERVICE%27%29%29&state=67722edc-1dda-77a2-6e2d-45498ce1eca2. To optimize the process, we can trigger the search indexing while searching for index.html/404.html files.
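
One way to read that optimization is a single pass over the files that collects both the pages to index and the files to keep in the DB. The sketch below uses `os.walk` on a local directory as a stand-in for walking the storage; the function name and the root-only rule for `404.html`/`robots.txt` are assumptions.

```python
import os


def split_walk(root):
    """Walk `root` once and return (to_index, to_track) relative paths.

    to_index: every HTML file, a candidate for the search index.
    to_track: files the 404 handler needs (index.html anywhere;
              404.html and robots.txt at the root, by assumption).
    """
    to_index, to_track = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel = os.path.relpath(full_path, root).replace(os.sep, "/")
            if name.endswith(".html"):
                to_index.append(rel)
            if name == "index.html" or rel in ("404.html", "robots.txt"):
                to_track.append(rel)
    return to_index, to_track
```

Doing both in one pass avoids listing the storage twice, which matters if the list calls account for a noticeable share of the task time.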

humitos · 2023-08-31T11:19:11Z

That is 9.11s avg to index a project completely? From that graph, I see that 2.5s of those are spent on boto3 calls to the storage.

- Removed the "wipe" actions from the admin instead of porting them, since I'm not sure that we need an action in the admin just to delete the search index of a project. Re-index seems useful.
- `fileify` was replaced by `index_build`, and it only requires the build id to be passed; any other information can be retrieved from the build/version object.
- `fileify` isn't removed in this PR to avoid downtime during deploy; it's safe to keep it around till the next deploy.
- New code avoids any deep connection to the django-elasticsearch-dsl package, since it doesn't make sense to have it anymore, and I'm planning on removing it.
- We are no longer tracking all files in the DB, only the ones of interest.
- Re-indexing a version will also re-evaluate the files from the DB, useful for old projects that are out of sync.
- The reindex command now generates tasks per-version rather than per-collection of files, since we no longer track all files in the DB.
- Closes #10623
- Closes #10690

We don't need to do anything special during deploy, zero downtime out of the box. We can trigger a re-index for all versions if we want to delete the HTML files that we don't need from the DB, but that operation will also re-index their contents in ES, so it's probably better to do that after we are all settled with any changes to ES.

stsewd added the "Needed: design decision" (A core team decision is required) label Aug 10, 2023

humitos mentioned this issue Aug 22, 2023

ImportedFile: use BigAutoField for primary key #9669

Merged

stsewd mentioned this issue Aug 22, 2023

Proxito: Don't hit storage for 404s #10617

Merged

stsewd mentioned this issue Aug 30, 2023

Build: track imported files for external versions #10690

Closed

stsewd added a commit that referenced this issue Aug 31, 2023

Search: stop relying on the DB when indexing

871bfe2

- Closes #10623
- Closes #10690

stsewd mentioned this issue Aug 31, 2023

Search: stop relying on the DB when indexing #10696

Merged

stsewd added this to 📍Roadmap Sep 11, 2023

github-project-automation bot moved this to Planned in 📍Roadmap Sep 11, 2023

stsewd moved this from Planned to In progress in 📍Roadmap Sep 11, 2023

stsewd self-assigned this Sep 11, 2023

stsewd moved this from In progress to Planned in 📍Roadmap Sep 11, 2023

stsewd moved this from Planned to In progress in 📍Roadmap Sep 11, 2023

agjohnson moved this from In progress to Needs review in 📍Roadmap Sep 13, 2023

stsewd closed this as completed in #10696 Sep 14, 2023

github-project-automation bot moved this from Needs review to Done in 📍Roadmap Sep 14, 2023
