Skip to content

Search: Only use generic parsers #10272

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
4 tasks done
benjaoming opened this issue Apr 26, 2023 · 6 comments · Fixed by #10676
Closed
4 tasks done

Search: Only use generic parsers #10272

benjaoming opened this issue Apr 26, 2023 · 6 comments · Fixed by #10676
Assignees
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code Operations Operations or server issue

Comments

@benjaoming
Copy link
Contributor

benjaoming commented Apr 26, 2023

Since #10128 removed specific Sphinx domain logic in the SphinxParser, we aren't doing much more search indexing that's specific for Sphinx or MkDocs.

  • Make SphinxParser redundant (move anything essential into the GenericParser)
  • Make MkDocsParser redundant (move anything essential into the GenericParser)
  • Remove entire readthedocs.sphinx_domains application?
  • Backup and delete database table (ops)

CC: @stsewd does this list look good? Perhaps we can break into separate issues/PRs if any change becomes too complex.

@benjaoming benjaoming added Improvement Minor improvement to code Operations Operations or server issue Accepted Accepted issue on our roadmap labels Apr 26, 2023
@github-project-automation github-project-automation bot moved this to Planned in 📍Roadmap Apr 26, 2023
@stsewd
Copy link
Member

stsewd commented Apr 26, 2023

Looks good. Not sure if we need to back up the tables, since we aren't using them anymore, but not strong opinion about not doing it.

We should also check if we can remove the generation of the readthedocs-sphinx-domain-names.json file from our sphinx extension and the fjson files too.

type_file = build_media_storage.join(json_storage_path, 'readthedocs-sphinx-domain-names.json')

https://github.com/readthedocs/readthedocs-sphinx-ext/blob/a46f5ca05c7c8ec7ec30b44cc724ac334ce91a38/readthedocs_ext/readthedocs.py#L198

https://github.com/readthedocs/readthedocs-sphinx-ext/blob/f1145b1819458643f442b21083b3397aeebc8e72/readthedocs_ext/readthedocs.py#L280-L280

@humitos
Copy link
Member

humitos commented Apr 27, 2023

We should also check if we can remove the generation of the readthedocs-sphinx-domain-names.json file from our sphinx extension and the fjson files too.

💯 , this will speed up the build time a little bit. I'm 👍🏼 on removing it if possible.

stsewd added a commit that referenced this issue Jun 21, 2023
We are indexing sphinx domains as sections now, we no longer need to create sphinx domains or index them. We should see some small improvement in build time, and maybe in search?

This change will require a re-index, since old projects that haven't been re-build still have sphinx domain objects indexed separately from sections (this step can be done after or before the deploy).

This also fixes a small bug, we are still indexing sphinx domains titles, without their content

![Screenshot 2023-06-20 at 14-36-48 Read the Docs documentation simplified](https://github.com/readthedocs/readthedocs.org/assets/4975310/eeb1865b-a859-4010-9cff-95a95d877882)

ref #10272
@humitos
Copy link
Member

humitos commented Jun 22, 2023

It would be awesome to return to this work and finish this project (cc @ericholscher). The work on this issue will allow us to use always the same parser without matter the doctool the user is using (which will give them the exact same search experience). We are pretty close now! I arrived here while doing the cleanup of the feature flags at #9779

Remember to remove the INDEX_FROM_HTML_FILES as well when doing this work. It won't be required anymore, since everybody will be indexing from HTML files.

@ericholscher
Copy link
Member

Agreed, this is a useful next step on this work. I thought we'd already finished making everything use the HTML-only parsing.

@humitos
Copy link
Member

humitos commented Jun 26, 2023

While reviewing #10042 I found the work from this issue is, in part, required to removed the Version.documentation_type attribute.

@ericholscher
Copy link
Member

We're slowly rolling this out via feature flag 👍

stsewd added a commit to readthedocs/readthedocs-sphinx-ext that referenced this issue Aug 23, 2023
We were using these for search, but we no longer need them.
We index the HTML directly now.

Ref readthedocs/readthedocs.org#10272
stsewd added a commit that referenced this issue Aug 28, 2023
Closes #10272.

We can't remove the code that generates the fjson files,
since they are being used by the embed API (v2),
that API is mentioned as deprecated,
we should do the actual deprecation by contacting
projects using it.
@stsewd stsewd moved this from In progress to Needs review in 📍Roadmap Aug 30, 2023
@github-project-automation github-project-automation bot moved this from Needs review to Done in 📍Roadmap Aug 31, 2023
stsewd added a commit that referenced this issue Aug 31, 2023
Closes #10272.

We can't remove the code that generates the fjson files,
since they are being used by the embed API (v2),
that API is mentioned as deprecated,
we should do the actual deprecation by contacting
projects using it.

---------

Co-authored-by: Manuel Kaufmann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code Operations Operations or server issue
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants
@ericholscher @humitos @benjaoming @stsewd and others