Search: recursively parse sections #7207

stsewd · 2020-06-18T23:48:10Z

This is on top of #7204

If we have a structure like

parent
- content
- content
  - h1
  - content
- content
  - h2
  - content

And we start indexing from `parent`,
we index all of its children in the first step,
and then index each header again later. That is, the content is duplicated.

This is solved by checking whether a node contains a section, looking one level deep.
In this example, the first pass stops when it finds the first h1,
so no content is duplicated. The following nodes are then indexed as usual.

We can also increase the depth of this check when parsing all sections. That way we no longer rely on the div that Sphinx uses to enclose a section, and we avoid indexing duplicated content when other themes don't follow the same structure.
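
To make the idea concrete, here is a minimal sketch of the depth-limited check on a toy tree. It is not the parser code from this PR: `Node`, `is_section`, `collect_until_section` and `index_sections` are hypothetical names used only for illustration, and the tree mirrors the structure shown above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Node:
    """Toy stand-in for an HTML node (hypothetical, not the real parser's type)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def is_section(node: Node, depth: int = 1) -> bool:
    """Return True if node is a header or contains one within `depth` levels."""
    if node.tag in HEADERS:
        return True
    if depth <= 0:
        return False
    return any(is_section(child, depth - 1) for child in node.children)


def text_of(node: Node) -> List[str]:
    """Collect the text of a node and all of its descendants."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(text_of(child))
    return parts


def collect_until_section(nodes: List[Node], depth: int = 1) -> List[str]:
    """Gather text from nodes, stopping at the first one that starts a section."""
    parts: List[str] = []
    for node in nodes:
        if is_section(node, depth):
            break
        parts.extend(text_of(node))
    return parts


def index_sections(parent: Node, depth: int = 1) -> List[Tuple[str, List[str]]]:
    """Index the content before the first section, then each header's content."""
    sections = [("", collect_until_section(parent.children, depth))]

    def walk(node: Node) -> None:
        for i, child in enumerate(node.children):
            if child.tag in HEADERS:
                following = node.children[i + 1:]
                sections.append((child.text, collect_until_section(following, depth)))
            else:
                walk(child)

    walk(parent)
    return sections


# Same shape as the example structure: parent -> content, content(h1, content), content(h2, content)
parent = Node("div", children=[
    Node("p", "intro"),
    Node("div", children=[Node("h1", "Title"), Node("p", "first")]),
    Node("div", children=[Node("h2", "Sub"), Node("p", "second")]),
])

print(index_sections(parent))           # depth=1: each piece of text appears once
print(index_sections(parent, depth=0))  # no look-ahead: the first entry swallows everything
```

With `depth=0` the first pass does not look inside the wrapper nodes, so it collects the text under both headers and that text is then indexed a second time when the headers are processed. With `depth=1` the first pass stops at the wrapper that contains the h1.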

A real example of this is https://github.com/readthedocs/readthedocs.org/blob/a0d645c9b561c0189ba0956a1554f577c413ecdf/readthedocs/search/tests/data/mkdocs/in/gitbook/index.html (from #7208)

ericholscher

This could use a test to show exactly how it is working.

stsewd · 2020-06-23T00:13:55Z

The gitbook theme at https://github.com/readthedocs/readthedocs.org/blob/a0d645c9b561c0189ba0956a1554f577c413ecdf/readthedocs/search/tests/data/mkdocs/in/gitbook/index.html is a test case for this. I just wanted to put this logic in a separate PR to keep the other one from getting more complex; without this logic, the tests on the other PR fail.

stsewd mentioned this pull request Jun 19, 2020

Search: index from html files for mkdocs projects #7208

Merged

stsewd requested review from ericholscher and a team June 19, 2020 00:36

ericholscher approved these changes Jun 22, 2020

Base automatically changed from more-general-parser to master June 23, 2020 01:51

stsewd force-pushed the recursive-parser branch from 9f5239e to 24aa02a Compare June 23, 2020 02:04

stsewd merged commit ec9022c into master Jun 23, 2020

stsewd deleted the recursive-parser branch June 23, 2020 02:27
