Search: improve parser #7233

stsewd · 2020-06-25T02:43:49Z

This will make the parser more general and match #7232 (it also includes one bug fix).

- Try the `main` tag before trying the first `h1`.
- Always inspect all headers up to 2 levels deep (this removes the need for the special case for Sphinx, where the `h` tag is inside a `div`).
- `_parse_content` now not only removes all newline characters, but also collapses multiple spaces into one.
- Remove elements with the `search` role in addition to the `navigation` role.
- The `headerlink` class doesn't need to be inside an `a` tag.
- Fix a bug where calling `.text()` on a text node returned an empty string (I was able to catch this one now that we are checking up to 2 levels).
- Increase the depth to 3 for the first section (one MkDocs theme was setting the `main` tag on a node far above the actual content).
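
To make the first and fourth items easier to picture, here is a minimal sketch of "prefer `main`, fall back to the first `h1`" and of dropping nodes by role. It is not the code from this PR: it uses the standard library's `xml.etree.ElementTree` on a well-formed snippet instead of the real HTML parser, and the names `find_main_node`, `strip_skipped_roles`, and `SKIPPED_ROLES` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Roles whose elements are dropped before indexing (hypothetical constant name).
SKIPPED_ROLES = {"navigation", "search"}


def find_main_node(root):
    """Return the first <main> element, or the first <h1> if there is none."""
    main = root.find(".//main")
    if main is not None:
        return main
    # Simplification for this sketch: return the heading itself as the fallback.
    return root.find(".//h1")


def strip_skipped_roles(node):
    """Detach every descendant whose role attribute is in SKIPPED_ROLES."""
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(node.iter()):
        for child in list(parent):
            if child.get("role") in SKIPPED_ROLES:
                parent.remove(child)


page = ET.fromstring(
    "<html><body>"
    "<nav role='navigation'>Menu</nav>"
    "<main><h1>Title</h1><form role='search'>Find</form><p>Body text</p></main>"
    "</body></html>"
)
main_node = find_main_node(page)
strip_skipped_roles(main_node)
remaining_tags = [child.tag for child in main_node]
```

In this sketch the `form` with the `search` role is removed from the selected `main` node, while the heading and paragraph stay available for indexing.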

This doesn't change the current indexing, although we may now index more content on pages that had a top-level text node (previously, calling `.text()` on it returned an empty string).

stsewd · 2020-06-25T02:46:44Z

readthedocs/search/tests/data/mkdocs/out/mkdocs-1.1.json

@@ -6,7 +6,7 @@
      {
        "id": "mkdocs",
        "title": "MkDocs",
-        "content": "Project documentation with\u00a0Markdown."
+        "content": "Project documentation with Markdown."


No more weird chars now that we are stripping all white spaces :D
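
As a rough illustration of why the `\u00a0` disappears from the test data, the whitespace handling can be sketched like this. The function name `normalize_whitespace` is hypothetical and this is not the actual body of `_parse_content`; it only relies on the fact that Python's `str.split()` without arguments treats the no-break space as whitespace.

```python
def normalize_whitespace(content: str) -> str:
    """Collapse each run of whitespace into a single space and trim both ends."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes newlines, tabs, and the no-break space U+00A0.
    return " ".join(content.split())


cleaned = normalize_whitespace("Project documentation with\u00a0Markdown.\n")
print(cleaned)
```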

Now that we prioritize the `main` tag as the main node, the main node from the MkDocs Material theme is wider.

stale · 2020-08-16T17:58:56Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ericholscher

This seems useful, sorry it sat for so long 👍

stsewd force-pushed the improve-parser branch from 627b2a2 to 316b5aa June 25, 2020 02:46

stsewd requested review from ericholscher and a team June 25, 2020 02:49

stsewd added 3 commits June 29, 2020 11:26

Increase depth

f81360a

Now that we prioritize the `main` tag as the main node, the main node from the MkDocs Material theme is wider.

Strip spaces

5995b71

Merge branch 'master' into improve-parser

b056fc5

stale bot added the "Status: stale" (Issue will be considered inactive soon) label Aug 16, 2020

stsewd removed the "Status: stale" (Issue will be considered inactive soon) label Aug 16, 2020

ericholscher approved these changes Aug 17, 2020

Merge branch 'master' into improve-parser

fe61e76

stsewd merged commit 4638167 into master Aug 17, 2020

stsewd deleted the improve-parser branch August 17, 2020 19:00
