Commit fd58784

Search: support section titles inside header tags
Besides standalone `h` headings, another convention is to wrap the heading in a `header` tag. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/header#usage_notes
1 parent defc159 commit fd58784

2 files changed: 83 additions and 29 deletions

docs/dev/search-integration.rst

Lines changed: 62 additions & 13 deletions
@@ -30,8 +30,9 @@ Read the Docs makes use of ARIA_ roles and other heuristics in order to process
 Main content node
 ~~~~~~~~~~~~~~~~~
 
-The main content node should have a main role (or a ``main`` tag), and there should only be one per page.
-This node is the one that contains all the page content. Example:
+The main content should be inside a ``main`` tag or an element with the role ``main``,
+and there should only be one per page.
+This node is the one that contains all the page content to be indexed. Example:
 
 .. code-block:: html
    :emphasize-lines: 10-12
@@ -55,6 +56,51 @@ This node is the one that contains all the page content. Example:
      </body>
    </html>
 
+If a main node isn't found,
+we try to infer the main node from the parent of the first section with a ``h1`` tag.
+Example:
+
+.. code-block:: html
+   :emphasize-lines: 10-20
+
+   <html>
+     <head>
+       ...
+     </head>
+     <body>
+       <div>
+         This content isn't processed
+       </div>
+
+       <div id="parent">
+         <h1>First title</h1>
+         <p>
+           The parent of the h1 title will
+           be taken as the main node,
+           this is the div tag.
+         </p>
+
+         <h2>Second title</h2>
+         <p>More content</p>
+       </div>
+     </body>
+   </html>
+
+If a section title isn't found, we default to the ``body`` tag.
+Example:
+
+.. code-block:: html
+   :emphasize-lines: 5-7
+
+   <html>
+     <head>
+       ...
+     </head>
+     <body>
+       <p>Content</p>
+     </body>
+   </html>
+
 Irrelevant content
 ~~~~~~~~~~~~~~~~~~
 
@@ -87,12 +133,15 @@ Example:
 Sections
 ~~~~~~~~
 
-Sections are ``h`` tags, and sections of the same level should be neighbors.
-Additionally, sections should have an unique ``id`` attribute per page (this is used to link to the section).
-All content below the section, till the new section will be indexed as part of the section. Example:
+Sections are composed of a title, and a content.
+A section title can be a ``h`` tag, or a ``header`` tag containing a ``h`` tag,
+the ``h`` tag or its parent can contain an ``id`` attribute, which will be used to link to the section.
+
+All content below the title, till a new section is found will be indexed as part of the section content.
+Example:
 
 .. code-block:: html
-   :emphasize-lines: 2-10
+   :emphasize-lines: 2-10, 12-17, 21-26
 
    <div role="main">
      <h1 id="section-title">
@@ -114,17 +163,17 @@ All content below the section, till the new section will be indexed as part of t
 
      ...
 
-     <h1 id="neigbor-section">
-       This section is neighbor of "section-title"
-     </h1>
+     <header>
+       <h1 id="3">This is also a valid section title</h1>
+     </header>
      <p>
-       ...
+       This is the content of the third section.
      </p>
    </div>
 
-Sections can be inside till two nested tags (and have nested sections),
-and its immediate parent can contain the ``id`` attribute.
-Note that the section content still needs to be below the ``h`` tag. Example:
+Sections can be contained in up to two nested tags, and can contain other sections (nested sections).
+Note that the section content still needs to be below the section title.
+Example:
 
 .. code-block:: html
    :emphasize-lines: 3-11,14-21
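
The section rules documented above (a title is an ``h`` tag or a ``header`` wrapping one; the content is everything after the title until the next title) can be sketched in a few lines. This is only an illustration under stated assumptions, not the Read the Docs parser: it uses the standard library's `xml.etree.ElementTree` on a well-formed fragment instead of the HTML parser used in `parsers.py`, it only looks at direct siblings of the title, and the names `iter_sections`, `header_container` and `clean_text` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

H_TAG = re.compile(r"h\d$")  # same pattern the diff uses in _is_section


def clean_text(node):
    """Collapse all whitespace in the node's text to single spaces."""
    return " ".join("".join(node.itertext()).split())


def header_container(node, parents):
    """Return the enclosing ``header`` element if there is one, else the node."""
    parent = parents.get(node)
    if parent is not None and parent.tag == "header":
        return parent
    return node


def iter_sections(root):
    """Yield a dict with id, title and content for every ``h`` tag under root."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in root.iter():
        if not H_TAG.match(node.tag):
            continue
        container = header_container(node, parents)
        parent = parents.get(container)
        if parent is None:  # the title is the root itself: nothing follows it
            continue
        siblings = list(parent)
        content = []
        # The content starts after the title container, not after the h tag.
        for sibling in siblings[siblings.index(container) + 1:]:
            if H_TAG.match(sibling.tag) or sibling.tag == "header":
                break  # a new section title ends the current section
            content.append(clean_text(sibling))
        # The id comes from the h tag itself, or else from its parent.
        section_id = node.get("id") or parents[node].get("id", "")
        yield {"id": section_id, "title": clean_text(node), "content": " ".join(content)}


SNIPPET = """
<div role="main">
  <h1 id="section-title">First title</h1>
  <p>Content of the first section.</p>
  <header>
    <h1 id="3">This is also a valid section title</h1>
  </header>
  <p>This is the content of the third section.</p>
</div>
"""

for section in iter_sections(ET.fromstring(SNIPPET.strip())):
    print(section)
```

Starting the content scan at the sibling after the title container is the point of the change: for a title wrapped in ``header``, the content follows the ``header`` element rather than the ``h`` tag inside it.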

readthedocs/search/parsers.py

Lines changed: 21 additions & 16 deletions
@@ -88,10 +88,23 @@ def _get_main_node(self, html):
         # checking for common parents between all h nodes.
         first_header = body.css_first("h1")
         if first_header:
-            return first_header.parent
+            return self._get_header_container(first_header).parent
 
         return body
 
+    def _get_header_container(self, h_tag):
+        """
+        Get the *real* container of a header tag or title.
+
+        If the parent of the ``h`` tag is a ``header`` tag,
+        then we return the ``header`` tag,
+        since the header tag acts as a container for the title of the section.
+        Otherwise, we return the tag itself.
+        """
+        if h_tag.parent.tag == "header":
+            return h_tag.parent
+        return h_tag
+
     def _parse_content(self, content):
         """Converts all new line characters and multiple spaces to a single space."""
         content = content.strip().split()
@@ -110,8 +123,6 @@ def _parse_sections(self, title, body):
         We can have pages that have content before the first title or that don't have a title,
         we index that content first under the title of the original page.
         """
-        body = self._clean_body(body)
-
         # Index content for pages that don't start with a title.
         # We check for sections till 3 levels to avoid indexing all the content
         # in this step.
@@ -135,7 +146,8 @@ def _parse_sections(self, title, body):
         for tag in tags:
             try:
                 title, id = self._parse_section_title(tag)
-                content, _ = self._parse_section_content(tag.next, depth=2)
+                next_tag = self._get_header_container(tag).next
+                content, _ = self._parse_section_content(next_tag, depth=2)
                 yield {
                     'id': id,
                     'title': title,
@@ -186,10 +198,10 @@ def _is_section(self, tag):
         """
         Check if `tag` is a section (linkable header).
 
-        The tag is a section if it's a ``h`` tag.
+        The tag is a section if it's a ``h`` or a ``header`` tag.
         """
-        is_header_tag = re.match(r'h\d$', tag.tag)
-        return is_header_tag
+        is_h_tag = re.match(r"h\d$", tag.tag)
+        return is_h_tag or tag.tag == "header"
 
     def _parse_section_title(self, tag):
         """
@@ -199,15 +211,7 @@ def _parse_section_title(self, tag):
 
         - Get the id from the node itself.
         - Get the id from the parent node.
-
-        Additionally:
-
-        - Removes permalink values
         """
-        nodes_to_be_removed = tag.css('.headerlink')
-        for node in nodes_to_be_removed:
-            node.decompose()
-
         section_id = tag.attributes.get('id', '')
         if not section_id:
             parent = tag.parent
@@ -328,6 +332,7 @@ def _process_content(self, page, content):
         title = ""
         sections = []
         if body:
+            body = self._clean_body(body)
             title = self._get_page_title(body, html) or page
             sections = self._get_sections(title=title, body=body)
         else:
@@ -417,7 +422,7 @@ def _process_fjson(self, fjson_path):
 
         if 'body' in data:
             try:
-                body = HTMLParser(data["body"])
+                body = self._clean_body(HTMLParser(data["body"]))
                 sections = self._get_sections(title=title, body=body.body)
             except Exception:
                 log.info('Unable to index sections.', path=fjson_path)
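
The fallback order for the main node that this commit documents (an explicit ``main`` tag or role, then the parent of the first ``h1`` or of its ``header`` wrapper, then ``body``) might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in `parsers.py`: it relies on the standard library's `xml.etree.ElementTree` and well-formed markup, and `find_main_node` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def find_main_node(root):
    """Return the element to index, following the documented fallback order."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    # 1. An explicit main node: a ``main`` tag or an element with role="main".
    for node in root.iter():
        if node.tag == "main" or node.get("role") == "main":
            return node

    body = root.find("body")
    if body is None:
        return None

    # 2. The parent of the first h1, skipping over a wrapping ``header`` tag.
    first_h1 = body.find(".//h1")
    if first_h1 is not None:
        container = first_h1
        if parents[first_h1].tag == "header":
            container = parents[first_h1]
        return parents[container]

    # 3. No section title at all: fall back to the body.
    return body


WITH_HEADER = ET.fromstring(
    "<html><head></head><body>"
    "<div>This content isn't processed</div>"
    "<div id='parent'><header><h1>First title</h1></header><p>Text</p></div>"
    "</body></html>"
)
NO_TITLE = ET.fromstring("<html><body><p>Content</p></body></html>")

print(find_main_node(WITH_HEADER).get("id"))
print(find_main_node(NO_TITLE).tag)
```

Without the ``header`` step, the first document would resolve to the ``header`` element itself, which holds only the title and none of the section content.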
