Commit 05799f6

stsewd and ericholscher authored
Search: support section titles inside header tags (#9339)
Besides bare `h` headers, another convention is to put them inside a `header` tag.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/header#usage_notes

Co-authored-by: Eric Holscher <[email protected]>
1 parent fe2f79c commit 05799f6

5 files changed, +234 -29 lines changed


docs/dev/search-integration.rst

+62 -13
@@ -30,8 +30,9 @@ Read the Docs makes use of ARIA_ roles and other heuristics in order to process
 Main content node
 ~~~~~~~~~~~~~~~~~
 
-The main content node should have a main role (or a ``main`` tag), and there should only be one per page.
-This node is the one that contains all the page content. Example:
+The main content should be inside a ``<main>`` tag or an element with ``role=main``,
+and there should only be one per page.
+This node is the one that contains all the page content to be indexed. Example:
 
 .. code-block:: html
    :emphasize-lines: 10-12
@@ -55,6 +56,51 @@ This node is the one that contains all the page content. Example:
     </body>
   </html>
 
+If a main node isn't found,
+we try to infer the main node from the parent of the first section with a ``h1`` tag.
+Example:
+
+.. code-block:: html
+   :emphasize-lines: 10-20
+
+   <html>
+     <head>
+       ...
+     </head>
+     <body>
+       <div>
+         This content isn't processed
+       </div>
+
+       <div id="parent">
+         <h1>First title</h1>
+         <p>
+           The parent of the h1 title will
+           be taken as the main node,
+           this is the div tag.
+         </p>
+
+         <h2>Second title</h2>
+         <p>More content</p>
+       </div>
+     </body>
+   </html>
+
+If a section title isn't found, we default to the ``body`` tag.
+Example:
+
+.. code-block:: html
+   :emphasize-lines: 5-7
+
+   <html>
+     <head>
+       ...
+     </head>
+     <body>
+       <p>Content</p>
+     </body>
+   </html>
+
 Irrelevant content
 ~~~~~~~~~~~~~~~~~~
 
@@ -87,12 +133,15 @@ Example:
 Sections
 ~~~~~~~~
 
-Sections are ``h`` tags, and sections of the same level should be neighbors.
-Additionally, sections should have an unique ``id`` attribute per page (this is used to link to the section).
-All content below the section, till the new section will be indexed as part of the section. Example:
+Sections are composed of a title, and a content.
+A section title can be a ``h`` tag, or a ``header`` tag containing a ``h`` tag,
+the ``h`` tag or its parent can contain an ``id`` attribute, which will be used to link to the section.
+
+All content below the title, until a new section is found, will be indexed as part of the section content.
+Example:
 
 .. code-block:: html
-   :emphasize-lines: 2-10
+   :emphasize-lines: 2-10, 12-17, 21-26
 
    <div role="main">
      <h1 id="section-title">
@@ -114,17 +163,17 @@ All content below the section, till the new section will be indexed as part of t
 
      ...
 
-     <h1 id="neigbor-section">
-       This section is neighbor of "section-title"
-     </h1>
+     <header>
+       <h1 id="3">This is also a valid section title</h1>
+     </header>
      <p>
-       ...
+       This is the content of the third section.
      </p>
    </div>
 
-Sections can be inside till two nested tags (and have nested sections),
-and its immediate parent can contain the ``id`` attribute.
-Note that the section content still needs to be below the ``h`` tag. Example:
+Sections can be contained in up to two nested tags, and can contain other sections (nested sections).
+Note that the section content still needs to be below the section title.
+Example:
 
 .. code-block:: html
    :emphasize-lines: 3-11,14-21
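
The fallback order the docs change describes (an explicit main node, then the parent of the first ``h1``, then ``body``) can be sketched with the standard library alone. This is only an illustration: the real parser works on its own HTML tree, whereas ``find_main_node`` and the ``parents`` map below are hypothetical names, the input is assumed to be well-formed markup, and the relative priority of ``role=main`` versus ``<main>`` is not modelled.

```python
import xml.etree.ElementTree as ET


def find_main_node(root):
    """Return the element to index, following the documented fallback order."""
    body = root.find("body")
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in body.iter() for child in parent}

    # 1. An explicit main node: role="main" or a <main> tag.
    for node in body.iter():
        if node.get("role") == "main" or node.tag == "main":
            return node

    # 2. The parent of the first h1; if the h1 sits inside a <header>,
    #    the header counts as the title, so use the header's parent.
    first_h1 = next(body.iter("h1"), None)
    if first_h1 is not None:
        container = first_h1
        if parents[first_h1].tag == "header":
            container = parents[first_h1]
        return parents[container]

    # 3. No title to anchor on: fall back to the whole body.
    return body


page = ET.fromstring(
    "<html><body>"
    "<div>This content isn't processed</div>"
    '<div id="parent"><header><h1>First title</h1></header><p>Text</p></div>'
    "</body></html>"
)
print(find_main_node(page).get("id"))  # parent
```

Without the ``header`` step in case 2, the sketch would return the ``header`` element itself as the main node and miss the paragraph next to it.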

readthedocs/search/parsers.py

+21 -16
@@ -88,10 +88,23 @@ def _get_main_node(self, html):
         # checking for common parents between all h nodes.
         first_header = body.css_first("h1")
         if first_header:
-            return first_header.parent
+            return self._get_header_container(first_header).parent
 
         return body
 
+    def _get_header_container(self, h_tag):
+        """
+        Get the *real* container of a header tag or title.
+
+        If the parent of the ``h`` tag is a ``header`` tag,
+        then we return the ``header`` tag,
+        since the header tag acts as a container for the title of the section.
+        Otherwise, we return the tag itself.
+        """
+        if h_tag.parent.tag == "header":
+            return h_tag.parent
+        return h_tag
+
     def _parse_content(self, content):
         """Converts all new line characters and multiple spaces to a single space."""
         content = content.strip().split()
@@ -110,8 +123,6 @@ def _parse_sections(self, title, body):
         We can have pages that have content before the first title or that don't have a title,
         we index that content first under the title of the original page.
         """
-        body = self._clean_body(body)
-
         # Index content for pages that don't start with a title.
         # We check for sections till 3 levels to avoid indexing all the content
         # in this step.
@@ -135,7 +146,8 @@ def _parse_sections(self, title, body):
         for tag in tags:
             try:
                 title, id = self._parse_section_title(tag)
-                content, _ = self._parse_section_content(tag.next, depth=2)
+                next_tag = self._get_header_container(tag).next
+                content, _ = self._parse_section_content(next_tag, depth=2)
                 yield {
                     'id': id,
                     'title': title,
@@ -186,10 +198,10 @@ def _is_section(self, tag):
         """
         Check if `tag` is a section (linkable header).
 
-        The tag is a section if it's a ``h`` tag.
+        The tag is a section if it's a ``h`` or a ``header`` tag.
         """
-        is_header_tag = re.match(r'h\d$', tag.tag)
-        return is_header_tag
+        is_h_tag = re.match(r"h\d$", tag.tag)
+        return is_h_tag or tag.tag == "header"
 
     def _parse_section_title(self, tag):
         """
@@ -199,15 +211,7 @@ def _parse_section_title(self, tag):
 
         - Get the id from the node itself.
         - Get the id from the parent node.
-
-        Additionally:
-
-        - Removes permalink values
         """
-        nodes_to_be_removed = tag.css('.headerlink')
-        for node in nodes_to_be_removed:
-            node.decompose()
-
         section_id = tag.attributes.get('id', '')
         if not section_id:
             parent = tag.parent
@@ -328,6 +332,7 @@ def _process_content(self, page, content):
         title = ""
         sections = []
         if body:
+            body = self._clean_body(body)
             title = self._get_page_title(body, html) or page
             sections = self._get_sections(title=title, body=body)
         else:
@@ -417,7 +422,7 @@ def _process_fjson(self, fjson_path):
 
         if 'body' in data:
             try:
-                body = HTMLParser(data["body"])
+                body = self._clean_body(HTMLParser(data["body"]))
                 sections = self._get_sections(title=title, body=body.body)
             except Exception:
                 log.info('Unable to index sections.', path=fjson_path)
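
The two helpers changed in this file can be tried in isolation with a small stand-in for the parser's node type. In the sketch below, ``Node`` is a hypothetical dataclass that only mimics the ``tag``, ``parent`` and ``attributes`` fields the diff relies on. The function bodies follow the diff, except for the ``None`` guard and the parent-id fallback in ``section_id``: that part of ``_parse_section_title`` is cut off in the hunk, so it is an assumption based on the docstring.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML node."""
    tag: str
    parent: Optional["Node"] = None
    attributes: dict = field(default_factory=dict)


def get_header_container(h_tag):
    """Return the wrapping <header> if there is one, else the h tag itself."""
    if h_tag.parent is not None and h_tag.parent.tag == "header":
        return h_tag.parent
    return h_tag


def is_section(tag):
    """A section title is an h1..h9 tag or a <header> tag."""
    return bool(re.match(r"h\d$", tag.tag)) or tag.tag == "header"


def section_id(tag):
    """Take the id from the node itself, else from its parent (assumed fallback)."""
    own_id = tag.attributes.get("id", "")
    if not own_id and tag.parent is not None:
        return tag.parent.attributes.get("id", "")
    return own_id


body = Node("body")
header = Node("header", parent=body, attributes={"id": "intro"})
title = Node("h1", parent=header)

print(get_header_container(title).tag)  # header
print(is_section(Node("hr")))           # False
print(section_id(title))                # intro
```

Because the section content starts at the sibling after the container, a title wrapped in ``header`` needs ``get_header_container(tag).next`` rather than ``tag.next``. Inside the ``header``, the ``h1`` usually has no following sibling that holds the section body.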
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,98 @@
+<!DOCTYPE html>
+
+<!--
+Default pelican theme.
+From https://blog.getpelican.com/pelican-4.7-released.html
+-->
+
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<title>Pelican 4.7 released</title>
+<link rel="stylesheet" href="https://blog.getpelican.com/theme/css/A.main.css.pagespeed.cf.zFbdR40MwZ.css" type="text/css"/>
+<link href="https://blog.getpelican.com/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Pelican Development Blog Atom Feed"/>
+
+<!--[if IE]>
+<script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
+<![endif]-->
+</head>
+
+<body id="index" class="home">
+<header id="banner" class="body">
+<h1><a href="https://blog.getpelican.com/">Pelican Development Blog </a></h1>
+<nav><ul>
+<li class="active"><a href="https://blog.getpelican.com/category/news.html">news</a></li>
+<li><a href="https://docs.getpelican.com">documentation</a></li>
+<li><a href="https://donate.getpelican.com">contribute</a></li>
+<li><a href="/pages/gratitude.html">gratitude</a></li>
+</ul></nav>
+</header><!-- /#banner -->
+<section id="content" class="body">
+<article>
+<header>
+<h1 class="entry-title">
+<a href="https://blog.getpelican.com/pelican-4.7-released.html" rel="bookmark" title="Permalink to Pelican 4.7 released">Pelican 4.7 released</a></h1>
+</header>
+
+<div class="entry-content">
+<footer class="post-info">
+<abbr class="published" title="2021-10-01T00:00:00+02:00">
+Fri 01 October 2021
+</abbr>
+
+<address class="vcard author">
+By <a class="url fn" href="https://blog.getpelican.com/author/pelican-contributors.html">Pelican Contributors</a>
+</address>
+<p>In <a href="https://blog.getpelican.com/category/news.html">news</a>. </p>
+
+</footer><!-- /.post-info --> <p>Pelican 4.7 is now available. This new release includes the following enhancements, fixes, and tweaks:</p>
+<ul class="simple">
+<li>Improve default theme rendering on mobile and other small screen devices <a class="reference external" href="https://github.com/getpelican/pelican/pull/2914">(#2914)</a></li>
+</ul>
+<p>For more info, please refer to the release page.</p>
+<div class="section" id="upgrading-from-previous-releases">
+<h2>Upgrading from previous releases</h2>
+<p>Upgrading from Pelican 4.6.x should be smooth and require few (if any) changes to
+your environment.</p>
+<p>If you run into problems, please see the <a class="reference external" href="https://docs.getpelican.com/en/latest/contribute.html#how-to-get-help">How to Get Help</a> section
+of the documentation, and we will update this post with any upgrade tips
+contributed by the Pelican community.</p>
+</div>
+
+</div><!-- /.entry-content -->
+
+</article>
+</section>
+<section id="extras" class="body">
+<div class="blogroll">
+<h2>links</h2>
+<ul>
+<li><a href="https://docs.getpelican.com/">Pelican Docs</a></li>
+<li><a href="https://donate.getpelican.com/">Support Pelican</a></li>
+<li><a href="https://justinmayer.com/">Justin Mayer</a></li>
+</ul>
+</div><!-- /.blogroll -->
+<div class="social">
+<h2>follow</h2>
+<ul>
+<li><a href="https://blog.getpelican.com/feeds/all.atom.xml" type="application/atom+xml" rel="alternate">atom feed</a></li>
+
+<li><a href="https://twitter.com/getpelican">@getpelican</a></li>
+<li><a href="https://twitter.com/jmayer">@jmayer</a></li>
+<li><a href="https://github.com/getpelican">github</a></li>
+</ul>
+</div><!-- /.social -->
+</section><!-- /#extras -->
+
+<footer id="contentinfo" class="body">
+<address id="about" class="vcard body">
+Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
+</address><!-- /#about -->
+
+<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
+</footer><!-- /#contentinfo -->
+
+<script>(function(f,a,t,h,o,m){a[h]=a[h]||function(){(a[h].q=a[h].q||[]).push(arguments)};o=f.createElement('script'),m=f.getElementsByTagName('script')[0];o.async=1;o.src=t;o.id='fathom-script';m.parentNode.insertBefore(o,m)})(document,window,'https://stats.justinmayer.com/tracker.js','fathom');fathom('set','siteId','EWNWB');fathom('trackPageview');</script>
+<script type="text/javascript">var _gaq=_gaq||[];_gaq.push(['_setAccount','UA-295694-7']);_gaq.push(['_trackPageview']);(function(){var ga=document.createElement('script');ga.type='text/javascript';ga.async=true;ga.src=('https:'==document.location.protocol?'https://ssl':'http://www')+'.google-analytics.com/ga.js';var s=document.getElementsByTagName('script')[0];s.parentNode.insertBefore(ga,s);})();</script>
+</body>
+</html>
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
+[
+  {
+    "path": "index.html",
+    "title": "Pelican Development Blog",
+    "sections": [
+      {
+        "id": "banner",
+        "title": "Pelican Development Blog",
+        "content": ""
+      },
+      {
+        "id": "",
+        "title": "Pelican 4.7 released",
+        "content": "Fri 01 October 2021 By Pelican Contributors In news. Pelican 4.7 is now available. This new release includes the following enhancements, fixes, and tweaks: Improve default theme rendering on mobile and other small screen devices (#2914) For more info, please refer to the release page."
+      },
+      {
+        "id": "upgrading-from-previous-releases",
+        "title": "Upgrading from previous releases",
+        "content": "Upgrading from Pelican 4.6.x should be smooth and require few (if any) changes to your environment. If you run into problems, please see the How to Get Help section of the documentation, and we will update this post with any upgrade tips contributed by the Pelican community."
+      },
+      {
+        "id": "",
+        "title": "links",
+        "content": "Pelican Docs Support Pelican Justin Mayer"
+      },
+      {
+        "id": "",
+        "title": "follow",
+        "content": "atom feed @getpelican @jmayer github"
+      }
+    ],
+    "domain_data": {}
+  }
+]
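
The ``content`` strings in this expected output are single-spaced even though the source HTML spreads the same text over several lines. That matches the ``_parse_content`` docstring in ``parsers.py`` ("Converts all new line characters and multiple spaces to a single space"). A minimal sketch of that normalization follows. ``normalize_whitespace`` is a hypothetical name, and the final ``join`` is an assumption, since the diff only shows the ``strip().split()`` line.

```python
def normalize_whitespace(content: str) -> str:
    """Collapse newlines and runs of spaces into single spaces."""
    # str.split() with no arguments splits on any whitespace run,
    # so joining the pieces with one space flattens the text.
    return " ".join(content.strip().split())


raw = """
    Fri 01 October 2021

    By   Pelican Contributors
"""
print(normalize_whitespace(raw))  # Fri 01 October 2021 By Pelican Contributors
```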

readthedocs/search/tests/test_parsers.py

+19
@@ -303,3 +303,22 @@ def test_generic_simple_page(self, storage_open, storage_exists):
         parsed_json = [file.processed_json]
         expected_json = json.load(open(data_path / "generic/out/basic.json"))
         assert parsed_json == expected_json
+
+    @mock.patch.object(BuildMediaFileSystemStorage, "exists")
+    @mock.patch.object(BuildMediaFileSystemStorage, "open")
+    def test_generic_pelican_default_theme(self, storage_open, storage_exists):
+        file = data_path / "pelican/in/default/index.html"
+        storage_exists.return_value = True
+        self.version.documentation_type = GENERIC
+        self.version.save()
+
+        storage_open.side_effect = self._mock_open(file.open().read())
+        file = get(
+            HTMLFile,
+            project=self.project,
+            version=self.version,
+            path="index.html",
+        )
+        parsed_json = [file.processed_json]
+        expected_json = json.load(open(data_path / "pelican/out/default.json"))
+        assert parsed_json == expected_json
