Commit 9d5387d

Search: index from html files for mkdocs projects (#7208)
* Search: index from html files for mkdocs projects

It's under a feature flag currently, since I would like to test this with sphinx later.

Tested with all themes available in https://github.com/mkdocs/mkdocs/wiki/MkDocs-Themes. I did a search in all of them to check that sections are correctly indexed, and that irrelevant content isn't indexed.

I wasn't able to build these, but I checked the html, and it's very similar to the default mkdocs theme, so it should work:

- https://github.com/michaeltlombardi/mkdocs-psinder
- https://github.com/byrnereese/mkdocs-bootstrap4
- https://github.com/daizutabi/mkdocs-ivory

These themes don't expose search, but the content is indexed and we show results in the dashboard:

- https://gitlab.com/lramage/mkdocs-bootstrap386
- https://gitlab.com/lramage/mkdocs-gitbook-theme

Things to do later: write a guide with recommendations for authors of static site generators and themes on integrating nicely with Read the Docs.

Other problems I found: we don't override the search for https://github.com/squidfunk/mkdocs-material (they don't use the identifier we use to override search). That isn't related to this change, but I wanted to mention it. We can send a patch to fix that.

You can check the results either by building mkdocs projects and searching (you need to enable the mkdocs search feature flag and the index from html feature flag), or by opening the html files and the json files from the tests and checking that the indexed content makes sense. Note that I removed some content from the original html files to make the json files easier to read, since json doesn't support multi-line strings.

* Add note about circular import
1 parent ec9022c commit 9d5387d

19 files changed: +4046 -3 lines changed

readthedocs/projects/models.py

+5
@@ -1540,6 +1540,7 @@ def add_features(sender, **kwargs):
     DEDUPLICATE_BUILDS = 'deduplicate_builds'
     USE_SPHINX_RTD_EXT_LATEST = 'rtd_sphinx_ext_latest'
     DEFAULT_TO_FUZZY_SEARCH = 'default_to_fuzzy_search'
+    INDEX_FROM_HTML_FILES = 'index_from_html_files'
 
     FEATURES = (
         (USE_SPHINX_LATEST, _('Use latest version of Sphinx')),
@@ -1661,6 +1662,10 @@ def add_features(sender, **kwargs):
             DEFAULT_TO_FUZZY_SEARCH,
             _('Default to fuzzy search for simple search queries'),
         ),
+        (
+            INDEX_FROM_HTML_FILES,
+            _('Index content directly from html files instead or relying in other sources'),
+        ),
     )
 
     projects = models.ManyToManyField(
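
The new feature id only takes effect where code checks it with `has_feature`, as the parser change in this commit does. A minimal sketch of that gating, assuming a hypothetical `FakeProject` stand-in (not the real Django model) and a hypothetical `choose_indexing_strategy` helper:

```python
from dataclasses import dataclass, field

# Feature id as added to the Feature model in this commit.
INDEX_FROM_HTML_FILES = 'index_from_html_files'


@dataclass
class FakeProject:
    """Hypothetical stand-in for a project; only mimics ``has_feature``."""

    enabled_features: set = field(default_factory=set)

    def has_feature(self, feature_id):
        return feature_id in self.enabled_features


def choose_indexing_strategy(project):
    """Return which parse path a flag check like the one in the parser selects."""
    if project.has_feature(INDEX_FROM_HTML_FILES):
        return 'html'
    return 'index_file'


default_project = FakeProject()
flagged_project = FakeProject({INDEX_FROM_HTML_FILES})
print(choose_indexing_strategy(default_project))  # index_file
print(choose_indexing_strategy(flagged_project))  # html
```

Projects without the flag keep the existing json index path, so the change is opt-in per project.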

readthedocs/search/parsers.py

+110-1
@@ -21,6 +21,68 @@ def __init__(self, version):
         self.project = self.version.project
         self.storage = get_storage_class(settings.RTD_BUILD_MEDIA_STORAGE)()
 
+    def _get_page_content(self, page):
+        """Gets the page content from storage."""
+        content = None
+        try:
+            storage_path = self.project.get_storage_path(
+                type_='html',
+                version_slug=self.version.slug,
+                include_file=False,
+            )
+            file_path = self.storage.join(storage_path, page)
+            with self.storage.open(file_path, mode='r') as f:
+                content = f.read()
+        except Exception:
+            log.warning(
+                'Unhandled exception during search processing file: %s',
+                page,
+            )
+        return content
+
+    def _get_page_title(self, body, html):
+        """
+        Gets the title from the html page.
+
+        The title is the first section in the document,
+        falling back to the ``title`` tag.
+        """
+        first_header = body.css_first('h1')
+        if first_header:
+            title, _ = self._parse_section_title(first_header)
+            return title
+
+        title = html.css_first('title')
+        if title:
+            return self._parse_content(title.text())
+
+        return None
+
+    def _get_main_node(self, html):
+        """
+        Gets the main node from where to start indexing content.
+
+        The main node is tested in the following order:
+
+        - Try with a tag with the ``main`` role.
+          This role is used by several static sites and themes.
+        - Try the first ``h1`` node and return its parent
+          Usually all sections are neighbors,
+          so they are children of the same parent node.
+        """
+        body = html.body
+        main_node = body.css_first('[role=main]')
+        if main_node:
+            return main_node
+
+        # TODO: this could be done in smarter way,
+        # checking for common parents between all h nodes.
+        first_header = body.css_first('h1')
+        if first_header:
+            return first_header.parent
+
+        return None
+
     def _parse_content(self, content):
         """Removes new line characters and strips all whitespaces."""
         content = content.strip().split('\n')
@@ -404,9 +466,56 @@ def _parse_domain_tag(self, tag):
 
 class MkDocsParser(BaseParser):
 
-    """MkDocs parser, it relies on the json index files."""
+    """
+    MkDocs parser.
+
+    Index from the json index file or directly from the html content.
+    """
 
     def parse(self, page):
+        # Avoid circular import
+        from readthedocs.projects.models import Feature
+        if self.project.has_feature(Feature.INDEX_FROM_HTML_FILES):
+            return self.parse_from_html(page)
+        return self.parse_from_index_file(page)
+
+    def parse_from_html(self, page):
+        try:
+            content = self._get_page_content(page)
+            if content:
+                return self._process_content(page, content)
+        except Exception as e:
+            log.info('Failed to index page %s, %s', page, str(e))
+        return {
+            'path': page,
+            'title': '',
+            'sections': [],
+            'domain_data': {},
+        }
+
+    def _process_content(self, page, content):
+        """Parses the content into a structured dict."""
+        html = HTMLParser(content)
+        body = self._get_main_node(html)
+        title = ""
+        sections = []
+        if body:
+            title = self._get_page_title(body, html) or page
+            sections = list(self._parse_sections(title, body))
+        else:
+            log.info(
+                'Page doesn\'t look like it has valid content, skipping. '
+                'page=%s',
+                page,
+            )
+        return {
+            'path': page,
+            'title': title,
+            'sections': sections,
+            'domain_data': {},
+        }
+
+    def parse_from_index_file(self, page):
         storage_path = self.project.get_storage_path(
             type_='html',
             version_slug=self.version.slug,
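
The selection order documented in `_get_main_node` (an element with `role="main"` first, then the parent of the first `h1`) can be tried outside the codebase. The sketch below is an approximation, not the code from this commit: it swaps the diff's `HTMLParser`/`css_first` calls for Python's standard-library `html.parser` plus a small hand-rolled tree, and it searches the whole document rather than only `body`. `Node`, `TreeBuilder`, `find_first`, and `get_main_node` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must never become the "current" node.
VOID_TAGS = {
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
}


class Node:
    """Hypothetical minimal element node (not the parser used in the diff)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []

    def iter(self):
        """Yield this node and its descendants in document order."""
        yield self
        for child in self.children:
            yield from child.iter()


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects; text content is ignored in this sketch."""

    def __init__(self):
        super().__init__()
        self.root = Node('#document', [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def find_first(root, predicate):
    """Return the first node in document order matching predicate, else None."""
    return next((n for n in root.iter() if predicate(n)), None)


def get_main_node(html_text):
    """Mirror the documented order: role=main first, then the first h1's parent."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    main = find_first(builder.root, lambda n: n.attrs.get('role') == 'main')
    if main is not None:
        return main
    first_h1 = find_first(builder.root, lambda n: n.tag == 'h1')
    return first_h1.parent if first_h1 is not None else None


page = '<body><nav><h1>Site</h1></nav><div role="main"><h1>Page</h1></div></body>'
print(get_main_node(page).tag)  # div
```

When neither a `role="main"` element nor an `h1` exists, the sketch returns `None`, which corresponds to the "doesn't look like it has valid content" branch in `_process_content`.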
@@ -0,0 +1,121 @@
+<!DOCTYPE html>
+
+<!--
+Gitbook them https://gitlab.com/lramage/mkdocs-gitbook-theme
+From https://lramage.gitlab.io/mkdocs-gitbook-theme/
+-->
+
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
+<title>Mkdocs - GitBook Theme - Mkdocs - GitBook Theme</title>
+<meta http-equiv="X-UA-Compatible" content="IE=edge">
+
+<meta name="generator" content="mkdocs-1.1.2, mkdocs-gitbook-1.0.7">
+
+<link rel="shortcut icon" href="./images/favicon.ico" type="image/x-icon">
+<meta name="HandheldFriendly" content="true"/>
+<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
+<meta name="apple-mobile-web-app-capable" content="yes">
+<meta name="apple-mobile-web-app-status-bar-style" content="black">
+<meta rel="next" href="" />
+<link href="./css/style.min.css" rel="stylesheet">
+</head>
+
+<body>
+<div class="book">
+<div class="book-summary">
+
+<nav role="navigation">
+<ul class="summary">
+<li>
+<a href="." target="_blank" class="custom-link">Mkdocs - GitBook Theme</a>
+</li>
+<li class="divider"></li>
+<li class="chapter active" data-path="">
+<a href=".">Mkdocs - GitBook Theme</a>
+<li class="header">Post</li>
+
+<li>
+<a href="post/2015-10-30/" class="">Oldest Post</a>
+</li>
+
+<li>
+<a href="post/2018-12-31/" class="">Older Post</a>
+</li>
+
+<li>
+<a href="post/2019-01-02/" class="">Latest Post</a>
+</li>
+
+<li class="divider"></li>
+
+
+
+<li><a href="http://www.mkdocs.org">
+Published with MkDocs
+</a></li>
+
+<li><a href="https://github.com/GitbookIO/theme-default">
+Theme by GitBook
+</a></li>
+</ul>
+
+</nav>
+
+</div> <!-- end of book-summary -->
+
+<div class="book-body">
+<div class="body-inner">
+<div class="book-header" role="navigation">
+
+<!-- Title -->
+<h1>
+<i class="fa fa-circle-o-notch fa-spin"></i>
+<a href="." ></a>
+</h1>
+
+</div> <!-- end of book-header -->
+
+<div class="page-wrapper" tabindex="-1" role="main">
+<div class="page-inner">
+
+<section class="normal markdown-section">
+
+<h1 id="mkdocs-gitbook-theme">Mkdocs - GitBook Theme</h1>
+<p><a href="LICENSE"><img alt="Apache 2.0 License" src="https://img.shields.io/badge/license-Apache--2.0-blue.svg?style=flat-square" /></a>
+<a href="https://pypi.python.org/pypi/mkdocs-gitbook"><img alt="PyPI" src="https://img.shields.io/pypi/v/mkdocs-gitbook.svg?style=flat-square" /></a></p>
+<h2 id="installation">Installation</h2>
+<p>First, install the package via PyPI:</p>
+<pre><code class="sh">pip install mkdocs-gitbook
+</code></pre>
+
+<p>Then include the theme in your <code>mkdocs.yml</code> file:</p>
+<pre><code class="yaml">theme:
+    name: gitbook
+</code></pre>
+
+<h2 id="motivation">Motivation</h2>
+<p>Gitbook was a static-site generator written in JavaScript.</p>
+<p>Mkdocs is a static-site generator written in Python.</p>
+<p><strong>Gitbook is <a href="https://docs.gitbook.com/v2-changes/important-differences#cli-toolchain">no longer a static-site generator</a>, <a href="https://docs.gitbook.com/v2-changes/important-differences#git-hosting-and-integration">nor does it use git</a>, nor is it <a href="https://www.gnu.org/philosophy/free-sw.html">free</a> or <a href="https://opensource.org/osd">open source</a>!</strong></p>
+<h2 id="screenshot">Screenshot</h2>
+<p><a href="https://gitlab.com/lramage/mkdocs-gitbook-theme"><img src="img/screenshot.png" alt="Default theme for GitBook for Mkdocs"></a></p>
+<h2 id="license">License</h2>
+<p>SPDX-License-Identifier: <a href="https://spdx.org/licenses/Apache-2.0">Apache-2.0</a></p>
+
+
+</section>
+
+</div> <!-- end of page-inner -->
+</div> <!-- end of page-wrapper -->
+
+</div> <!-- end of body-inner -->
+
+</div> <!-- end of book-body -->
+<script src="./js/main.js"></script>
+<script src="./js/gitbook.min.js"></script>
+<script src="./js/theme.min.js"></script>
+</body>
+</html>
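
This fixture shows why the parser checks for `role="main"` before falling back to the first `h1`: the first `h1` in document order is the empty one inside the `book-header` navigation block, while the real page title sits inside the `role="main"` wrapper. A rough sketch of that observation, using Python's standard-library `html.parser` rather than the parser from this commit, on a trimmed excerpt of the fixture (`H1Collector` and `collect_h1` are hypothetical names):

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the open-tag stack.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class H1Collector(HTMLParser):
    """Records, for every h1, whether it sits inside a role="main" element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, has_main_role)
        self.headers = []    # [inside_main, text] per h1, in document order
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.open_tags.append((tag, dict(attrs).get('role') == 'main'))
        if tag == 'h1':
            inside_main = any(is_main for _, is_main in self.open_tags)
            self.headers.append([inside_main, ''])
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _ in self.open_tags]:
            return  # stray closing tag; ignore it
        # Pop until the matching open tag has been removed.
        while self.open_tags.pop()[0] != tag:
            pass
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headers[-1][1] += data


def collect_h1(html_text):
    """Return (inside_main, stripped_text) for each h1 in document order."""
    collector = H1Collector()
    collector.feed(html_text)
    collector.close()
    return [(inside, text.strip()) for inside, text in collector.headers]


# Trimmed excerpt of the fixture above (the page-inner wrapper is omitted).
EXCERPT = """
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i>
<a href="." ></a>
</h1>
</div>
<div class="page-wrapper" tabindex="-1" role="main">
<section class="normal markdown-section">
<h1 id="mkdocs-gitbook-theme">Mkdocs - GitBook Theme</h1>
</section>
</div>
"""

for inside_main, text in collect_h1(EXCERPT):
    print(inside_main, repr(text))
```

The first `h1` is reported as outside the main region with empty text and the second as inside it with the page title, so a title lookup scoped to the main node avoids picking up the empty header.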
