Search: support section titles inside header tags

stsewd · stsewd · commit fd5878408930 · 2022-06-15T14:50:30.000-05:00
Besides using a bare `h` tag as a section title, another convention is to wrap the `h` tag in a `header` tag. This change makes the search parser recognize both forms. See https://developer.mozilla.org/en-US/docs/Web/HTML/Element/header#usage_notes
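
A rough sketch of the idea, for illustration only. It is not the code in this change: it uses the standard library's `xml.etree.ElementTree` on a well-formed snippet instead of the HTML parser used in `readthedocs/search/parsers.py`, and the helper names `build_parent_map`, `get_header_container`, and `next_sibling` are hypothetical. The point it shows is that, when an `h` tag sits inside a `header` tag, the section content starts after the `header`, not after the `h` tag.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup: one bare h1 title and one h1 wrapped in a header tag.
HTML = """
<div role="main">
  <h1 id="one">First</h1>
  <p>Content of the first section.</p>
  <header>
    <h1 id="two">Second</h1>
  </header>
  <p>Content of the second section.</p>
</div>
""".strip()


def build_parent_map(root):
    """ElementTree nodes have no parent pointer, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def get_header_container(h_tag, parents):
    """Return the enclosing header tag if there is one, else the h tag itself."""
    parent = parents.get(h_tag)
    if parent is not None and parent.tag == "header":
        return parent
    return h_tag


def next_sibling(node, parents):
    """Return the element that follows `node` under the same parent, or None."""
    parent = parents.get(node)
    if parent is None:
        return None
    siblings = list(parent)
    index = siblings.index(node)
    return siblings[index + 1] if index + 1 < len(siblings) else None


root = ET.fromstring(HTML)
parents = build_parent_map(root)

# Map each section id to the text of the first element after its title.
sections = {}
for h in root.iter("h1"):
    container = get_header_container(h, parents)
    following = next_sibling(container, parents)
    sections[h.get("id")] = following.text if following is not None else ""

print(sections)
```

Without the container lookup, the second title would have no following sibling inside its `header`, so its content would be missed.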
diff --git a/docs/dev/search-integration.rst b/docs/dev/search-integration.rst
@@ -30,8 +30,9 @@ Read the Docs makes use of ARIA_ roles and other heuristics in order to process
 Main content node
 ~~~~~~~~~~~~~~~~~
 
-The main content node should have a main role (or a ``main`` tag), and there should only be one per page.
-This node is the one that contains all the page content. Example:
+The main content should be inside a ``main`` tag or an element with the role ``main``,
+and there should only be one per page.
+This node is the one that contains all the page content to be indexed. Example:
 
 .. code-block:: html
    :emphasize-lines: 10-12
@@ -55,6 +56,51 @@ This node is the one that contains all the page content. Example:
       </body>
    </html>
 
+If a main node isn't found,
+we try to infer the main node from the parent of the first section with an ``h1`` tag.
+Example:
+
+.. code-block:: html
+   :emphasize-lines: 10-20
+
+   <html>
+      <head>
+         ...
+      </head>
+      <body>
+         <div>
+            This content isn't processed
+         </div>
+
+         <div id="parent">
+            <h1>First title</h1>
+            <p>
+               The parent of the h1 title will
+               be taken as the main node,
+               this is the div tag.
+            </p>
+
+            <h2>Second title</h2>
+            <p>More content</p>
+         </div>
+      </body>
+   </html>
+
+If a section title isn't found, we default to the ``body`` tag.
+Example:
+
+.. code-block:: html
+   :emphasize-lines: 5-7
+
+   <html>
+      <head>
+         ...
+      </head>
+      <body>
+         <p>Content</p>
+      </body>
+   </html>
+
 Irrelevant content
 ~~~~~~~~~~~~~~~~~~
 
@@ -87,12 +133,15 @@ Example:
 Sections
 ~~~~~~~~
 
-Sections are ``h`` tags, and sections of the same level should be neighbors.
-Additionally, sections should have an unique ``id`` attribute per page (this is used to link to the section).
-All content below the section, till the new section will be indexed as part of the section. Example:
+Sections are composed of a title and its content.
+A section title can be an ``h`` tag, or a ``header`` tag containing an ``h`` tag.
+The ``h`` tag or its parent can contain an ``id`` attribute, which will be used to link to the section.
+
+All content below the title, until a new section is found, will be indexed as part of the section content.
+Example:
 
 .. code-block:: html
-   :emphasize-lines: 2-10
+   :emphasize-lines: 2-10, 12-17, 21-26
 
    <div role="main">
       <h1 id="section-title">
@@ -114,17 +163,17 @@ All content below the section, till the new section will be indexed as part of t
 
       ...
 
-      <h1 id="neigbor-section">
-         This section is neighbor of "section-title"
-      </h1>
+      <header>
+         <h1 id="3">This is also a valid section title</h1>
+      </header>
       <p>
-         ...
+         This is the content of the third section.
       </p>
    </div>
 
-Sections can be inside till two nested tags (and have nested sections),
-and its immediate parent can contain the ``id`` attribute.
-Note that the section content still needs to be below the ``h`` tag. Example:
+Sections can be contained in up to two nested tags, and can contain other sections (nested sections).
+Note that the section content still needs to be below the section title.
+Example:
 
 .. code-block:: html
    :emphasize-lines: 3-11,14-21
diff --git a/readthedocs/search/parsers.py b/readthedocs/search/parsers.py
@@ -88,10 +88,23 @@ def _get_main_node(self, html):
         # checking for common parents between all h nodes.
         first_header = body.css_first("h1")
         if first_header:
-            return first_header.parent
+            return self._get_header_container(first_header).parent
 
         return body
 
+    def _get_header_container(self, h_tag):
+        """
+        Get the *real* container of a header tag or title.
+
+        If the parent of the ``h`` tag is a ``header`` tag,
+        then we return the ``header`` tag,
+        since the header tag acts as a container for the title of the section.
+        Otherwise, we return the tag itself.
+        """
+        if h_tag.parent.tag == "header":
+            return h_tag.parent
+        return h_tag
+
     def _parse_content(self, content):
         """Converts all new line characters and multiple spaces to a single space."""
         content = content.strip().split()
@@ -110,8 +123,6 @@ def _parse_sections(self, title, body):
         We can have pages that have content before the first title or that don't have a title,
         we index that content first under the title of the original page.
         """
-        body = self._clean_body(body)
-
         # Index content for pages that don't start with a title.
         # We check for sections till 3 levels to avoid indexing all the content
         # in this step.
@@ -135,7 +146,8 @@ def _parse_sections(self, title, body):
             for tag in tags:
                 try:
                     title, id = self._parse_section_title(tag)
-                    content, _ = self._parse_section_content(tag.next, depth=2)
+                    next_tag = self._get_header_container(tag).next
+                    content, _ = self._parse_section_content(next_tag, depth=2)
                     yield {
                         'id': id,
                         'title': title,
@@ -186,10 +198,10 @@ def _is_section(self, tag):
         """
         Check if `tag` is a section (linkeable header).
 
-        The tag is a section if it's a ``h`` tag.
+        The tag is a section if it's an ``h`` tag or a ``header`` tag.
         """
-        is_header_tag = re.match(r'h\d$', tag.tag)
-        return is_header_tag
+        is_h_tag = re.match(r"h\d$", tag.tag)
+        return is_h_tag or tag.tag == "header"
 
     def _parse_section_title(self, tag):
         """
@@ -199,15 +211,7 @@ def _parse_section_title(self, tag):
 
         - Get the id from the node itself.
         - Get the id from the parent node.
-
-        Additionally:
-
-        - Removes permalink values
         """
-        nodes_to_be_removed = tag.css('.headerlink')
-        for node in nodes_to_be_removed:
-            node.decompose()
-
         section_id = tag.attributes.get('id', '')
         if not section_id:
             parent = tag.parent
@@ -328,6 +332,7 @@ def _process_content(self, page, content):
         title = ""
         sections = []
         if body:
+            body = self._clean_body(body)
             title = self._get_page_title(body, html) or page
             sections = self._get_sections(title=title, body=body)
         else:
@@ -417,7 +422,7 @@ def _process_fjson(self, fjson_path):
 
         if 'body' in data:
             try:
-                body = HTMLParser(data["body"])
+                body = self._clean_body(HTMLParser(data["body"]))
                 sections = self._get_sections(title=title, body=body.body)
             except Exception:
                 log.info('Unable to index sections.', path=fjson_path)