Skip to content

EmbedAPIv3: updates after QA with sphinx-hoverxref #8521

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Sep 27, 2021
65 changes: 34 additions & 31 deletions readthedocs/embed/v3/views.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@
import requests

from selectolax.parser import HTMLParser
from pyquery import PyQuery as PQ # noqa

from django.conf import settings
from django.core.cache import cache
Expand Down Expand Up @@ -71,12 +70,15 @@ def _download_page_content(self, url):

response = requests.get(url, timeout=settings.RTD_EMBED_API_DEFAULT_REQUEST_TIMEOUT)
if response.ok:
# NOTE: we use ``response.content`` to get its binary
# representation. Then ``selectolax`` is in charge to auto-detect
# its encoding. We trust more in selectolax for this than in requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we trust it more? Is there a specific issue we've seen? I'm guessing that requests has a lot more eyes on it, but perhaps we're doing something different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't have a good amount of data tested yet. However, I've already find myself doing the solution from this SO question: https://stackoverflow.com/a/52615216

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text

Then, I read that selectolax also uses the meta-tags from the HTML (https://selectolax.readthedocs.io/en/latest/parser.html#selectolax.parser.HTMLParser) to guess the encoding and it always worked without manual interaction as with requests.

Anyways, I don't have strong opinions here. I suppose whatever we choose here, it will fail in some case that the other library wouldn't 🙃

cache.set(
cache_key,
response.text,
response.content,
timeout=settings.RTD_EMBED_API_PAGE_CACHE_TIMEOUT,
)
return response.text
return response.content

def _get_page_content_from_storage(self, project, version_slug, filename):
version = get_object_or_404(
Expand Down Expand Up @@ -138,7 +140,11 @@ def _parse_based_on_doctool(self, page_content, fragment, doctool, doctoolversio

node = None
if fragment:
selector = f'#{fragment}'
# NOTE: we use the `[id=]` selector because using `#{id}` requires
# escaping the selector since CSS does not support the same
# characters as the `id=` HTML attribute
# https://www.w3.org/TR/CSS21/syndata.html#value-def-identifier
selector = f'[id="{fragment}"]'
node = HTMLParser(page_content).css_first(selector)
else:
html = HTMLParser(page_content)
Expand All @@ -150,7 +156,10 @@ def _parse_based_on_doctool(self, page_content, fragment, doctool, doctoolversio
if doctool == 'sphinx':
# Handle ``dt`` special cases
if node.tag == 'dt':
if 'glossary' in node.parent.attributes.get('class'):
if any([
'glossary' in node.parent.attributes.get('class'),
'citation' in node.parent.attributes.get('class'),
]):
# Sphinx HTML structure for term glossary puts the ``id`` in the
# ``dt`` element with the title of the term. In this case, we
# return the parent node which contains the definition list
Expand All @@ -163,32 +172,26 @@ def _parse_based_on_doctool(self, page_content, fragment, doctool, doctoolversio
# ...
# </dl>

# TODO: figure it out if it's needed to remove the siblings here
# parent = node.parent
# for n in parent.traverse():
# if n not in (node, node.next):
# n.remove()
node = node.parent

elif 'citation' in node.parent.attributes.get('class'):
# Sphinx HTML structure for sphinxcontrib-bibtex puts the ``id`` in the
# ``dt`` element with the title of the cite. In this case, we
# return the parent node which contains the definition list
# and remove all ``dt/dd`` that are not the requested one

# Structure:
# <dl class="citation">
# <dt id="cite-id"><span><a>Title of the cite</a></span></dt>
# <dd>Content of the cite</dd>
# ...
# </dl>

# TODO: figure it out if it's needed to remove the siblings here
# parent = node.parent
# for n in parent.traverse():
# if n not in (node, node.next):
# n.remove()
node = node.parent
parent_node = node.parent
if 'glossary' in node.parent.attributes.get('class'):
next_node = node.next

elif 'citation' in node.parent.attributes.get('class'):
next_node = node.next.next

# Iterate over all the siblings (``.iter()``) of the parent
# node and remove ``dt`` and ``dd`` that are not the ones
# we are looking for. Then return the parent node as
# result.
#
# Note that ``.iter()`` returns a generator and we modify
# the HTML in-place, so we have to convert it to a list
# before removing elements. Otherwise we break the
# iteration before completing it
for n in list(parent_node.iter()): # pylint: disable=invalid-name
if n not in (node, next_node):
n.remove()
node = parent_node

else:
# Sphinx HTML structure for definition list puts the ``id``
Expand Down
2 changes: 2 additions & 0 deletions readthedocs/settings/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -818,6 +818,8 @@ def DOCKER_LIMITS(self):
r'docs\.python\.org',
r'docs\.scipy\.org',
r'docs\.sympy\.org',
r'www.sphinx-doc.org',
r'numpy\.org',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely something that should be defined in the DB, or somewhere else that doesn't require a deploy to modify. It's definitely something we're going to need to iterate on pretty frequently I'm guessing, and perhaps even support customization at a project or organization level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. We do have an issue opened for Django settings in the DB at https://github.com/readthedocs/readthedocs-ops/issues/1004 but I don't think that will progress too much. If that's the case, we can create a simple EmbedDomain model that we can relate to a Project and/or Organization as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #8530 to continue the conversation there.

]
RTD_EMBED_API_PAGE_CACHE_TIMEOUT = 5 * 10
RTD_EMBED_API_DEFAULT_REQUEST_TIMEOUT = 1
Expand Down
7 changes: 6 additions & 1 deletion readthedocs/settings/docker_compose.py
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,12 @@ def RTD_EXT_THEME_DEV_SERVER(self):
@property
def RTD_EMBED_API_EXTERNAL_DOMAINS(self):
domains = super().RTD_EMBED_API_EXTERNAL_DOMAINS
domains.append(r'.*\.readthedocs\.io')
domains.extend([
r'.*\.readthedocs\.io',
r'.*\.org\.readthedocs\.build',
r'.*\.readthedocs-hosted\.com',
r'.*\.com\.readthedocs\.build',
])
return domains

@property
Expand Down
3 changes: 2 additions & 1 deletion tox.embedapi.ini
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
[tox]
envlist = sphinx-{18,20,21,22,23,24,30,31,32,33,34,35,40,41,latest}
envlist = sphinx-{18,20,21,22,23,24,30,31,32,33,34,35,40,41,42,latest}

[testenv]
description = run test suite for the EmbedAPIv3
Expand Down Expand Up @@ -27,6 +27,7 @@ deps =
sphinx-35: Sphinx~=3.5.0
sphinx-40: Sphinx~=4.0.0
sphinx-41: Sphinx~=4.1.0
sphinx-42: Sphinx~=4.2.0
sphinx-latest: Sphinx
setenv =
DJANGO_SETTINGS_MODULE=readthedocs.settings.test
Expand Down