Skip to content
This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Increase the amount of data we're saving during build #62

Merged
merged 17 commits into from
Jul 2, 2019

Conversation

ericholscher
Copy link
Member

This is used for better search indexing

@ericholscher ericholscher requested a review from a team March 1, 2019 14:59
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this will be used and where. In the end we are generating a .json file here with paths, types and titles but I can't give good feedback here because I'm lacking context.

build_json = os.path.abspath(
os.path.join(app.outdir, '..', 'json')
)
outjson = os.path.join(build_json, 'readthedocs-sphinx-domain-names' + '.json')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The + can be removed.

@ericholscher
Copy link
Member Author

It's used in readthedocs/readthedocs.org#5290

Copy link
Member

@stsewd stsewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

]
# Only run JSON output once during HTML build
# This saves resources and keeps filepaths correct,
# because singlehtml filepaths are different
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, search isn't supported in singlhtml builders :)


paths = {}
for page, title in app.env.titles.items():
paths[app.builder.get_target_uri(page)] = app.env.doc2path(page, base=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge this with the other for loop, but I guess this is more readable, and not so much difference in the performance with our big servers p:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to merge the loops as well. We can also calculate the get_target_uri once only.

@humitos
Copy link
Member

humitos commented Jun 10, 2019

Is this PR still relevant? If so, what's missing?

@ericholscher
Copy link
Member Author

Yes, I just need to finish testing it.

@ericholscher
Copy link
Member Author

Alright, this should be ready for re-review.

@ericholscher ericholscher requested a review from a team June 25, 2019 23:04
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks good to me!

To be able to give better feedback, I'd like to know where this will be used (and how) so I can check that we have the data and the structure we need.

Finally, the code is hard to follow because there are some Sphinx internals that are not very explicit. So, I'd like to see some comments next to functions like app.env.doc2path or objects like app.env.domains.name/app.env.domains.type with an example of the expected result.


paths = {}
for page, title in app.env.titles.items():
paths[app.builder.get_target_uri(page)] = app.env.doc2path(page, base=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to merge the loops as well. We can also calculate the get_target_uri once only.

'_build/json/readthedocs-sphinx-domain-names.json',
[
'py:exception', 'js:class',
'"index.html": "Welcome to pyexample',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can go more strict here and compare against the full JSON file if it's not incredibly big, or at least a portion of it.

Reading the test it's hard to realize what we are expecting to be the content of that file. From the Python function it seems there is a specific structure containing types, titles and paths.

build_json = os.path.abspath(
os.path.join(app.outdir, '..', 'json')
)
outjson = os.path.join(build_json, 'readthedocs-sphinx-domain-names.json')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic to get this path could be moved to a function (on utils or similar) since we are repeating it from generate_json_artifacts.

log.exception(
'Failure in JSON search dump for page {page}'.format(page=outjson)
)


def dump_sphinx_data(app, exception):
"""
Dump a bunch of additional Sphinx data that is useful during search indexing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring could be expanded with the output structure expected of the dict mentioning what is each field and some examples about how are the values (paths --to see if they are relative/absolute, etc) for them.

log.exception(
'Failure in JSON search dump for page {page}'.format(page=outjson)
)


def dump_sphinx_data(app, exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should handle the case where exception is not None and just return.

@humitos
Copy link
Member

humitos commented Jun 26, 2019

To be able to give better feedback, I'd like to know where this will be used (and how) so I can check that we have the data and the structure we need.

Nevermind. I found the exact place:

https://github.com/rtfd/readthedocs.org/blob/20b028521d97b797474b05ebe6b7fef9d84142a8/readthedocs/projects/tasks.py#L1291-L1390

@humitos
Copy link
Member

humitos commented Jun 26, 2019

As a general question, why don't we save everything that we need in our own JSON created by this extension instead of mixing some data from here and some data parsed from objects.inv inside the Django App? I find the current approach complex.

It seems that we are processing the data in two different places when we can do it only in this extension and then only use it from the Django App to create the SphinxDomain objects.

Also, I didn't find where the paths key is used.

@ericholscher
Copy link
Member Author

Also, I didn't find where the paths key is used.

I'd like to get paths stored on the HTMLFile object I think, in order to keep track of where a file came from.

@ericholscher
Copy link
Member Author

It seems that we are processing the data in two different places when we can do it only in this extension and then only use it from the Django App to create the SphinxDomain objects.

Agreed. This is probably a better path forward. I will think about if we should do it now, or later.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enough to be merged and start using it :)

We can improve it later if we want. In particular the processing of the data in multiple places. Also, there are some refactor that can be done.

@ericholscher
Copy link
Member Author

The main thing w/ parsing the data multiple places is all the old repos that we need to support. This will only work for indexing going forward, but objects.inv will always exist.

@ericholscher ericholscher merged commit cad5845 into master Jul 2, 2019
@stsewd stsewd deleted the get-domain-data branch January 14, 2021 20:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants