Increase the amount of data we're saving during build #62

ericholscher · 2019-02-28T19:42:17Z

This is used for better search indexing

humitos

I'm not sure how this will be used and where. In the end we are generating a .json file here with paths, types and titles but I can't give good feedback here because I'm lacking context.

humitos · 2019-03-04T10:23:34Z

readthedocs_ext/readthedocs.py

+        build_json = os.path.abspath(
+            os.path.join(app.outdir, '..', 'json')
+        )
+        outjson = os.path.join(build_json, 'readthedocs-sphinx-domain-names' + '.json')


The + string concatenation can be removed; 'readthedocs-sphinx-domain-names.json' can be a single literal.

ericholscher · 2019-03-04T12:57:23Z

It's used in readthedocs/readthedocs.org#5290

stsewd

The implementation looks correct to me. We should at least write a test like this one: https://github.com/rtfd/readthedocs-sphinx-ext/blob/f1a01c51c675d36ac365162ea06814544c2aa410/tests/test_integration.py#L65-L73

stsewd · 2019-03-04T16:28:51Z

readthedocs_ext/readthedocs.py

+]
+# Only run JSON output once during HTML build
+# This saves resources and keeps filepaths correct,
+# because singlehtml filepaths are different


Also, search isn't supported in singlehtml builders :)
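
For illustration, the guard described by that code comment could be sketched as below. This is only a sketch: the JSON_DUMP_BUILDERS list and the should_dump_json helper are hypothetical names, and the builder names in the list are an assumption (the extension's real list may differ). It relies only on the builder exposing a name attribute.

```python
from types import SimpleNamespace

# Hypothetical allow-list for this sketch; the real list in the extension may differ.
JSON_DUMP_BUILDERS = ['html', 'dirhtml']


def should_dump_json(app):
    """Return True only for builders whose file paths match the HTML site."""
    return app.builder.name in JSON_DUMP_BUILDERS


# Stand-ins for a Sphinx application object, used only to exercise the guard.
html_app = SimpleNamespace(builder=SimpleNamespace(name='html'))
single_app = SimpleNamespace(builder=SimpleNamespace(name='singlehtml'))
```

With a check like this, a singlehtml run would skip the dump entirely, which saves work and avoids writing mismatched file paths.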

stsewd · 2019-03-04T16:43:26Z

readthedocs_ext/readthedocs.py

+
+        paths = {}
+        for page, title in app.env.titles.items():
+            paths[app.builder.get_target_uri(page)] = app.env.doc2path(page, base=None)


We could merge this with the other for loop, but I guess this is more readable, and the performance difference is small on our big servers :p

I think it's better to merge the loops as well. We can also calculate the get_target_uri once only.

humitos · 2019-06-10T12:23:53Z

Is this PR still relevant? If so, what's missing?

ericholscher · 2019-06-10T22:01:22Z

Yes, I just need to finish testing it.

ericholscher · 2019-06-25T23:03:52Z

Alright, this should be ready for re-review.

humitos

Logic looks good to me!

To be able to give better feedback, I'd like to know where this will be used (and how) so I can check that we have the data and the structure we need.

Finally, the code is hard to follow because it relies on some Sphinx internals that are not very explicit. I'd like to see comments next to functions like app.env.doc2path and objects like app.env.domains.name/app.env.domains.type, each with an example of the expected result.

humitos · 2019-06-26T11:33:55Z

readthedocs_ext/readthedocs.py

+
+        paths = {}
+        for page, title in app.env.titles.items():
+            paths[app.builder.get_target_uri(page)] = app.env.doc2path(page, base=None)


I think it's better to merge the loops as well. We can also calculate the get_target_uri once only.
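
A minimal sketch of the merged loop, assuming titles and paths are both keyed by the target URI (the test in this PR suggests titles are keyed by "index.html"). The collect_titles_and_paths helper, the FakeTitle class and the fake_app object are hypothetical stand-ins so the example runs without Sphinx; only app.env.titles, app.env.doc2path and app.builder.get_target_uri come from the diff.

```python
from types import SimpleNamespace


def collect_titles_and_paths(app):
    """Build both mappings in one pass, resolving each target URI once."""
    titles = {}
    paths = {}
    for page, title in app.env.titles.items():
        uri = app.builder.get_target_uri(page)  # computed once per page
        titles[uri] = title.astext()
        paths[uri] = app.env.doc2path(page, base=None)
    return titles, paths


class FakeTitle:
    """Stand-in for a docutils title node, which exposes astext()."""

    def __init__(self, text):
        self.text = text

    def astext(self):
        return self.text


# Hypothetical application object, only for exercising the loop.
fake_app = SimpleNamespace(
    env=SimpleNamespace(
        titles={'index': FakeTitle('Welcome'), 'api/usage': FakeTitle('Usage')},
        doc2path=lambda page, base=None: page + '.rst',
    ),
    builder=SimpleNamespace(get_target_uri=lambda page: page + '.html'),
)

titles, paths = collect_titles_and_paths(fake_app)
```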

humitos · 2019-06-26T11:41:34Z

tests/test_integration.py

+            '_build/json/readthedocs-sphinx-domain-names.json',
+            [
+                'py:exception', 'js:class',
+                '"index.html": "Welcome to pyexample',


I think we can be stricter here and compare against the full JSON file if it's not incredibly big, or at least against a portion of it.

Reading the test, it's hard to tell what we expect the content of that file to be. From the Python function, it seems there is a specific structure containing types, titles and paths.
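
As a rough sketch of the stricter comparison: load the dumped file and compare the whole structure instead of searching for substrings. The assert_json_file_equals helper is hypothetical, and every value in expected is made up for the example (including the shape of the types values); the real file would come from the test project's build.

```python
import json
import os
import tempfile


def assert_json_file_equals(path, expected):
    """Load a JSON file and compare it to the full expected structure."""
    with open(path) as handle:
        actual = json.load(handle)
    assert actual == expected, 'Unexpected content in {}'.format(path)
    return actual


# Illustrative data only; the real values depend on the test project.
expected = {
    'types': {'py:exception': 'exception', 'js:class': 'class'},
    'titles': {'index.html': 'Welcome to pyexample'},
    'paths': {'index.html': 'index.rst'},
}

# Write a sample file so the helper can be exercised without a Sphinx build.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'readthedocs-sphinx-domain-names.json')
    with open(path, 'w') as handle:
        json.dump(expected, handle)
    loaded = assert_json_file_equals(path, expected)
```

Comparing parsed JSON rather than raw text also keeps the test independent of key order and whitespace in the dumped file.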

humitos · 2019-06-26T11:44:06Z

readthedocs_ext/readthedocs.py

+        build_json = os.path.abspath(
+            os.path.join(app.outdir, '..', 'json')
+        )
+        outjson = os.path.join(build_json, 'readthedocs-sphinx-domain-names.json')


The logic to get this path could be moved to a function (in utils or similar), since it duplicates what generate_json_artifacts does.
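
One possible shape for such a helper, reusing the path logic from the diff. The name get_json_output_path is hypothetical, and the sample outdir is made up for the example.

```python
import os


def get_json_output_path(outdir, filename):
    """Return the absolute path of filename in the json directory next to outdir."""
    build_json = os.path.abspath(os.path.join(outdir, '..', 'json'))
    return os.path.join(build_json, filename)


# Example call with a made-up output directory.
outjson = get_json_output_path(
    os.path.join('docs', '_build', 'html'),
    'readthedocs-sphinx-domain-names.json',
)
```

Both dump_sphinx_data and generate_json_artifacts could then call the same function with app.outdir and their own file name.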

humitos · 2019-06-26T11:46:34Z

readthedocs_ext/readthedocs.py

        log.exception(
            'Failure in JSON search dump for page {page}'.format(page=outjson)
        )


+def dump_sphinx_data(app, exception):
+    """
+    Dump a bunch of additional Sphinx data that is useful during search indexing


The docstring could be expanded to describe the expected structure of the output dict: what each field is, plus some example values (for paths, whether they are relative or absolute, etc.).
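
Something along these lines, as a sketch. The three top-level keys come from the discussion in this PR, but the example values, and in particular the shape of the types values, are assumptions that would need to be checked against the real output. The build_sphinx_data function is hypothetical and only exists to carry the docstring.

```python
def build_sphinx_data(types, titles, paths):
    """
    Assemble the additional Sphinx data that is dumped for search indexing.

    Assumed structure (example values are illustrative, not real output):

    types
        Maps "domain:type" to a label,
        e.g. {"py:exception": "exception"}.
    titles
        Maps a page's target URI to its title,
        e.g. {"index.html": "Welcome to pyexample"}.
    paths
        Maps a page's target URI to its source file path,
        e.g. {"index.html": "index.rst"}.
    """
    return {'types': types, 'titles': titles, 'paths': paths}


data = build_sphinx_data(
    types={'py:exception': 'exception'},
    titles={'index.html': 'Welcome to pyexample'},
    paths={'index.html': 'index.rst'},
)
```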

humitos · 2019-06-26T11:47:09Z

readthedocs_ext/readthedocs.py

        log.exception(
            'Failure in JSON search dump for page {page}'.format(page=outjson)
        )


+def dump_sphinx_data(app, exception):


I think we should handle the case where exception is not None and just return.
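
A minimal sketch of that early return. Sphinx passes the exception raised during the build (or None) as the second argument to build-finished handlers; the write parameter is a hypothetical hook added only so the example can run without Sphinx.

```python
def dump_sphinx_data(app, exception, write=None):
    """Skip the dump when the build failed.

    `write` is a hypothetical hook for this sketch, standing in for the
    code that collects the data and writes the JSON file.
    """
    if exception is not None:
        # The build already failed, so the environment may be incomplete.
        return
    if write is not None:
        write(app)


calls = []
# Failed build: nothing should be written.
dump_sphinx_data('fake-app', RuntimeError('build failed'), write=calls.append)
calls_after_failure = list(calls)
# Successful build: the hook runs once.
dump_sphinx_data('fake-app', None, write=calls.append)
```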

humitos · 2019-06-26T11:58:52Z

> To be able to give better feedback, I'd like to know where this will be used (and how) so I can check that we have the data and the structure we need.

Never mind. I found the exact place:

https://github.com/rtfd/readthedocs.org/blob/20b028521d97b797474b05ebe6b7fef9d84142a8/readthedocs/projects/tasks.py#L1291-L1390

humitos · 2019-06-26T12:17:49Z

As a general question, why don't we save everything that we need in our own JSON created by this extension instead of mixing some data from here and some data parsed from objects.inv inside the Django App? I find the current approach complex.

It seems that we are processing the data in two different places when we can do it only in this extension and then only use it from the Django App to create the SphinxDomain objects.

Also, I didn't find where the paths key is used.

ericholscher · 2019-06-27T23:01:02Z

> Also, I didn't find where the paths key is used.

I think I'd like to store paths on the HTMLFile object, to keep track of where each file came from.

ericholscher · 2019-06-27T23:02:22Z

> It seems that we are processing the data in two different places when we can do it only in this extension and then only use it from the Django App to create the SphinxDomain objects.

Agreed. This is probably a better path forward. I will think about whether we should do it now or later.

humitos

Enough to be merged and start using it :)

We can improve it later if we want, in particular the processing of the data in multiple places. There is also some refactoring that can be done.

ericholscher · 2019-07-02T18:28:10Z

The main thing with parsing the data in multiple places is all the old repos that we need to support. This will only work for indexing going forward, but objects.inv will always exist.

ericholscher added 6 commits August 1, 2018 15:25

Futz

fb8bffb

Add dumping of domain data

ad9f44e

Remove ipdb

dfd003f

Testing

4553a55

Testing

7d5d167

Dump more JSON data for indexing

6dd2934

ericholscher added the PR: work in progress label Feb 28, 2019

ericholscher mentioned this pull request Mar 1, 2019

Add search for DomainData objects readthedocs/readthedocs.org#5290

Merged

ericholscher added 4 commits March 1, 2019 11:54

Fix tests and remove unused logic

f1041fd

Remove old logic

e15ea7d

Comment

7a9df1d

A bit more explicit variable naming

1361752

ericholscher requested a review from a team March 1, 2019 14:59

Update travis branch

0d0656a

humitos reviewed Mar 4, 2019

stsewd approved these changes Mar 4, 2019

ericholscher added 3 commits June 25, 2019 15:42

Merge remote-tracking branch 'origin/master' into get-domain-data

b418908

Add test and fix review feedback

7d88737

Fix string

ee117e7

ericholscher requested a review from a team June 25, 2019 23:04

ericholscher removed the PR: work in progress label Jun 25, 2019

humitos reviewed Jun 26, 2019

ericholscher mentioned this pull request Jun 27, 2019

Search indexing with storage readthedocs/readthedocs.org#5854

Merged

ericholscher added 3 commits June 27, 2019 16:13

Address review feedback

7f8574c

A bit more feedback

3733654

A bit more review feedback

c13a5aa

humitos approved these changes Jul 1, 2019

ericholscher merged commit cad5845 into master Jul 2, 2019

stsewd deleted the get-domain-data branch January 14, 2021 20:37
