Index more domain data into elasticsearch #5979

dojutsu-user · 2019-07-23T09:25:28Z

WIP

Related PR in readthedocs-sphinx-search -- readthedocs/readthedocs-sphinx-search#42

ericholscher

This looks like a good approach. I'd like to see some tests and a bit more detail in what all we're parsing for.

readthedocs/search/parse_json.py

dojutsu-user · 2019-07-24T12:02:10Z

@ericholscher
I have updated the PR.
This PR also indexes the Sphinx domain signature content (inside the <dd> tag).
I am not sure if it is needed or not.

I found some domains which don't have a proper HTML structure.
Example: https://2.python-requests.org/en/master/api/#requests.cookies.RequestsCookieJar.popitem
In this example:
Content inside <dt> tag: popitem() → (k, v), remove and return some (key, value) pair
Content inside <dd> tag: as a 2-tuple; but raise KeyError if D is empty.

So there's no proper domain signature and no proper domain docstring.
What should we do in this case?
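
For illustration, here is a minimal sketch of pulling the anchor, the <dt> text, and the <dd> text out of a definition list, using only the standard library's html.parser. The class and function names are hypothetical and this is not the code in parse_json.py; it assumes flat (non-nested) <dl> blocks. On the popitem example above it simply reports whatever text sits in each tag, which is why the signature and docstring come out split in an odd place.

```python
from html.parser import HTMLParser


class DomainEntryParser(HTMLParser):
    """Collect (anchor id, <dt> text, <dd> text) from definition lists.

    Hypothetical helper for illustration only. It does not handle
    <dl> elements nested inside a <dd>.
    """

    def __init__(self):
        super().__init__()
        self.entries = []     # finished entries, one dict per <dt>/<dd> pair
        self._current = None  # entry currently being built
        self._field = None    # "signature" inside <dt>, "docstring" inside <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._current = {
                "anchor": dict(attrs).get("id", ""),
                "signature": "",
                "docstring": "",
            }
            self._field = "signature"
        elif tag == "dd" and self._current is not None:
            self._field = "docstring"

    def handle_endtag(self, tag):
        if tag == "dt":
            self._field = None
        elif tag == "dd" and self._current is not None:
            self.entries.append(self._current)
            self._current = None
            self._field = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so always append.
        if self._current is not None and self._field:
            self._current[self._field] += data


def parse_domain_entries(html):
    """Return a list of dicts with whitespace-normalised text fields."""
    parser = DomainEntryParser()
    parser.feed(html)
    parser.close()
    return [
        {key: " ".join(value.split()) for key, value in entry.items()}
        for entry in parser.entries
    ]
```

Because the split follows the tags and not the meaning of the text, a malformed entry still gets indexed; the signature field just carries part of the description.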

dojutsu-user · 2019-07-24T12:02:57Z

@ericholscher

I'd like to see some tests

I will add the tests once we are sure the approach works.

ericholscher

This looks good. I haven't run it yet, so we might hit some bugs mapping from real world HTML, but the structure looks great.

readthedocs/search/parse_json.py

* remove and from docsearch API and start using .
* remove unnecessary fields from the PageDocument to index only relevant data into elasticsearch.
* Add to search fields in faceted_Search.py.
* Don't get from parse_json -- it is not required.
* Update to show domains docstrings in results in page.

dojutsu-user · 2019-07-29T08:32:15Z

@ericholscher
It turns out that not every Sphinx domain is inside a <dl> tag.
For example: http://docs.celeryproject.org/en/latest/userguide/configuration.html#beat-scheduler
Therefore, the docstrings for these Sphinx domains are not getting indexed. However, these docstrings will still be present in sections.content.

ericholscher · 2019-07-29T20:32:54Z

Those aren't in Sphinx domains: https://raw.githubusercontent.com/celery/celery/241d2e8ca85a87a2a6d01380d56eb230310868e3/docs/userguide/configuration.rst

dojutsu-user · 2019-07-29T21:04:38Z

@ericholscher
They are getting indexed as sphinx domains.

I believe, this sphinx domain is coming from this --

ericholscher

I have some questions on this. I'm guessing the removed data is because we aren't displaying it in the search results anymore? I think we still likely want to display the type and name, no?

readthedocs/core/static-src/core/js/doc-embed/search.js

readthedocs/search/documents.py

ericholscher · 2019-07-31T21:27:55Z

readthedocs/search/documents.py

            'anchor': fields.KeywordField(),

            # For showing in the search result
-            'type_display': fields.TextField(),
-            'doc_display': fields.TextField(),


Why are we removing this?

We don't need doc_display, because the main title of each result already contains it.
But I have reverted the change for type_display; we can make use of it.

😕
I can't seem to find a good way to display all the information.
type_display looks unnecessary when we are displaying role_name.

Well, sure. They are showing the same information. I believe the type_display is meant for human readable output, and role_name is not, no?

Both are quite readable:
py:function
std:confval
py:class

I think configuration value over confval is much better for non-technical users. We should definitely try to use the display value; that's the whole reason it exists.

@ericholscher
I realised (while testing) that some Sphinx domains don't have values for type_display.
Should we still show it for those that have one, or just keep showing role_name?
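
One possible way to handle the missing values, sketched under the assumption that type_display arrives as an empty string or None when Sphinx provides no display name: prefer type_display and fall back to role_name. The function name domain_label is hypothetical and not part of this PR.

```python
def domain_label(type_display, role_name):
    """Return the human-readable type if present, otherwise the role name.

    Hypothetical helper: `type_display` and `role_name` mirror the indexed
    field names discussed above, but this function is only an illustration.
    """
    label = (type_display or "").strip()
    return label if label else role_name
```

With this, a result for std:confval would show "configuration value" when that text is available, and "py:function" style role names only when it is not.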

ericholscher · 2019-07-31T21:30:49Z

readthedocs/templates/search/elastic_search.html

-
-                            {% if inner_hit.source.display_name|length >= 1 %}
-                              ({{ inner_hit.source.role_name }}) {{ inner_hit.source.display_name}}
+                    {% with "100" as MAX_SUBSTRING_LIMIT %}


I'm feeling the pain of maintaining multiple sets of result HTML. We should consider trying to get the site search using the search as you type code, hopefully with minimal changes from the theme code. That way we don't have 3 places to change HTML.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

Yes... maintaining multiple sets of html is definitely a pain.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

I think that's quite a big refactor and we can target that after this PR?

ericholscher · 2019-08-19T18:34:49Z

This looks like a great approach. Let's see if we can get it deployed this week after we fix up tests. 👍

ericholscher

Going to go ahead and merge this now that we have 3.7.3 out. Next deploy will ship this and require a reindex 👍

Index more domain data

7b3d3c3

dojutsu-user added the PR: work in progress Pull request is not ready for full review label Jul 23, 2019

dojutsu-user mentioned this pull request Jul 23, 2019

Dump sphinx domains docstrings readthedocs/readthedocs-sphinx-ext#74

Closed

dojutsu-user requested review from ericholscher and davidfischer July 23, 2019 09:41

dojutsu-user added 4 commits July 24, 2019 00:02

update json parsing to capture sphinx data

f1f522b

remove redundant import

20777cb

add comments

d97738f

fix type

c00575e

ericholscher reviewed Jul 23, 2019

dojutsu-user added 2 commits July 24, 2019 14:13

Merge branch 'master' into index-more-domain-data

e3a65dc

get domain signature alos

f57afe1

dojutsu-user requested a review from ericholscher July 24, 2019 19:29

ericholscher reviewed Jul 26, 2019

dojutsu-user added 9 commits July 26, 2019 21:10

Merge branch 'master' into index-more-domain-data

7a5e387

correct documents.py

9362b64

don't reindex domain data multiple times

3255f84

use try...except a little lower

bd53562

enable highlighting in main site file search

b6221c6

small fix in parse_json

5f186e7

Fix main site search to show sphinx domain docstrings

301b2af

remove _get_domain_data()

9e2d497

dojutsu-user mentioned this pull request Jul 29, 2019

Display domain docstrings in the search results readthedocs/readthedocs-sphinx-search#42

Merged

dojutsu-user requested a review from ericholscher July 29, 2019 09:45

Merge branch 'master' into index-more-domain-data

597a59b

dojutsu-user added 5 commits July 30, 2019 02:48

don't index toctree elements

849ba4d

add try...except

6501e7c

fix small error in parse_json

8e0fe5b

Update match query

8bb25b9

update parse_json to delete only specific <dl> tags

8cc8999

ericholscher reviewed Jul 31, 2019

dojutsu-user added 7 commits August 5, 2019 13:53

Restore path and link

090f187

index type_display

d456bf0

add comment back

5c74690

Merge branch 'master' into index-more-domain-data

17c88aa

update min files

28ec7aa

Merge branch 'master' into index-more-domain-data

2081269

update minified files and parse_json

2fe0d0a

dojutsu-user added 7 commits August 20, 2019 00:20

Merge branch 'master' into index-more-domain-data

6a9db48

fix lint

e61fd36

update json data

b597da9

fix tests

5c3f71b

update json data

7d1ed0c

fix tests

9c9f03f

Merge branch 'master' into index-more-domain-data

4beee77

dojutsu-user requested a review from ericholscher August 21, 2019 10:15

dojutsu-user mentioned this pull request Aug 26, 2019

Add Search Guide #6101

Merged

ericholscher approved these changes Aug 27, 2019

ericholscher merged commit 00ab116 into readthedocs:master Aug 27, 2019

dojutsu-user deleted the index-more-domain-data branch August 28, 2019 04:07
