Skip to content

Index more domain data into elasticsearch #5979

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Aug 27, 2019
Merged

Index more domain data into elasticsearch #5979

merged 36 commits into from
Aug 27, 2019

Conversation

dojutsu-user
Copy link
Member

@dojutsu-user dojutsu-user commented Jul 23, 2019

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good approach. I'd like to see some tests and a bit more detail in what all we're parsing for.

@dojutsu-user
Copy link
Member Author

@ericholscher
I have updated the PR.
Also, this PR also index sphinx domain signature content (Inside the <dd> tag).
I am not sure if it is needed or not.

I found some domains which don't have proper html strucutre.
Example: https://2.python-requests.org/en/master/api/#requests.cookies.RequestsCookieJar.popitem
In this.
Content inside <dt> tag: popitem() → (k, v), remove and return some (key, value) pair
Content inside <dd> tag: as a 2-tuple; but raise KeyError if D is empty.

So there's no proper domain signature and no proper domain doc strings.
What should we do in this case?

@dojutsu-user
Copy link
Member Author

@ericholscher

I'd like to see some tests

I will add the tests, once we are sure of the working.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I haven't run it yet, so we might hit some bugs mapping from real world HTML, but the structure looks great.

* remove  and  from docsearch API and start using .
* remove unnecessary fields from the PageDocument to index only relevant data into elasticsearch.
* Add  to search fields in faceted_Search.py.
* Don't get  from parse_json -- it is not required.
* Update  to show domains docstrings in results in  page.
@dojutsu-user
Copy link
Member Author

dojutsu-user commented Jul 29, 2019

@ericholscher
Turns out that not every sphinx domain is inside <dl> tag.
For example: http://docs.celeryproject.org/en/latest/userguide/configuration.html#beat-scheduler
Therefore, the docstrings for these sphinx domains are not getting indexed. However these docstrings will be there in sections.content.

@ericholscher
Copy link
Member

@dojutsu-user
Copy link
Member Author

dojutsu-user commented Jul 29, 2019

@ericholscher
They are getting indexed as sphinx domains.

Screenshot from 2019-07-30 02-26-41

Screenshot from 2019-07-30 02-29-32

I believe, this sphinx domain is coming from this --
Screenshot from 2019-07-30 13-48-10

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions on this. I'm guessing the removed data is because we aren't displaying it in the search results anymore? I think we still likely want to display the type and name, no?

'anchor': fields.KeywordField(),

# For showing in the search result
'type_display': fields.TextField(),
'doc_display': fields.TextField(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need doc_display -- because the main title of each result contains the doc_display.
But I have undoed the change for type_display -- we can make use of this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😕
I can't seem to find a good way to display every information.
type_display looks unncessary when we are displaying role_name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, sure. They are showing the same information. I believe the type_display is meant for human readable output, and role_name is not, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are quite readable:
py:function
std:confval
py:class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think configuration value over confval is much better for non-technical users. We should definitely try to use the display value, that's the whole reason it exists if for this use.

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericholscher
I realised (while testing) that some sphinx domains don't have values for type_display.
Should we still show for those who have? or just keep showing role_name?


{% if inner_hit.source.display_name|length >= 1 %}
({{ inner_hit.source.role_name }}) {{ inner_hit.source.display_name}}
{% with "100" as MAX_SUBSTRING_LIMIT %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling the pain of maintaining multiple sets of result HTML. We should consider trying to get the site search using the search as you type code, hopefully with minimal changes from the theme code. That way we don't have 3 places to change HTML.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... maintaining multiple sets of html is definitely a pain.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

I think that's quite a big refactor and we can target that after this PR?

@ericholscher
Copy link
Member

This looks like a great approach. Let's see if we can get it deployed this week after we fix up tests. 👍

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to go ahead and merge this now that we have 3.7.3 out. Next deploy will ship this and require a reindex 👍

@ericholscher ericholscher merged commit 00ab116 into readthedocs:master Aug 27, 2019
@dojutsu-user dojutsu-user deleted the index-more-domain-data branch August 28, 2019 04:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: work in progress Pull request is not ready for full review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants