Skip to content

Search: weight page views into search results #7297

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 22 commits into from

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Jul 16, 2020

This is on top of #7293 since is needed to save the correct path into the page views.

Still needs to test this more locally, and write tests :D

I'm calculating the final weight based on two values:

  • absolute views: based on the views from the current page.
  • relative views: based on the relative popular number of views from the current version.

This allows us to prevent pages with a few views take over the results, but at the same time
popular pages from that particular project can still add some weight.

Things I want to test more:

  • Use the average instead of the max to get the value from relative views. (didn't see any improvement at first sight, but I added a note in case we want to try it)

This is under a feature flag to see if the db doesn't implode 💣

Design decisions after merging this PR:

  • Should this be enabled by default to all projects?
  • Should this be enabled per version or per project? (per project wouldn't require a rebuild and it lives in the admin, per version it would live in the config file and require a re-build)

Closes #5968

stsewd added 8 commits July 14, 2020 19:35
Page name is a sphinx only concept,
so we don't have correct data for page views from mkdocs projects.

Using the origin url + the unresolver gives us the correct path.
Also, we are always storing the full path (index.html instead of "/"),
so we don't have duplicates (let me know if we do want to have duplicates).
@stsewd stsewd added the PR: work in progress Pull request is not ready for full review label Jul 16, 2020
@stsewd stsewd removed the PR: work in progress Pull request is not ready for full review label Jul 20, 2020
@stsewd stsewd requested a review from a team July 20, 2020 19:42
Base automatically changed from page-views to master August 27, 2020 15:43
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have too much experience with ES, but this PR looks fine to me.

The only thing that I'm seeing is that even it's behind a feature flag, we are changing the ES query completely from what we currently have. So, we could break existing ES search for projects even if we don't enable the feature flag on them. We could keep the old _get_script_score code (without weight) for those projects that don't have the feature flag enabled; and execute the new ES code on those projects with the feature flag.

Example,

def _get_script_score():
  if has_feature(SEARCH_WEIGHT):
    return _get_script_score_with_weight()
  else:
    return _get_script_score_without_weight()  # old code

On the other hand, as a user, I think it's hard to tests how it behaves with and without the feature flag enabled if the user has to ask us to enable/disable it. If we want to have better feedback, we probably need to give the user the ability to enable/disable it by themselves. That way, you can perform a query, enable the feature, and repeat it. Repeating the process multiple times will give us a better understanding how the search perform on those projects.

Comment on lines +274 to +276
- absolute: this is equal to ``log10(views + 1)``
(we add one since logarithms start at 1).
A logarithmic function is a good fit due to its growth rate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This absolute value is across all the versions for the same page?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, here views is the number of views from that page in that version, this value isn't compared to anything (that's why it's absolute).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if we have a page /api/v2.html that has a lot of views in version 1.0, then a new version of the docs is published, 1.1; the new page /1.1/api/v2.html won't be shown first in the results because it will have less views?

If that is correct, won't we be pointing the user always to old pages because they will have more views than the up-to-date version of the exact same page?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar example searching for the term API in 1.1 version of the docs. It may refers to another page (i.e. configuration.html) instead of api/v2.html because the newer version of configuration.html has more views than the newer version of api/v2.html --but historically (old versions) api/v2.hml has a lot more views. Makes sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I avoided merging versions since the content can be different, I think we should keep versions isolated. Another example is if a new api/v3.html is published, we don't want to list the results from the old api first.

@stsewd
Copy link
Member Author

stsewd commented Sep 9, 2020

The only thing that I'm seeing is that even it's behind a feature flag, we are changing the ES query completely from what we currently have.

We are adding the extra bits to get the page views, the other logic is the same (first two lines). Projects that don't have the flag will default to 0 for the value of page views, we add the values from page views and ranking so 0 doesn't change the final result.

@humitos
Copy link
Member

humitos commented Sep 10, 2020

That's what I'm saying. There is a completely different new script.source ES code executed even for projects without the new feature flag (old code was only 2 lines). We should keep executing the old ES code that we are 100% sure it works for projects without the flag, IMO

@stsewd
Copy link
Member Author

stsewd commented Sep 10, 2020

We are not changing or deleting the old logic, tests are passing too. I don't think is worth having two scripts for this change.

@stale
Copy link

stale bot commented Oct 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Oct 31, 2020
@stsewd stsewd removed the Status: stale Issue will be considered inactive soon label Nov 2, 2020
@stale
Copy link

stale bot commented Dec 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Dec 18, 2020
@ericholscher ericholscher added the Accepted Accepted issue on our roadmap label Dec 18, 2020
@stale stale bot removed the Status: stale Issue will be considered inactive soon label Dec 18, 2020
@humitos
Copy link
Member

humitos commented Mar 7, 2023

@stsewd what's the status of this PR? @ericholscher is this PR aligned with our vision of the search in the future? In that case, should we prioritize it, or close it?

@ericholscher
Copy link
Member

ericholscher commented Mar 7, 2023

I think this is useful, but if it's super complex, I don't think it's worth adding a bunch of complexity to our search to support it, especially when we don't really know if users want this or not.

Also curious to hear @stsewd's thoughts on the value/complexity trade off here.

@stsewd
Copy link
Member Author

stsewd commented Mar 7, 2023

I'm fine closing this, I think our current search ranking works well https://docs.readthedocs.io/en/stable/config-file/v2.html#search, sorting by popularity has other problems #5968 (comment).

@stsewd stsewd closed this Mar 7, 2023
@stsewd stsewd deleted the search-with-page-views branch March 7, 2023 16:04
@ericholscher
Copy link
Member

Should we also close the issue, then?

@stsewd
Copy link
Member Author

stsewd commented Mar 7, 2023

Should we also close the issue, then?

I'm fine with that too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Order search results by most viewed pages
3 participants