Search: Allow authors to set a "search score" per pages #7082

Closed
stsewd opened this issue May 14, 2020 · 25 comments · Fixed by #7237
Labels
Improvement Minor improvement to code Needed: design decision A core team decision is required

Comments

@stsewd
Member

stsewd commented May 14, 2020

The use case is basically when users want to deprecate something but still don't want to delete that content (like API v1 vs. v2).

Or, in our case, we have the design docs, and those results currently rank above the actual content.

This is related to #5968, but I think making this explicit is better, since docs for v1 could probably have more views, but we want users to start using v2 (new pages!).

I'm not sure what the best way to do this is.
We could allow users to add this in a meta tag, search-ranking=x or search-score=x, where x could be an integer >= 0. This only allows users to rank content per page, not per section, but I think that's enough.

@stsewd stsewd added Improvement Minor improvement to code Needed: design decision A core team decision is required labels May 14, 2020
@stsewd stsewd changed the title Search: Allow authors to set a relevance "ranking" Search: Allow authors to set a "search score" per pages May 14, 2020
@stsewd
Member Author

stsewd commented May 15, 2020

We could also just use the robots metatag https://support.google.com/webmasters/answer/79812?hl=en

We still show results for those pages, but with lower priority.

@ericholscher
Member

I think this is definitely a good feature. The more we can allow users to customize search the better.

Another one I've wanted is the ability to add "tags" or similar, so that we can boost pages for results on a search term. I don't think tags is the right name, but some concept that lets users say "I want this to rank highly for this search term".

@ltalirz

ltalirz commented Jun 2, 2020

Just to mention that also in our case the API docs autogenerated with sphinx-apidoc often rank very highly in the search results, and it would be great to have a way of reducing their "search score".

@ericholscher
Member

@ltalirz Thanks for the example! What would the best implementation be for you? Sounds like it would probably need to be path specific? eg. search_ranks = {'apidoc/*': 1, 'user-guide/*': 10} or something?
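
A path-specific setting like the one suggested above could be resolved with glob matching. The following is only a minimal sketch: `search_ranks` is the hypothetical name from the comment, `rank_for_page` is an illustrative helper, and using `fnmatch` with "first matching pattern wins" is an assumption, not a description of how Read the Docs implements it.

```python
from fnmatch import fnmatch

# Hypothetical setting copied from the suggestion above; not a real option name.
search_ranks = {"apidoc/*": 1, "user-guide/*": 10}


def rank_for_page(path, ranks, default=None):
    """Return the rank of the first pattern that matches ``path``.

    Patterns are checked in insertion order; ``default`` is returned
    when no pattern matches.
    """
    for pattern, rank in ranks.items():
        if fnmatch(path, pattern):
            return rank
    return default


if __name__ == "__main__":
    # "*" in fnmatch also matches "/" so whole subtrees are covered.
    print(rank_for_page("apidoc/module/index.html", search_ranks))
    print(rank_for_page("index.html", search_ranks))
```

Because `*` in `fnmatch` also matches path separators, a single pattern covers a whole directory subtree, which is the behaviour asked about in the reply below.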

@ltalirz

ltalirz commented Jun 9, 2020

Hi @ericholscher, thanks for following up on this and sorry for the late reply!

Sounds like it would probably need to be path specific?

Are you referring to the ability to cover directory subtrees instead of just individual pages?
Yes, that would be very useful.

eg. search_ranks = {'apidoc/*': 1, 'user-guide/*': 10} or something?

I think that would already work for us :-)

I'm not familiar with how elasticsearch handles relevance - what I understand from here is that pages are ranked by scores that are floating point numbers, and you can specify boosts of different kinds.
I.e. would it be more natural to use floating point numbers rather than ranks here?

Anyhow, for us it would not really matter - I suggest you decide by what makes the integration with elasticsearch as simple as possible.

@stsewd
Member Author

stsewd commented Jun 24, 2020

I was able to implement this. The final number that ES requires needs to be greater than 0. Numbers less than 1 push the search results for that page further down, and numbers greater than 1 boost the results for that page.

We could expose that directly to users:

version: 2

search:
  boosting:
    - api/v1/*: 0.5
    - api/v2/*: 2

Or have an internal range and map that to ES:

version: 2

search:
  boosting:
    - api/v1/*: -1
    - api/v2/*: 2

e.g., -3, -2, -1, 0, 1, 2, 3 -> 0.3, 0.5, 0.7, 1, 3, 5, 7

An alternative name:

version: 2

search:
  rank:
    - api/v1/*: -1
    - api/v2/*: 2
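
The "internal range mapped to ES" variant could be sketched as a plain lookup table. The values are the example ones from this comment; the names `RANK_TO_BOOST` and `boost_for_rank` are illustrative only and not taken from the Read the Docs code base.

```python
# Internal rank -> Elasticsearch boost, using the example values above.
# Every boost is > 0; values below 1 demote a page, values above 1 promote it.
RANK_TO_BOOST = {-3: 0.3, -2: 0.5, -1: 0.7, 0: 1, 1: 3, 2: 5, 3: 7}


def boost_for_rank(rank):
    """Translate a user-facing rank into the boost value passed to ES."""
    try:
        return RANK_TO_BOOST[rank]
    except KeyError:
        raise ValueError(
            f"rank must be an integer between -3 and 3, got {rank!r}"
        ) from None


if __name__ == "__main__":
    for rank in (-1, 0, 2):
        print(rank, "->", boost_for_rank(rank))
```

A table keeps the user-facing numbers stable while leaving room to retune the actual boosts later without touching anyone's configuration.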

@ltalirz

ltalirz commented Jun 24, 2020

This looks great!
From my perspective, floating point numbers are perfectly fine, and since it's called "boosting" in ES, I think it's fine to call it that here as well.
I.e. I would vote for variant 1.

@ericholscher
Member

I think passing along the actual values we plan to use in ES is probably not the best design. We likely want to be able to tune how these boost numbers interact with our other methods of optimization and boosting, so I think an abstract range that we translate is best. Something like a range from -10 to +10, where:

  • -10 maps to 0 (or 0.01 if we can't set 0)
  • -2 maps to 0.8
  • 2 maps to 1.2
  • 9 maps to 1.9
  • 10 maps to 2

Or something like this.
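
The sample points listed above all lie on the line boost = 1 + rank / 10, so one possible reading of this proposal is a linear translation with a small positive floor. This is only a sketch under that assumption; `rank_to_boost` and `MIN_BOOST` are illustrative names, and the mapping that eventually shipped may differ.

```python
MIN_BOOST = 0.01  # fallback in case a boost of exactly 0 cannot be set


def rank_to_boost(rank):
    """Map a rank in [-10, 10] to a boost in [MIN_BOOST, 2]."""
    if not -10 <= rank <= 10:
        raise ValueError(f"rank must be between -10 and 10, got {rank!r}")
    # Rounding hides floating point noise such as 1.9000000000000001.
    return max(MIN_BOOST, round(1 + rank / 10, 2))


if __name__ == "__main__":
    for rank in (-10, -2, 0, 2, 9, 10):
        print(rank, "->", rank_to_boost(rank))
```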

@stsewd
Member Author

stsewd commented Jul 2, 2020

This should be out by next week, in the meantime you can check the docs for the new option at https://docs.readthedocs.io/en/latest/config-file/v2.html#search

@stsewd
Member Author

stsewd commented Jul 7, 2020

This is live now, you can set the custom ranking!

Let us know if it works as expected; we can still tweak it a little more.

@ltalirz

ltalirz commented Jul 7, 2020

@stsewd Thanks, I've just given it a try, but it seems to have the opposite effect compared to the one I expected:

I've set

search:
  ranking:
    reference/apidoc/*: -5

with the goal of moving hits from the APIdoc further down in the results.

This is the original search ranking: https://aiida.readthedocs.io/projects/aiida-core/en/latest/search.html?q=workflows
This is the new one: https://aiida.readthedocs.io/projects/aiida-core/en/fix-search-rank/search.html?q=workflows

The new search ranking seems to consistently rank pages from the APIdoc higher than the old one (also for other search terms).
I'm not very familiar with what "rank" actually means in this context, perhaps someone could elaborate a bit?

@stsewd
Member Author

stsewd commented Jul 7, 2020

I can see the correct results actually.

No rank:

[screenshot: Screenshot_2020-07-07 Search — AiiDA 1 3 0 documentation]

Custom rank:

[screenshot: Screenshot_2020-07-07 Search — AiiDA 1 3 0 documentation(1)]

Results from the apidoc/ dir are now much further down the results list.

@ltalirz

ltalirz commented Jul 7, 2020

Wait, this may be a different issue - when I click on the second link
https://aiida.readthedocs.io/projects/aiida-core/en/fix-search-rank/search.html?q=workflows
(with ranking specified), I get this:

[screenshot of the search results]

I.e. not only is the result order different from the one you see, but it somehow finds more than twice the number of pages!
I've tried Chrome and Safari, so it does not seem to be a browser caching issue.
Any ideas?

You created your second screenshot from clicking on the link above correct?

P.S. In case it helps, this is the branch from which the docs are generated: https://github.com/aiidateam/aiida-core/tree/fix-search-rank, with the search rank being added in the last commit aiidateam/aiida-core@3e7223e

@stsewd
Member Author

stsewd commented Jul 7, 2020

@ltalirz that looks like it's rendering the default results from Sphinx. Check whether you have an ad blocker installed; it may be blocking our override from https://assets.readthedocs.org/static/javascript/readthedocs-doc-embed.js

You created your second screenshot from clicking on the link above correct?

yeah, I'm on firefox

@ltalirz

ltalirz commented Jul 7, 2020

I did have an adblocker on, but I'm getting the same results as before after disabling it + I get the same on firefox and safari where I don't have adblockers installed.

Can it be a geographical region thing? I'm in Switzerland.

@ltalirz

ltalirz commented Jul 7, 2020

I do have a couple of 404s in the browser console, though (and only for the branch with updated settings):
[screenshot of the browser console]

Did we mess up something in our docs?
Still, why are you seeing the correct results then?

@stsewd
Member Author

stsewd commented Jul 7, 2020

ha, that looks related to the CDN. Can you try with https://aiida.readthedocs.io/projects/aiida-core/en/fix-search-rank/search.html?q=workflows&foo=bar ? Also, you can try re-building the version in case the CDN failed or something went wrong in the previous build.
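
Appending an unused query parameter, as in the link above, changes the URL so that a cache keyed on the full URL treats it as a new resource. A minimal sketch of doing that programmatically, with `add_cache_buster` as an illustrative helper name and `foo=bar` taken from the link above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, key="foo", value="bar"):
    """Return ``url`` with an extra query parameter appended.

    Existing query parameters are kept in their original order.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    original = (
        "https://aiida.readthedocs.io/projects/aiida-core/en/"
        "fix-search-rank/search.html?q=workflows"
    )
    print(add_cache_buster(original))
```

If the page looks different with the extra parameter, a stale cached copy of the original URL is the likely culprit rather than the search configuration.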

@ltalirz

ltalirz commented Jul 7, 2020

Thanks, with the link you provided, the results match your screenshot.
Does this mean some weird caching was going on at the CDN layer?

I'll rebuild the version now to see whether it fixes the link also without &foo

@stsewd
Member Author

stsewd commented Jul 7, 2020

Does this mean some weird caching was going on at the CDN layer?

Yeah, maybe the cache is taking longer to purge in Switzerland, or something broke in the previous build.
The cache gets cleared automatically after one hour in case something failed, so this problem would have disappeared after that.

@ltalirz

ltalirz commented Jul 7, 2020

The cache gets cleared automatically after one hour in case something failed, so this problem would have disappeared after that.

Do you suspect something failed?
Or is the problem perhaps that there was no failure detected and therefore the cache was not cleared (perhaps merging two things together that don't belong together)?
Looking at the timestamp of my comment, my previous build was ~2h ago.

@ltalirz

ltalirz commented Jul 7, 2020

OK, after the rebuild the original link also displays as in your screenshot.
Sorry for the spam about this unrelated issue. It seems like the ranking works just fine :-) (thanks a lot for this, it will be very useful for us!)

@stsewd
Member Author

stsewd commented Jul 7, 2020

Maybe another layer of cache (ISP?)? You can also check the Age header if you run into this in the future. Glad it's useful!
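
The standard HTTP Age response header reports how many seconds a response has been sitting in a cache. A minimal sketch of reading it with the Python standard library follows; `cache_age_seconds` and `fetch_cache_age` are illustrative helper names, and whether a given CDN or intermediate cache actually sends the header is not guaranteed.

```python
from urllib.request import Request, urlopen


def cache_age_seconds(headers):
    """Return the Age header as an int, or None if it is missing or invalid."""
    value = headers.get("Age")
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def fetch_cache_age(url, timeout=10):
    """Send a HEAD request to ``url`` and return the cached age in seconds."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return cache_age_seconds(response.headers)


if __name__ == "__main__":
    # Offline demonstration with a plain dict standing in for real headers.
    print(cache_age_seconds({"Age": "3600"}))
```

An age well above the time since the last build would suggest that the cache was not purged for that URL.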

@chrisjsewell
Contributor

@stsewd quick related question regarding search indexing/ranking, does https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#html-metadata feed into the RTD search rankings, and if so how?

e.g. if you added this example and searched for my-keyword would you expect "A heading" to show (what would be highlighted?) and, if so, should the keyword go before or after the heading:

.. meta::
    :keywords: my-keyword

A heading
=========

.. meta::
    :keywords: my-keyword

@ltalirz tried adding a keyword (above the title) in aiidateam/aiida-core#4217 (comment), but it didn't seem to make any difference

@stsewd
Member Author

stsewd commented Jul 14, 2020

Currently, we don't process any metatags, but I can see that as a future improvement. Also, I think metatags are always rendered at the top of the html document.

@chrisjsewell
Contributor

thanks for the quick reply @stsewd

Also, I think metatags are always rendered at the top of the html document.

yep that looks to be the case, @ltalirz FYI I see now that you can only use meta to apply to the whole page, and since this is already used previously on the page for groupath (that is rendered at the top of the document), the latter querybuilder one actually appears to be ignored 😞
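
To see which keywords actually end up in a built page, the rendered HTML can be inspected directly. This is only a sketch using the standard library; `MetaKeywordsParser` and `meta_keywords` are illustrative names, and the sample markup is an assumption about what a `.. meta::` directive renders to.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collect the contents of every <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "keywords":
            return
        content = attributes.get("content") or ""
        self.keywords.extend(
            keyword.strip() for keyword in content.split(",") if keyword.strip()
        )


def meta_keywords(html):
    """Return all keywords declared in meta tags of an HTML document."""
    parser = MetaKeywordsParser()
    parser.feed(html)
    return parser.keywords


if __name__ == "__main__":
    sample = '<html><head><meta content="my-keyword" name="keywords" /></head></html>'
    print(meta_keywords(sample))
```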
