Skip to content

Search: use multi-fields for Wildcard queries #7613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 9, 2021
Merged

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Oct 28, 2020

Wildcard queries are slow (on .org it returns 502, on .com since the db
is small it works just fine).

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html#query-dsl-allow-expensive-queries

These type of queries can be optimized by using the Wildcard field
https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type.

Currently that field isn't implemented in the ES dependencies,
but is just a subclass of the Keyword field (I'll see if I can send a PR
upstream).

As we still want to make use of the SimpleString queries,
I'm using the multi-fields feature.

This change is editing the index,
so we need to rebuild the index and re-index.

To test it locally:

  • inv docker.manage 'search_index --rebuild'.
  • inv docker.manage reindex_elasticsearch.
  • Enable the DEFAULT_TO_FUZZY_SEARCH feature flag on a project.

I'm not sure how to write tests for this one,
and we can't test if this is really fast in production...
But testing it locally works and gives better results for both, sections and
domains!

This is based on #7582 since the wildcard field was added in ES 7.x.
As we are going to re-index when deploying 7.x, I think we can deploy this change together with that (otherwise we'll need to do the re-index again).

One example where you can see better results from domains (with the wildcard search there is a match in the title instead of matching the content)

Screenshot_2020-10-28 Read the Docs Documentation Simplified — Read the Docs 5 6 1 documentation(1)
Screenshot_2020-10-28 Read the Docs Documentation Simplified — Read the Docs 5 6 1 documentation

@stsewd stsewd changed the title Use multi fields Search: use multi-fields for Wildcard queries Oct 28, 2020
@stsewd stsewd changed the base branch from master to search-7.x October 28, 2020 18:40
@stsewd stsewd added the Status: blocked Issue is blocked on another issue label Oct 28, 2020
Wildcard queries are slow (on .org it returns 502, on .com since the db
is small it works just fine).

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html#query-dsl-allow-expensive-queries

These type of queries can be optimized by using the Wildcard field
https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type.

Currently that field isn't implemented in the ES dependencies,
but is just a subclass of the Keyword field (I'll see if I can send a PR
upstream).

As we still want to make use of the SimpleString queries,
I'm using the multi-fields feature.

This change is editing the index,
so we need to rebuild the index and re-index.

To test it locally:

- `inv docker.manage 'search_index --rebuild'`.
- `inv docker.manage reindex_elasticsearch`.
- Enable the `DEFAULT_TO_FUZZY_SEARCH` feature flag on a project.

I'm not sure how to write tests for this one,
and we can't test if this is really fast in production...
But testing it locally works and gives better results for both, sections and
domains!
@stsewd stsewd requested a review from a team October 28, 2020 22:37
Base automatically changed from search-7.x to master December 7, 2020 20:28
@stsewd stsewd removed the Status: blocked Issue is blocked on another issue label Dec 10, 2020
@stsewd
Copy link
Member Author

stsewd commented Dec 10, 2020

This isn't blocked anymore

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow too much the ES code for this change, but if we are worried about the queries being slow, shouldn't we put this behind a feature flag and roll it out slowly?

@stsewd
Copy link
Member Author

stsewd commented Jan 25, 2021

@humitos this feature is already in production under a feature flag, it's slow (doing a wildcard search on a no-wildcard field), that's why this PR, using a WildcardField should make it faster.

@stale
Copy link

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Jun 2, 2021
@stale stale bot removed the Status: stale Issue will be considered inactive soon label Jun 7, 2021
@stsewd stsewd requested a review from ericholscher June 7, 2021 19:49
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change, but I assume requires a reindex? I'm a little worried about this, but I guess we can always just roll back to the old index if this breaks during deploy?


We use multi-fields to be able to perform other kind of queries over the same field.
``raw`` fields are used for Wildcard queries.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@stsewd
Copy link
Member Author

stsewd commented Jun 9, 2021

@ericholscher just tested this locally, we don't need to re-index during deploy, we can do it later, or even just rebuilding a project would create the new fields to make use of them (some fields won't be highlighted till we re-index but is under a feature flag, so all good). I think we could just re-build some projects to test the change, and if it looks good we can re-index everything to test it more broadly with the feature flag.

@stsewd stsewd merged commit b523a78 into master Jun 9, 2021
@stsewd stsewd deleted the use-multi-fields branch June 9, 2021 16:49
stsewd added a commit that referenced this pull request Jun 23, 2021
This is a partial revert of
#7613.
Turns out having more fields increase the size of the index,
and that makes all queries slower, even if they aren't using the new
field.
stsewd added a commit that referenced this pull request Jun 24, 2021
This is a partial revert of
#7613.
Turns out having more fields increase the size of the index,
and that makes all queries slower, even if they aren't using the new
field.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants