Skip to content

Refactor search code #5197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 52 commits into from
Feb 6, 2019
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
47e3c1a
Reactor search code
ericholscher Jan 29, 2019
a84bb15
Show more results in JS search results
ericholscher Jan 29, 2019
6f36acd
Fix docstring
ericholscher Jan 29, 2019
0bfa998
Fix lint
ericholscher Jan 29, 2019
936465d
Fix tests
ericholscher Jan 29, 2019
0496cf6
Small fixes for search
ericholscher Jan 29, 2019
ab28491
Fix comment.
ericholscher Jan 29, 2019
617f47f
Move to pformat and real logging.
ericholscher Jan 29, 2019
d0f6082
Standardize search result listings
ericholscher Jan 29, 2019
af8a28f
Only request 3 fragments, as that’s all we display
ericholscher Jan 29, 2019
5f197ec
Fix pytest test logic.
ericholscher Jan 29, 2019
465628e
Handle changing newlines to periods.
ericholscher Jan 30, 2019
e3e245c
Fix tests.
ericholscher Jan 30, 2019
e909759
Check for content to highlight
ericholscher Jan 30, 2019
49098d7
Attempt to fix tests agian.
ericholscher Jan 30, 2019
07dc26b
Delete some old code, and remove single function/class files.
ericholscher Jan 30, 2019
e9f3f3f
Fix lint error.
ericholscher Jan 30, 2019
d6d02da
Keep all search views in the search app
ericholscher Jan 30, 2019
336e674
remove need to pass around an index_name
ericholscher Jan 30, 2019
eb65817
Properly filter project search
ericholscher Jan 30, 2019
144ed0e
Refactor the Document and FacetedSearch classes
ericholscher Jan 31, 2019
d314f58
Merge remote-tracking branch 'origin/master' into readd-search-signals
ericholscher Jan 31, 2019
03ad482
Fix lint
ericholscher Jan 31, 2019
5dfc17a
Add TODO for signal handling
ericholscher Feb 4, 2019
fcc9620
Use old value for ELASTICSEARCH_DSL_AUTOSYNC in tests
ericholscher Feb 4, 2019
701ecdb
Update JS file
ericholscher Feb 5, 2019
f636737
Add logging to our search JS to debug
ericholscher Feb 5, 2019
7583f9a
Make log a bit better
ericholscher Feb 5, 2019
747da90
Show the actual result length instead of the API count, since it’s wr…
ericholscher Feb 5, 2019
6c317f1
Update bundle with this change.
ericholscher Feb 5, 2019
e2655de
Remove use of signals in favor of SettingsOverrideObject
ericholscher Feb 5, 2019
f457ff7
Simpler super's
ericholscher Feb 5, 2019
7b2013c
Docstring for some methods/functions and small linting
humitos Feb 5, 2019
0741a25
Fix search lint
ericholscher Feb 5, 2019
85956e0
Comment the proper place
ericholscher Feb 5, 2019
6ac2f35
Merge branch 'readd-search-signals' of github.com:rtfd/readthedocs.or…
humitos Feb 5, 2019
e062191
Merge remote-tracking branch 'origin/master' into readd-search-signals
ericholscher Feb 5, 2019
8e4cc2b
Nicer highlight replacement syntax
ericholscher Feb 5, 2019
8c7bda4
Remove search debug logging.
ericholscher Feb 5, 2019
5b9f460
Use normal user object everywhere.
ericholscher Feb 5, 2019
e2e271b
Merge remote-tracking branch 'origin/readd-search-signals' into readd…
ericholscher Feb 5, 2019
a993f08
Fix typo
ericholscher Feb 5, 2019
d52b968
Use classic JS loop
ericholscher Feb 5, 2019
fc277fa
Update docs
ericholscher Feb 5, 2019
417ea45
Cap operators
ericholscher Feb 5, 2019
80c58c7
Fix lint again
ericholscher Feb 5, 2019
72d867f
Once more with the linting
ericholscher Feb 5, 2019
5f118ff
Change API queryset filter to `public(user)`
ericholscher Feb 6, 2019
e263e69
Small doc fixup
ericholscher Feb 6, 2019
1a3e146
Support filter_user argument for not filtering users in corporate sea…
ericholscher Feb 6, 2019
fab9f42
Address review feedback
ericholscher Feb 6, 2019
0a06726
More cleanup.
ericholscher Feb 6, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 12 additions & 6 deletions readthedocs/core/static-src/core/js/doc-embed/search.js
Original file line number Diff line number Diff line change
Expand Up @@ -55,12 +55,18 @@ function attach_elastic_search_query(data) {

// Show highlighted texts
if (highlight.content) {
var content_text = xss(highlight.content[0]);
var contents = $('<div class="context">');

contents.html(content_text);
contents.find('em').addClass('highlighted');
list_item.append(contents);
for (index in highlight.content) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, JS quirk here. for ... in isn't actually for iterables, it's for properties on an object. It's better to use the old for (var i; ...) approach for arrays:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...in

There is for ... of, which loops over iterables, but browser support is still new:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of#Browser_compatibility

if (index < 3) {
// Show up to 3 results for search
var content = highlight.content[index];
var content_text = xss(content);
var contents = $('<div class="context">');

contents.html("..." + content_text + "...");
contents.find('em').addClass('highlighted');
list_item.append(contents);
}
}
}

Search.output.append(list_item);
Expand Down
23 changes: 14 additions & 9 deletions readthedocs/projects/views/public.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
import operator
import os
from collections import OrderedDict
from pprint import pformat

import requests
from django.conf import settings
Expand Down Expand Up @@ -247,10 +248,23 @@ def elastic_project_search(request, project_slug):
version_slug = request.GET.get('version', LATEST)
query = request.GET.get('q', None)
results = None

if query:
kwargs = {}
kwargs['projects_list'] = [project.slug]
kwargs['versions_list'] = version_slug
user = ''
if request.user.is_authenticated:
user = request.user

page_search = PageDocument.faceted_search(
query=query, user=user, **kwargs
)
results = page_search.execute()

log.debug('Search results: %s', pformat(results.to_dict()))
log.debug('Search facets: %s', pformat(results.facets.to_dict()))

log.info(
LOG_TEMPLATE.format(
user=user,
Expand All @@ -262,15 +276,6 @@ def elastic_project_search(request, project_slug):
),
)

if query:
req = PageDocument.simple_search(query=query)
filtered_query = (
req.filter('term', project=project.slug)
.filter('term', version=version_slug)
)
paginated_query = filtered_query[:50]
results = paginated_query.execute()

return render(
request,
'search/elastic_project_search.html',
Expand Down
22 changes: 15 additions & 7 deletions readthedocs/search/api.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,15 +2,13 @@
from rest_framework.exceptions import ValidationError

from readthedocs.search.documents import PageDocument
from readthedocs.search.filters import SearchFilterBackend
from readthedocs.search.pagination import SearchPagination
from readthedocs.search.serializers import PageSearchSerializer
from readthedocs.search.utils import get_project_list_or_404


class PageSearchAPIView(generics.ListAPIView):
pagination_class = SearchPagination
filter_backends = [SearchFilterBackend]
serializer_class = PageSearchSerializer

def get_queryset(self):
Expand All @@ -24,7 +22,15 @@ def get_queryset(self):
# Validate all the required params are there
self.validate_query_params()
query = self.request.query_params.get('q', '')
queryset = PageDocument.simple_search(query=query)
kwargs = {}
kwargs['projects_list'] = [p.slug for p in self.get_all_projects()]
kwargs['versions_list'] = self.request.query_params.get('version')
user = ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a better pattern to default to the AnonymousUser instead. So, anywhere where this is used, all the method are still available and returning the proper values.

I think you can pass self.request.user directly without checking anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree here, unless there is some significance to ES handling the user as an empty string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, was just the old way we were doing it. Fixed it now.

if self.request.user.is_authenticated:
user = self.request.user
queryset = PageDocument.faceted_search(
query=query, user=user, **kwargs
)
return queryset

def validate_query_params(self):
Expand All @@ -43,13 +49,15 @@ def get_serializer_context(self):
context['projects_url'] = self.get_all_projects_url()
return context

def get_all_projects_url(self):
version_slug = self.request.query_params.get('version')
def get_all_projects(self):
project_slug = self.request.query_params.get('project')
all_projects = get_project_list_or_404(project_slug=project_slug, user=self.request.user)
projects_url = {}
return all_projects

def get_all_projects_url(self):
all_projects = self.get_all_projects()
version_slug = self.request.query_params.get('version')
projects_url = {}
for project in all_projects:
projects_url[project.slug] = project.get_docs_url(version_slug=version_slug)

return projects_url
34 changes: 11 additions & 23 deletions readthedocs/search/documents.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,14 +30,19 @@ class Meta(object):
})
language = fields.KeywordField()

# Fields to perform search with weight
search_fields = ['name^5', 'description']

@classmethod
def faceted_search(cls, query, language=None, using=None, index=None):
def faceted_search(cls, query, user, language=None, using=None, index=None):
kwargs = {
'user': user,
'using': using or cls._doc_type.using,
'index': index or cls._doc_type.index,
'doc_types': [cls],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericholscher I think we can pass the doc_type from here to avoide the lazy import.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a peek at that in a refactor, I think it's OK for now.

'model': cls._doc_type.model,
'query': query
'query': query,
'fields': cls.search_fields
}

if language:
Expand Down Expand Up @@ -69,9 +74,12 @@ class Meta(object):
'search/index.html', 'genindex/index.html', 'py-modindex/index.html']

@classmethod
def faceted_search(cls, query, projects_list=None, versions_list=None, using=None, index=None):
def faceted_search(
cls, query, user, projects_list=None, versions_list=None, using=None, index=None
):
es_query = cls.get_es_query(query=query)
kwargs = {
'user': user,
'using': using or cls._doc_type.using,
'index': index or cls._doc_type.index,
'doc_types': [cls],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, pass the doc_types in order to avoide the lazy import

Expand All @@ -90,26 +98,6 @@ def faceted_search(cls, query, projects_list=None, versions_list=None, using=Non

return FileSearch(**kwargs)

@classmethod
def simple_search(cls, query, using=None, index=None):
"""
Do a search without facets.

This is used in:

* The Docsearch API
* The Project Admin Search page
"""

es_search = cls.search(using=using, index=index)
es_search = es_search.highlight_options(encoder='html')

es_query = cls.get_es_query(query=query)
highlighted_fields = [f.split('^', 1)[0] for f in cls.search_fields]
es_search = es_search.query(es_query).highlight(*highlighted_fields)

return es_search

@classmethod
def get_es_query(cls, query):
"""Return the Elasticsearch query generated from the query string"""
Expand Down
53 changes: 39 additions & 14 deletions readthedocs/search/faceted_search.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,9 @@
import logging

from elasticsearch_dsl import FacetedSearch, TermsFacet
from readthedocs.search.signals import before_file_search, before_project_search

log = logging.getLogger(__name__)


class RTDFacetedSearch(FacetedSearch):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class if it's not used here, should probably lives on corporate repo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh? It's definitely used on the .org.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. Got confused by the comment.

Expand All @@ -8,7 +13,8 @@ class RTDFacetedSearch(FacetedSearch):
# TODO: Remove the overwrite when the elastic/elasticsearch-dsl-py#916
# See more: https://github.com/elastic/elasticsearch-dsl-py/issues/916

def __init__(self, using, index, doc_types, model, fields=None, **kwargs):
def __init__(self, user, using, index, doc_types, model, fields=None, **kwargs):
self.user = user
self.using = using
self.index = index
self.doc_types = doc_types
Expand All @@ -17,28 +23,47 @@ def __init__(self, using, index, doc_types, model, fields=None, **kwargs):
self.fields = fields
super(RTDFacetedSearch, self).__init__(**kwargs)

def search(self):
"""
Filter out full content on search results

This was causing all of the indexed content to be returned,
which was never used on the client side.
"""
s = super().search()
s = s.source(exclude=['content', 'headers'])
resp = self.signal.send(sender=self, user=self.user, search=s)
if resp:
# Signal return a search object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its not appropiate approach to fire a signal and use the returned object.
The signal can be listened by more than one function. In that case this will break the functionality and it will be hard to debug this kind of errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some docs about this: https://docs.djangoproject.com/en/1.11/topics/signals/#django.dispatch.Signal.send

We should be careful about how we manipulate the result considering that it's a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I tried to manipulate the object that was passed, but it didn't work in my testing. I'm not sure another good way to implement this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the line resp[0][1] will break if there are more signals connected to before_file_search signal, since the result will depend on which one is executed first.

Why are we using a signal for this? I suppose that it's because we need to override this on .com. In that case, isn't it better to make it a method and override the class using our SettingsOverride system?

I found that this signal is connected only to one receiver on each project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, having a class override on the .com is probably a cleaner solution here. I will work on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed up a new version with this approach.

try:
s = resp[0][1]
except AttributeError:
log.exception('Failed to return a search object from search signals')
return s

def query(self, search, query):
"""
Add query part to ``search`` when needed

Also does HTML encoding of results to avoid XSS issues.

"""
search = search.highlight_options(encoder='html', number_of_fragments=3)
if not isinstance(query, str):
search = search.query(query)
return search


class ProjectSearch(RTDFacetedSearch):
fields = ['name^5', 'description']
facets = {
'language': TermsFacet(field='language')
}
signal = before_project_search


class FileSearch(RTDFacetedSearch):
facets = {
'project': TermsFacet(field='project'),
'version': TermsFacet(field='version')
}

def query(self, search, query):
"""
Add query part to ``search``

Overriding because we pass ES Query object instead of string
"""
search = search.highlight_options(encoder='html')
if query:
search = search.query(query)

return search
signal = before_file_search
20 changes: 0 additions & 20 deletions readthedocs/search/filters.py

This file was deleted.

9 changes: 8 additions & 1 deletion readthedocs/search/serializers.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
import logging
from pprint import pformat

from rest_framework import serializers

log = logging.getLogger(__name__)


class PageSearchSerializer(serializers.Serializer):
project = serializers.CharField()
Expand All @@ -18,4 +23,6 @@ def get_link(self, obj):
def get_highlight(self, obj):
highlight = getattr(obj.meta, 'highlight', None)
if highlight:
return highlight.to_dict()
ret = highlight.to_dict()
log.debug('API Search highlight: %s', pformat(ret))
return ret
4 changes: 2 additions & 2 deletions readthedocs/search/signals.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,17 +9,16 @@

from readthedocs.projects.models import HTMLFile, Project
from readthedocs.projects.signals import bulk_post_create, bulk_post_delete
from readthedocs.search.documents import PageDocument, ProjectDocument
from readthedocs.search.tasks import index_objects_to_es


before_project_search = django.dispatch.Signal(providing_args=['body'])
before_file_search = django.dispatch.Signal(providing_args=['body'])
before_section_search = django.dispatch.Signal(providing_args=['body'])


@receiver(bulk_post_create, sender=HTMLFile)
def index_html_file(instance_list, **_):
from readthedocs.search.documents import PageDocument
kwargs = {
'app_label': HTMLFile._meta.app_label,
'model_name': HTMLFile.__name__,
Expand All @@ -42,6 +41,7 @@ def remove_html_file(instance_list, **_):

@receiver(post_save, sender=Project)
def index_project(instance, *args, **kwargs):
from readthedocs.search.documents import ProjectDocument
kwargs = {
'app_label': Project._meta.app_label,
'model_name': Project.__name__,
Expand Down
4 changes: 2 additions & 2 deletions readthedocs/search/tests/test_api.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@
@pytest.mark.search
class TestDocumentSearch(object):

def __init__(self):
# This reverse needs to be inside the ``__init__`` method because from
def setUp(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search/tests/test_api.py:13
  /Users/eric/projects/readthedocs.org/readthedocs/search/tests/test_api.py:13: PytestWarning: cannot collect test class 'TestDocumentSearch' because it has a __init__ constructor

# This reverse needs to be inside the ``setUp`` method because from
# the Corporate site we don't define this URL if ``-ext`` module is not
# installed
self.url = reverse('doc_search')
Expand Down
4 changes: 2 additions & 2 deletions readthedocs/search/tests/test_faceted_search.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ def test_search_exact_match(self, client, project, case):
cased_query = getattr(query_text, case)
query = cased_query()

page_search = PageDocument.faceted_search(query=query)
page_search = PageDocument.faceted_search(query=query, user='')
results = page_search.execute()

assert len(results) == 1
Expand All @@ -37,7 +37,7 @@ def test_search_combined_result(self, client, project):
- Where `Foo` or `Bar` is present
"""
query = 'Official Support'
page_search = PageDocument.faceted_search(query=query)
page_search = PageDocument.faceted_search(query=query, user='')
results = page_search.execute()
assert len(results) == 3

Expand Down
11 changes: 1 addition & 10 deletions readthedocs/search/tests/test_xss.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,16 +9,7 @@ class TestXSS:

def test_facted_page_xss(self, client, project):
query = 'XSS'
page_search = PageDocument.faceted_search(query=query)
results = page_search.execute()
expected = """
&lt;h3&gt;<em>XSS</em> exploit&lt;&#x2F;h3&gt;
""".strip()
assert results[0].meta.highlight.content[0][:len(expected)] == expected

def test_simple_page_xss(self, client, project):
query = 'XSS'
page_search = PageDocument.simple_search(query=query)
page_search = PageDocument.faceted_search(query=query, user='')
results = page_search.execute()
expected = """
&lt;h3&gt;<em>XSS</em> exploit&lt;&#x2F;h3&gt;
Expand Down
Loading