Add section linking for the search result #5829


Merged: 70 commits, merged Jul 12, 2019.

Commits
All 70 commits below are by dojutsu-user.

ee1ba1a - add sections field (Jun 19, 2019)
4b05f8a - index each section as separate document in ES (Jun 21, 2019)
79d2459 - Merge branch 'master' into search-section-linking (Jun 21, 2019)
54ceb5c - few refactoring (Jun 21, 2019)
b11e357 - revert all (Jun 24, 2019)
5b81471 - Merge branch 'master' into search-section-linking (Jun 24, 2019)
762a79d - update document mapping (nested fields) (Jun 24, 2019)
7a61dbd - format text (Jun 24, 2019)
644565b - get results from inner_hits (Jun 25, 2019)
fa51a1c - Merge branch 'master' into search-section-linking (Jun 25, 2019)
0bc6be5 - correct the none (Jun 25, 2019)
0139993 - add field for PageDocument. (Jun 26, 2019)
53a02e8 - remove domain_index settings (Jun 26, 2019)
11ba9e7 - Merge branch 'htmlfile-sphinx-domain-integration' into search-section… (Jun 26, 2019)
6207f4e - remove SphinxDomainDocument and DomainSearch (Jun 26, 2019)
a251a98 - generate correct query (Jun 26, 2019)
b6847b9 - remove boosting and allsearch (Jun 26, 2019)
af2d69f - remove allsearch import (Jun 26, 2019)
32d0bed - recursively remove newline characters from highlight dict (Jun 26, 2019)
d472f29 - Merge branch 'master' into search-section-linking (Jun 27, 2019)
878343d - lint fix (Jun 27, 2019)
f98d91c - Merge branch 'master' into search-section-linking (Jun 28, 2019)
7c1c641 - set number_of_fragments to 1 (Jun 28, 2019)
fd8e8f7 - use nested facet (Jun 28, 2019)
f6221ec - get sorted results (Jul 2, 2019)
8840606 - Merge branch 'master' into search-section-linking (Jul 2, 2019)
60e229c - fix search.html (Jul 2, 2019)
3835e2e - remove unused imports and add logging (Jul 2, 2019)
ae5033c - show more data on domain objects (Jul 3, 2019)
28e7cbf - fix main site search (Jul 3, 2019)
1e2a40b - mark as safe and change log to debug (Jul 3, 2019)
7b7a3c9 - add transpiled files -- js (Jul 3, 2019)
3931bc0 - remove log (Jul 3, 2019)
84a2494 - small improvements in template (Jul 3, 2019)
5cae508 - change variable name (Jul 3, 2019)
adb74ed - fix template (Jul 4, 2019)
d500d98 - fix lint (Jul 4, 2019)
9461d4f - use python datatypes (Jul 4, 2019)
75dcc2f - remove highlight url param from sections and domains (Jul 4, 2019)
ea36138 - fix clashing css classes (Jul 4, 2019)
451c0f4 - Merge branch 'master' into search-section-linking (Jul 8, 2019)
0817d43 - use underscore.js template (Jul 8, 2019)
5305458 - add _ with variables (Jul 8, 2019)
68cb7af - add comment in template (Jul 8, 2019)
d62bf3e - use .iterator() (Jul 8, 2019)
ed16e56 - show multiple results per section, if present (Jul 9, 2019)
0ed64f7 - fix sphinx indexing (Jul 9, 2019)
f988302 - don't index '-' value of domain.display_name (Jul 9, 2019)
429b3e9 - fix eslint (Jul 9, 2019)
897e09f - Merge branch 'master' into search-section-linking (Jul 9, 2019)
aeaba6f - reduce complexity in search.js (Jul 10, 2019)
6f9b2bc - refactor tasks.py file (Jul 10, 2019)
6135cde - fix logic in search.views (Jul 10, 2019)
d3566ac - make 100 a constant (Jul 10, 2019)
992c72e - Add checkbox for searching in current section (Jul 10, 2019)
f0babf1 - remove checkbox code for now (Jul 10, 2019)
4527839 - Merge branch 'master' into search-section-linking (Jul 10, 2019)
7e75d7e - fix test_imported_file (Jul 10, 2019)
1e6721d - fix test_search_json_parsing (Jul 10, 2019)
2a4c070 - fix test_search_json_parsing (Jul 10, 2019)
4beec39 - update test_search_json_parsing (Jul 10, 2019)
01346a0 - Merge branch 'master' into search-section-linking (Jul 11, 2019)
91282de - refactor parse_json and its test (Jul 11, 2019)
cfe8f5b - write initial tests (Jul 11, 2019)
7e99f6a - make 100 as constant (Jul 11, 2019)
b7ce777 - fix lint (Jul 12, 2019)
6701a4e - add test for domains and filter by version and project (Jul 12, 2019)
cee24ed - revert changes to python_environments.py (Jul 12, 2019)
685f6db - remove tests from this pr (Jul 12, 2019)
d7edeee - update template to make 100 as constant (Jul 12, 2019)
127 changes: 77 additions & 50 deletions readthedocs/core/static-src/core/js/doc-embed/search.js
@@ -4,6 +4,7 @@

var rtddata = require('./rtd-data');
var xss = require('xss/lib/index');
var MAX_RESULT_PER_SECTION = 3;


/*
@@ -48,10 +49,10 @@ function attach_elastic_search_query(data) {
}

// Creating the result from elements
var link = doc.link + DOCUMENTATION_OPTIONS.FILE_SUFFIX +
'?highlight=' + $.urlencode(query);
var link = doc.link + DOCUMENTATION_OPTIONS.FILE_SUFFIX;
var highlight_link = link + "?highlight=" + $.urlencode(query);

var item = $('<a>', {'href': link});
var item = $('<a>', {'href': highlight_link});
item.html(title);
item.find('em').addClass('highlighted');
list_item.append(item);
@@ -68,86 +69,112 @@ function attach_elastic_search_query(data) {

var contents = $('<div class="context">');

var section_template = '' +
'<div>' +
'<a href="<%= section_subtitle_link %>">' +
'<%= section_subtitle %>' +
'</a>' +
'</div>' +
'<% for (var i = 0; i < section_content.length; ++i) { %>' +
'<div>' +
'<%= section_content[i] %>' +
'</div>' +
'<% } %>';
Member: These strings are really cumbersome, is there not a good way to do multi-line strings in JS? :(

Member: @davidfischer thoughts here? Is it too fancy and new?
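For context on the multi-line-strings question: ES6 template literals would give native multi-line strings, but they are the newer syntax the thread is worried about, and the PR instead keeps ES5 concatenation and moves the markup into underscore.js templates. A minimal sketch of the two styles (the variable names here are illustrative, not from the PR):

```javascript
// Illustrative only: ES5 concatenation vs. an ES6 template literal.
var link = "page.html";
var anchor = "section-id";
var title = "Installation";

// ES5-style concatenation, as used throughout search.js:
var es5 = '<div>' +
    '<a href="' + link + '#' + anchor + '">' + title + '</a>' +
    '</div>';

// ES6 template literal equivalent (multi-line capable, but newer syntax):
var es6 = `<div><a href="${link}#${anchor}">${title}</a></div>`;

console.log(es5 === es6);  // true
```

Both produce identical markup; the trade-off is purely browser support versus readability.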


var domain_template = '' +
'<div>' +
'<a href="<%= domain_subtitle_link %>">' +
'<%= domain_subtitle %>' +
'</a>' +
'</div>' +
'<span>' +
'<%= domain_content %>' +
'</span>';

// if the result is a page section
if(inner_hits[j].type === "sections") {

var section = inner_hits[j];
var section_subtitle = $('<div class="rtd_search_subtitle">');
var section_subtitle_link = $('<a href="' + link + "#" + section._source.id + '">');
var section_content = $('<span>');
var section_subtitle = section._source.title;
var section_subtitle_link = link + "#" + section._source.id;
var section_content = [section._source.content.substring(0, 100) + " ..."];

if (section.highlight) {
if (section.highlight["sections.title"]) {
section_subtitle_link.html(
xss(section.highlight["sections.title"][0])
);
} else {
section_subtitle_link.html(
section._source.title
);
section_subtitle = xss(section.highlight["sections.title"][0]);
}

if (section.highlight["sections.content"]) {
section_content.html(
"... " +
xss(section.highlight["sections.content"][0]) +
" ..."
);
} else {
section_content.html(
section._source.content.substring(0, 100) +
" ..."
);
var content = section.highlight["sections.content"];
section_content = [];
for (
var k = 0;
k < content.length && k < MAX_RESULT_PER_SECTION;
k += 1
) {
section_content.push("... " + xss(content[k]) + " ...");
}
}
}

section_subtitle.html(section_subtitle_link);
contents.append(section_subtitle);
contents.append(section_content);
contents.find('em').addClass('highlighted');
contents.append(
$u.template(
section_template,
{
section_subtitle_link: section_subtitle_link,
section_subtitle: section_subtitle,
section_content: section_content
}
)
);
}

// if the result is a sphinx domain object
if (inner_hits[j].type === "domains") {

var domain = inner_hits[j];
var domain_subtitle = $('<div class="rtd_search_subtitle">');
var domain_subtitle_link = $('<a href="' + link + "#" + domain._source.anchor + '">');
var domain_content = $('<span>');
var domain_subtitle = domain._source.role_name;
var domain_subtitle_link = link + "#" + domain._source.anchor;
var domain_content = "";

if (
typeof domain._source.display_name === "string" &&
domain._source.display_name.length >= 2
domain._source.display_name.length >= 1
) {
domain_subtitle_link.html(
"(" +
domain._source.role_name +
")" +
domain._source.display_name
);
} else {
domain_subtitle_link.html(domain._source.role_name);
domain_subtitle = "(" + domain._source.role_name + ") " + domain._source.display_name;
}

// preparing domain_content
domain_content.append(domain._source.type_display + " -- ");
// domain_content = type_display --
domain_content = domain._source.type_display + " -- ";
Member: This feels weird. Should this be a template also?

Member Author: I have reduced the ifs here. It is less complicated and more readable now.

if (domain.highlight) {
if (domain.highlight["domains.name"]) {
domain_content.append(
xss(domain.highlight["domains.name"][0])
);
// domain_content = type_display -- name
domain_content += xss(domain.highlight["domains.name"][0]);
} else {
domain_content.append(domain._source.name);
// domain_content = type_display -- name
domain_content += domain._source.name;
}
} else {
// domain_content = type_display -- name
domain_content += domain._source.name;
}
domain_content.append(" -- in " + domain._source.doc_display);

domain_subtitle.append(domain_subtitle_link);
contents.append(domain_subtitle);
contents.append(domain_content);
contents.find('em').addClass('highlighted');
// domain_content = type_display -- name -- in doc_display
domain_content += " -- in " + domain._source.doc_display;

contents.append(
$u.template(
domain_template,
{
domain_subtitle_link: domain_subtitle_link,
domain_subtitle: domain_subtitle,
domain_content: domain_content
}
)
);
}

contents.find('em').addClass('highlighted');
list_item.append(contents);
list_item.append($("<br>"));
}
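The MAX_RESULT_PER_SECTION loop in the diff above can be sketched in isolation. This hypothetical helper mirrors the capping logic; the real code also runs each fragment through xss() before pushing it:

```javascript
// Hypothetical helper mirroring the diff's loop: keep at most
// MAX_RESULT_PER_SECTION highlighted fragments per section result.
var MAX_RESULT_PER_SECTION = 3;

function collectFragments(highlighted) {
    var section_content = [];
    for (
        var k = 0;
        k < highlighted.length && k < MAX_RESULT_PER_SECTION;
        k += 1
    ) {
        section_content.push("... " + highlighted[k] + " ...");
    }
    return section_content;
}

console.log(collectFragments(["a", "b", "c", "d"]));  // three fragments, "d" is dropped
```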
2 changes: 1 addition & 1 deletion readthedocs/core/static/core/js/readthedocs-doc-embed.js

Large diffs are not rendered by default.

31 changes: 7 additions & 24 deletions readthedocs/projects/tasks.py
@@ -1282,13 +1282,8 @@ def fileify(version_pk, commit, build):
except Exception:
log.exception('Failed during ImportedFile creation')

try:
_update_intersphinx_data(version, path, commit, build)
Member: I think we should keep this as 3 functions called from here, instead of one. That will make it more robust to issues of exceptions. So we should have:

  • _create_imported_files
  • _create_intersphinx_data
  • _sync_imported_files

And refactor the code to match.

except Exception:
log.exception('Failed during SphinxDomain creation')


def _update_intersphinx_data(version, path, commit, build):
def _create_intersphinx_data(version, path, commit, build):
"""
Update intersphinx data for this version.

@@ -1376,24 +1371,6 @@ def warn(self, msg):
build=build,
)

# Index new SphinxDomain objects to elasticsearch
index_new_files(model=SphinxDomain, version=version, build=build)

# Remove old SphinxDomain from elasticsearch
remove_indexed_files(
model=SphinxDomain,
version=version,
build=build,
)

# Delete SphinxDomain objects from the previous build of the version.
(
SphinxDomain.objects
.filter(project=version.project, version=version)
.exclude(build=build)
.delete()
)


def clean_build(version_pk):
"""Clean the files used in the build of the given version."""
@@ -1481,6 +1458,12 @@ def _manage_imported_files(version, path, commit, build):
build=build,
)

# create SphinxDomain objects
try:
_create_intersphinx_data(version, path, commit, build)
except Exception:
log.exception('Failed during SphinxDomain objects creation')

# Index new HTMLFiles to elasticsearch
Member: This is where the new _sync_imported_files function should start.

index_new_files(model=HTMLFile, version=version, build=build)

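The review suggestion above (three functions called from fileify, each guarded) comes down to isolating failures per step, so one broken step cannot abort the others. A runnable sketch of that pattern with placeholder bodies; the helper names follow the reviewer's proposal, not necessarily the final code:

```python
import logging

log = logging.getLogger(__name__)

def _create_imported_files(version, path, commit, build):
    pass  # placeholder for ImportedFile creation

def _create_intersphinx_data(version, path, commit, build):
    raise RuntimeError('boom')  # simulate a failure in one step

def fileify(version, path, commit, build):
    # Each step gets its own try/except, so a failure is logged
    # and the steps that follow it still run.
    try:
        _create_imported_files(version, path, commit, build)
    except Exception:
        log.exception('Failed during ImportedFile creation')
    try:
        _create_intersphinx_data(version, path, commit, build)
    except Exception:
        log.exception('Failed during SphinxDomain objects creation')
    return 'reached the end despite the failure'

print(fileify('stable', '/tmp/docs', 'abc123', build=42))
```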
11 changes: 6 additions & 5 deletions readthedocs/search/api.py
@@ -1,3 +1,4 @@
import itertools
import logging
from pprint import pformat

@@ -43,18 +44,18 @@ def get_highlight(self, obj):
def get_inner_hits(self, obj):
inner_hits = getattr(obj.meta, 'inner_hits', None)
if inner_hits:
sections = inner_hits.sections
domains = inner_hits.domains
all_results = list(sections) + list(domains)
sections = inner_hits.sections or []
domains = inner_hits.domains or []
all_results = itertools.chain(sections, domains)

sorted_results = [
sorted_results = (
{
'type': hit._nested.field,
'_source': hit._source.to_dict(),
'highlight': self._get_inner_hits_highlights(hit),
}
for hit in sorted(all_results, key=utils._get_hit_score, reverse=True)
]
)

return sorted_results

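The api.py change above merges the two nested-hit lists lazily with itertools.chain and orders them by Elasticsearch score. A standalone sketch with plain dicts standing in for elasticsearch-dsl hit objects (_get_hit_score is simplified here):

```python
import itertools

def _get_hit_score(hit):
    # Stand-in for readthedocs.search.utils._get_hit_score.
    return hit['_score']

sections = [{'type': 'sections', '_score': 1.2},
            {'type': 'sections', '_score': 3.4}]
domains = [{'type': 'domains', '_score': 2.5}]

# chain() avoids building an intermediate list; sorted() then orders
# every inner hit, regardless of type, by descending score.
all_results = itertools.chain(sections, domains)
sorted_results = sorted(all_results, key=_get_hit_score, reverse=True)

print([(hit['type'], hit['_score']) for hit in sorted_results])
# [('sections', 3.4), ('domains', 2.5), ('sections', 1.2)]
```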
29 changes: 17 additions & 12 deletions readthedocs/search/documents.py
@@ -111,18 +111,23 @@ class Meta:

def prepare_domains(self, html_file):
"""Prepares and returns the values for domains field."""
domains_qs = html_file.sphinx_domains.all()
domains_qs = domains_qs.exclude(domain='std', type__in=['doc', 'label'])

all_domains = [{
'role_name': domain.role_name,
'doc_name': domain.doc_name,
'anchor': domain.anchor,
'type_display': domain.type_display,
'doc_display': domain.doc_display,
'name': domain.name,
'display_name': domain.display_name,
} for domain in domains_qs]
domains_qs = html_file.sphinx_domains.exclude(
domain='std',
type__in=['doc', 'label']
).iterator()

all_domains = [
{
'role_name': domain.role_name,
'doc_name': domain.doc_name,
'anchor': domain.anchor,
'type_display': domain.type_display,
'doc_display': domain.doc_display,
'name': domain.name,
'display_name': domain.display_name if domain.display_name != '-' else '',
}
for domain in domains_qs
]

return all_domains

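The prepare_domains change above does two things: streams the queryset with .iterator() and refuses to index a literal '-' display_name (per commit f988302). The second rule in isolation, with a stand-in class instead of the SphinxDomain model:

```python
class Domain:
    # Stand-in for the SphinxDomain model; only the fields used here.
    def __init__(self, name, display_name):
        self.name = name
        self.display_name = display_name

domains_qs = [Domain('os.path.join', '-'), Domain('open', 'open a file')]

all_domains = [
    {
        'name': domain.name,
        # A literal '-' (meaning "no display name") is indexed as ''
        # so it can never match a search query.
        'display_name': domain.display_name if domain.display_name != '-' else '',
    }
    for domain in domains_qs
]

print(all_domains)
# [{'name': 'os.path.join', 'display_name': ''},
#  {'name': 'open', 'display_name': 'open a file'}]
```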
12 changes: 6 additions & 6 deletions readthedocs/search/faceted_search.py
@@ -101,14 +101,14 @@ class PageSearchBase(RTDFacetedSearch):
doc_types = [PageDocument]
index = PageDocument._doc_type.index

outer_fields = ['title']
section_fields = ['sections.title', 'sections.content']
domain_fields = [
_outer_fields = ['title']
_section_fields = ['sections.title', 'sections.content']
_domain_fields = [
'domains.type_display',
'domains.name',
'domains.display_name',
Member Author: Not sure which fields are to be included.

]
fields = outer_fields
fields = _outer_fields

# need to search for both 'and' and 'or' operations
# the score of and should be higher as it satisfies both or and and
@@ -125,7 +125,7 @@ def query(self, search, query):
sections_nested_query = self.generate_nested_query(
query=query,
path='sections',
fields=self.section_fields,
fields=self._section_fields,
inner_hits={
Member: Can we not use the DSL highlight function here, because it's a nested query?

Member Author: Yes. Here's an example of it -- elastic/elasticsearch-dsl-py#960

'highlight': {
'number_of_fragments': 1,
@@ -141,7 +141,7 @@ def query(self, search, query):
domains_nested_query = self.generate_nested_query(
query=query,
path='domains',
fields=self.domain_fields,
fields=self._domain_fields,
inner_hits={
'highlight': {
'number_of_fragments': 1,
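As the thread above notes, highlighting for nested documents has to be requested through inner_hits rather than the top-level DSL highlight call. A sketch of the query body generate_nested_query presumably produces (field names are taken from the diff; the exact structure is an assumption):

```python
import json

query_text = 'install'

# Hypothetical reconstruction of the nested query body for the
# "sections" path; a top-level highlight block cannot reach nested
# documents, so the highlight request lives inside inner_hits.
nested_query = {
    'nested': {
        'path': 'sections',
        'query': {
            'multi_match': {
                'query': query_text,
                'fields': ['sections.title', 'sections.content'],
            }
        },
        'inner_hits': {
            'highlight': {
                'number_of_fragments': 1,
                'fields': {
                    'sections.title': {},
                    'sections.content': {},
                },
            }
        },
    }
}

print(json.dumps(nested_query, indent=2))
```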
2 changes: 1 addition & 1 deletion readthedocs/search/utils.py
@@ -178,7 +178,7 @@ def _remove_newlines_from_dict(highlight):
"""
for k, v in highlight.items():
if isinstance(v, dict):
highlight[k] = self._remove_newlines_from_dict(v)
highlight[k] = _remove_newlines_from_dict(v)
else:
# elastic returns the contents of the
# highlighted field in a list.
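The one-line utils.py fix above matters because _remove_newlines_from_dict is a module-level function, so recursing through `self` would raise a NameError at runtime. A runnable sketch; the list-handling branch is an assumption, since the diff truncates before it:

```python
def _remove_newlines_from_dict(highlight):
    for k, v in highlight.items():
        if isinstance(v, dict):
            # Module-level function: recurse directly, not via `self`.
            highlight[k] = _remove_newlines_from_dict(v)
        else:
            # Elasticsearch returns highlighted fragments as a list of
            # strings; strip newlines from each one (assumed branch).
            highlight[k] = [fragment.replace('\n', ' ') for fragment in v]
    return highlight

cleaned = _remove_newlines_from_dict({
    'sections.title': ['Install\nguide'],
    'inner_hits': {'domains.name': ['foo\nbar']},
})
print(cleaned)
# {'sections.title': ['Install guide'], 'inner_hits': {'domains.name': ['foo bar']}}
```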
23 changes: 12 additions & 11 deletions readthedocs/search/views.py
@@ -1,5 +1,6 @@
"""Search views."""
import collections
import itertools
import logging
from pprint import pformat

@@ -107,12 +108,14 @@ def elastic_search(request, project_slug=None):

# sorting inner_hits (if present)
try:
for hit in results.hits.hits:
sections = hit['inner_hits'].get('sections', [])
domains = hit['inner_hits'].get('domains', [])
all_results = list(sections) + list(domains)
for result in results:

sorted_results = [
inner_hits = result.meta.inner_hits
sections = inner_hits.sections or []
domains = inner_hits.domains or []
all_results = itertools.chain(sections, domains)

sorted_results = (
{
'type': hit._nested.field,

@@ -126,15 +129,13 @@ def elastic_search(request, project_slug=None):
),
}
for hit in sorted(all_results, key=utils._get_hit_score, reverse=True)
]
)

hit['inner_hits'].pop('sections', None)
hit['inner_hits'].pop('domains', None)
hit['inner_hits'] = sorted_results
result.meta.inner_hits = sorted_results

except:
except Exception as e:
# if the control comes in this block,
# that implies that there was PageSearch
# that implies that there was a PageSearch
pass
Member: We should log this with log.exception instead of pass, or fix this logic. We shouldn't hit this in a normal case.

Member Author: I have changed the logic. Turns out that a simple if was sufficient. 😃 I have still wrapped the code in a try...except block to log any exceptions (if they occur) to help with debugging later.


log.debug('Search results: %s', pformat(results.to_dict()))