[GSoC 2018] All Search Improvements #4636

Merged
merged 86 commits into from
Sep 28, 2018
Commits
3c41b42
first phase to elasticsearch 6.2.x
safwanrahman Jun 8, 2018
6410495
adding requirements
safwanrahman Jun 8, 2018
272b50a
implementing project search, test and travis fix
safwanrahman Jun 9, 2018
b8f1a06
fixing travis
safwanrahman Jun 9, 2018
6c430e5
fixing search install plugin
safwanrahman Jun 9, 2018
035c312
fixing up tests
safwanrahman Jun 9, 2018
746b378
fixing lint
safwanrahman Jun 9, 2018
de47978
first phase file search
safwanrahman Jun 13, 2018
ab6fffb
indexing the file objects
safwanrahman Jun 14, 2018
3523fab
File searching basic backend task has been implemented
safwanrahman Jun 14, 2018
9a5b0ed
integrate the new search with view and template
safwanrahman Jun 14, 2018
e9b1c03
fixing highlighting
safwanrahman Jun 15, 2018
37f6936
fixing up tests
safwanrahman Jun 18, 2018
f730556
adding migration
safwanrahman Jun 18, 2018
05f5e05
lint fix
safwanrahman Jun 18, 2018
0965a94
renameing
safwanrahman Jun 19, 2018
d4f6708
Merge pull request #4211 from safwanrahman/search
ericholscher Jun 19, 2018
c13798b
adding tests for case insensitive search
safwanrahman Jun 20, 2018
26e21de
fixup
safwanrahman Jun 20, 2018
a464ddc
[fix #2013] Delete search index after file get removed
safwanrahman Jun 20, 2018
10b7f36
fixup according to comments
safwanrahman Jun 20, 2018
54f0106
Implement exact match search and rewrite for operator ordering
safwanrahman Jun 22, 2018
3cdac0c
Merge pull request #4277 from safwanrahman/insensitive
ericholscher Jun 21, 2018
12d7f9b
adding test for faceted search
safwanrahman Jun 22, 2018
dd18370
fixing linter
safwanrahman Jun 22, 2018
e8ac769
[fix #4265] Port Document search API for Elasticsearch 6.x
safwanrahman Jun 27, 2018
e9bfeee
Merge pull request #4292 from safwanrahman/exact_match
ericholscher Jun 26, 2018
ccf2382
fixup and adding test
safwanrahman Jun 28, 2018
a82f006
more fixup
safwanrahman Jun 28, 2018
044565b
adding more tests
safwanrahman Jun 29, 2018
e2b8d8c
adding more tests and fixup
safwanrahman Jun 30, 2018
8dcc149
fixing lint
safwanrahman Jun 30, 2018
1a5b30e
fixing regex
safwanrahman Jul 2, 2018
f8d5e7f
Adding link to serialized data
safwanrahman Jul 4, 2018
d30bac3
fixing python3 compatibility
safwanrahman Jul 5, 2018
1b47227
Merge pull request #4309 from safwanrahman/search_api
ericholscher Jul 6, 2018
2586e15
fixup the migration
safwanrahman Jul 16, 2018
39ada00
Merge pull request #4340 from safwanrahman/docsearch
ericholscher Jul 16, 2018
8fc3b65
[Fix #4333] Implement asynchronous search reindex functionality usin…
safwanrahman Jul 13, 2018
fb16187
fixing lint
safwanrahman Jul 13, 2018
39d8031
fixup
safwanrahman Jul 14, 2018
bbbdca5
fixup message
safwanrahman Jul 16, 2018
faca6de
fixup index name
safwanrahman Jul 16, 2018
fd54d69
fixing command
safwanrahman Jul 16, 2018
b9dbb5d
fixing docstring
safwanrahman Jul 16, 2018
ce4abaf
optimizing indexing
safwanrahman Jul 18, 2018
db51a90
fixing tests and signals
safwanrahman Jul 19, 2018
7993f80
[Fix #4407] Port Project Search for Elasticsearch 6.x
safwanrahman Jul 19, 2018
665cc08
Merge pull request #4408 from safwanrahman/project_search
ericholscher Jul 24, 2018
612cfb8
[Fix #4409] Disable autoindexing in local development
safwanrahman Jul 27, 2018
baf8421
fixup as per comments
safwanrahman Jul 27, 2018
143ce7f
Optimizing reindexing management command
safwanrahman Jul 27, 2018
abaeade
adding migration
safwanrahman Jul 27, 2018
9d6f201
fixup lint
safwanrahman Jul 27, 2018
652f869
fixup as per comments
safwanrahman Jul 30, 2018
463f9e2
Merge pull request #4368 from safwanrahman/comman
ericholscher Jul 31, 2018
879b59c
Adding Documentation for Search
safwanrahman Aug 3, 2018
5a3d9c8
fixup
safwanrahman Aug 3, 2018
bbf0973
remove unneeded management command
safwanrahman Aug 3, 2018
e51d580
fixup as per review
safwanrahman Aug 7, 2018
c752e44
adding docs for architecture
safwanrahman Aug 7, 2018
568d8c6
fixup
safwanrahman Aug 8, 2018
e923884
more fixup
safwanrahman Aug 10, 2018
2a726db
fixup
safwanrahman Aug 10, 2018
6b27161
Merge remote-tracking branch 'origin/master' into search_upgrade
ericholscher Sep 6, 2018
9a78698
Handle exceptions raised on `process_file`.
ericholscher Sep 6, 2018
cb06923
fixing the indexing
safwanrahman Sep 6, 2018
d06e57e
fix
safwanrahman Sep 6, 2018
9dbc572
fixup
safwanrahman Sep 6, 2018
9fcdfc4
fixup
safwanrahman Sep 6, 2018
ad2d174
fixing lint
safwanrahman Sep 6, 2018
a508020
Merge pull request #4615 from safwanrahman/search_fix
ericholscher Sep 7, 2018
36bb8cd
Merge pull request #4467 from safwanrahman/search_docs
ericholscher Sep 7, 2018
aa2fe7b
fixing migrations
safwanrahman Sep 15, 2018
21fed3a
fixing lint
safwanrahman Sep 15, 2018
bf6ccbe
deleting old search code
safwanrahman Sep 15, 2018
1417f86
Merge pull request #4635 from safwanrahman/delete_old
ericholscher Sep 17, 2018
295f91a
fixup as per review
safwanrahman Sep 20, 2018
87691fa
fixing test
safwanrahman Sep 20, 2018
bcbdd13
Add ability to specify queue for indexing
ericholscher Sep 21, 2018
d5d7f7d
Fix test length
ericholscher Sep 21, 2018
83db570
Merge remote-tracking branch 'origin/master' into search_upgrade
ericholscher Sep 21, 2018
4b7c88d
Update common
ericholscher Sep 21, 2018
669fe22
update documentation
safwanrahman Sep 21, 2018
d1bba06
adding documentations for settings
safwanrahman Sep 21, 2018
bf71fb4
Merge branch 'master' into search_upgrade
ericholscher Sep 28, 2018
2 changes: 1 addition & 1 deletion .travis.yml
@@ -4,7 +4,7 @@ python:
   - 3.6
 sudo: false
 env:
-  - ES_VERSION=1.3.9 ES_DOWNLOAD_URL=https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz
+  - ES_VERSION=6.2.4 ES_DOWNLOAD_URL=https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz
 matrix:
   include:
     - python: 3.6
6 changes: 6 additions & 0 deletions conftest.py
@@ -1,5 +1,7 @@
# -*- coding: utf-8 -*-
import pytest
from django.conf import settings
from rest_framework.test import APIClient

try:
    # TODO: this file is read/executed even when called from ``readthedocsinc``,
@@ -44,3 +46,7 @@ def pytest_configure(config):
@pytest.fixture(autouse=True)
def settings_modification(settings):
    settings.CELERY_ALWAYS_EAGER = True

@pytest.fixture
def api_client():
    return APIClient()
108 changes: 0 additions & 108 deletions docs/custom_installs/elasticsearch.rst

This file was deleted.

110 changes: 110 additions & 0 deletions docs/development/search.rst
@@ -0,0 +1,110 @@
Search
======

Read the Docs uses Elasticsearch_ instead of the built-in Sphinx search to provide better search
results. Documents are indexed in the Elasticsearch index, and searches are made through the API.
All the search code is open source and lives in the `GitHub Repository`_.
We currently use `Elasticsearch 6.3`_.

Local Development Configuration
-------------------------------

Installing and running Elasticsearch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You need to install and run Elasticsearch_ version 6.3 on your local development machine.
You can find the installation instructions
`here <https://www.elastic.co/guide/en/elasticsearch/reference/6.3/install-elasticsearch.html>`_.
Alternatively, you can start an Elasticsearch Docker container by running the following command::

    docker run -p 9200:9200 -p 9300:9300 \
           -e "discovery.type=single-node" \
           docker.elastic.co/elasticsearch/elasticsearch:6.3.2
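
Once the node is running, a quick way to confirm that it is reachable and reports the expected version is to read the JSON served at the root endpoint. The following is a minimal sketch, not part of the Read the Docs code base: the helper names ``parse_es_version`` and ``fetch_es_version`` are hypothetical, and the URL assumes the port mapping used in the Docker command above.

```python
import json
from urllib.request import urlopen


def parse_es_version(root_response):
    """Return (major, minor) from the JSON served at the Elasticsearch root URL."""
    number = root_response["version"]["number"]  # e.g. "6.3.2"
    major, minor = number.split(".")[:2]
    return int(major), int(minor)


def fetch_es_version(url="http://127.0.0.1:9200"):
    """Query a running node; raises URLError if nothing is listening on the port."""
    with urlopen(url, timeout=5) as response:
        return parse_es_version(json.load(response))


# Offline illustration with a trimmed-down sample of the root response.
sample = {"name": "node-1", "version": {"number": "6.3.2"}}
print(parse_es_version(sample))
```

Calling ``fetch_es_version()`` against the local container should return the same tuple if the 6.3 image is the one running.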

Indexing into Elasticsearch
^^^^^^^^^^^^^^^^^^^^^^^^^^^
To use search, you need to index data into the Elasticsearch index. Run the ``reindex_elasticsearch``
management command::

    ./manage.py reindex_elasticsearch

For performance reasons, we implemented our own management command rather than using
the built-in management command provided by the `django-elasticsearch-dsl`_ package.

Auto Indexing
^^^^^^^^^^^^^
By default, auto indexing is turned off in development mode. To turn it on, change the
``ELASTICSEARCH_DSL_AUTOSYNC`` setting to `True` in the `readthedocs/settings/dev.py` file.
After that, whenever documentation builds successfully or a project is added,
the search index will update automatically.
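
As a rough illustration, the change amounts to a single assignment. This sketch assumes a plain module-level setting; the actual ``dev.py`` may organize its settings differently (for example, as attributes of a settings class), in which case the same name would be set there instead.

```python
# readthedocs/settings/dev.py (sketch only; the real file's layout is an assumption)

# Re-enable automatic index updates in local development.
ELASTICSEARCH_DSL_AUTOSYNC = True
```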


Architecture
------------
The search architecture is divided into two parts.
One part is responsible for **indexing** the documents and projects, and
the other part is responsible for querying the index to show the proper results to users.
We mostly rely on the `django-elasticsearch-dsl`_ package to keep search working.
`django-elasticsearch-dsl`_ is a wrapper around `elasticsearch-dsl`_ for easy configuration
with Django.

Indexing
^^^^^^^^
All Sphinx documents are indexed into Elasticsearch after a successful build.
Currently, we do not index MkDocs documents into Elasticsearch, but
`any kind of help is welcome <https://github.com/rtfd/readthedocs.org/issues/1088>`_.

How we index documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

After a build finishes successfully, `HTMLFile` objects are created for each of the
``HTML`` files and the old version's `HTMLFile` objects are deleted. By default, the
`django-elasticsearch-dsl`_ package listens to the `post_save`/`post_delete` signals
to index/delete documents, but this has a performance drawback: it sends an HTTP request whenever
an `HTMLFile` object is created or deleted. To avoid that, the `bulk_post_create`
and `bulk_post_delete` Signals_ are dispatched with a list of `HTMLFile` objects, so it is possible
to bulk index documents in Elasticsearch (`bulk_post_create` is dispatched for created objects
and `bulk_post_delete` for deleted objects). Both signals pass the list of `HTMLFile`
instances in the `instance_list` parameter.

We listen to the `bulk_post_create` and `bulk_post_delete` signals in our `Search` application
and index/delete the documentation content from the `HTMLFile` instances.
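
The bulk-signal flow described above can be sketched in plain Python. This is not the project's implementation: ``SimpleSignal`` is a hypothetical stand-in for a Django signal, the receivers only record what a real listener would send to Elasticsearch, and files are represented as small dictionaries rather than `HTMLFile` instances. Only the signal names and the ``instance_list`` parameter come from the text above.

```python
class SimpleSignal:
    """Tiny stand-in for a Django Signal (hypothetical, for illustration only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Call every connected receiver with the same keyword arguments.
        return [receiver(sender=sender, **kwargs) for receiver in self._receivers]


bulk_post_create = SimpleSignal()
bulk_post_delete = SimpleSignal()

requests_sent = []  # one entry per simulated request to Elasticsearch


def index_html_files(sender, instance_list, **kwargs):
    """Index all created files with a single simulated bulk request."""
    requests_sent.append(("index", [f["path"] for f in instance_list]))


def remove_html_files(sender, instance_list, **kwargs):
    """Delete all removed files with a single simulated bulk request."""
    requests_sent.append(("delete", [f["path"] for f in instance_list]))


bulk_post_create.connect(index_html_files)
bulk_post_delete.connect(remove_html_files)

new_files = [{"path": "index.html"}, {"path": "api.html"}]
old_files = [{"path": "old.html"}]
bulk_post_create.send(sender="HTMLFile", instance_list=new_files)
bulk_post_delete.send(sender="HTMLFile", instance_list=old_files)
print(requests_sent)
```

The point of the design is visible in ``requests_sent``: three files changed, but only two requests were recorded, one per signal rather than one per object.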


How we index projects
~~~~~~~~~~~~~~~~~~~~~
We also index project information in our search index so that users can search for projects
from the main site. `django-elasticsearch-dsl`_ listens to the `post_save` and `post_delete` signals of
the `Project` model and indexes/deletes into Elasticsearch accordingly.


Elasticsearch Document
~~~~~~~~~~~~~~~~~~~~~~

`elasticsearch-dsl`_ provides a model-like wrapper for the `Elasticsearch document`_.
As required by `django-elasticsearch-dsl`_, our documents are defined in the
`readthedocs/search/documents.py` file.

**ProjectDocument:** It is used for indexing projects. The signal listener of
`django-elasticsearch-dsl`_ listens to the `post_save` signal of the `Project` model and
then indexes/deletes into Elasticsearch.

**PageDocument:** It is used for indexing the documentation of projects. By default, auto
indexing is turned off by `ignore_signals = settings.ES_PAGE_IGNORE_SIGNALS`.
`settings.ES_PAGE_IGNORE_SIGNALS` is `False` both in development and production.
As mentioned above, our `Search` app listens to the `bulk_post_create` and `bulk_post_delete`
signals and indexes/deletes documentation in Elasticsearch. The signal listeners are in
the `readthedocs/search/signals.py` file. Both signals are dispatched
after a successful documentation build.

The fields and ES datatypes are specified in the `PageDocument`. The indexable data is taken
from the `processed_json` property of `HTMLFile`. This property provides a Python dictionary with
document data such as `title`, `headers`, and `content`.
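
To make the mapping concrete, here is a small sketch of turning a ``processed_json``-style dictionary into a flat document body. The function ``build_page_document`` is hypothetical, and the ``project``, ``version`` and ``path`` fields are assumptions added for illustration; only ``title``, ``headers`` and ``content`` are named in the text above.

```python
def build_page_document(project, version, path, processed_json):
    """Map a processed_json-style dict onto a flat, indexable document body."""
    return {
        # Assumed identifying fields (not confirmed by the documentation).
        "project": project,
        "version": version,
        "path": path,
        # Fields named in the documentation; missing keys fall back to empty values.
        "title": processed_json.get("title", ""),
        "headers": processed_json.get("headers", []),
        "content": processed_json.get("content", ""),
    }


page = build_page_document(
    project="demo",
    version="latest",
    path="index.html",
    processed_json={"title": "Welcome", "headers": ["Install"], "content": "Hello"},
)
print(page["title"], page["headers"])
```

Using ``dict.get`` with defaults keeps the sketch tolerant of pages whose processed data lacks one of the keys.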


.. _Elasticsearch: https://www.elastic.co/products/elasticsearch
.. _Elasticsearch 6.3: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/index.html
.. _GitHub Repository: https://github.com/rtfd/readthedocs.org/tree/master/readthedocs/search
.. _Elasticsearch document: https://www.elastic.co/guide/en/elasticsearch/guide/current/document.html
.. _django-elasticsearch-dsl: https://github.com/sabricot/django-elasticsearch-dsl
.. _elasticsearch-dsl: http://elasticsearch-dsl.readthedocs.io/en/latest/
.. _Signals: https://docs.djangoproject.com/en/2.1/topics/signals/
1 change: 1 addition & 0 deletions docs/index.rst
@@ -95,6 +95,7 @@ Information about development is also available:

   changelog
   install
   development/search
   architecture
   tests
   docs
2 changes: 1 addition & 1 deletion docs/install.rst
@@ -57,7 +57,7 @@ need to install Python 2.7 with virtualenv in your system as well.
If you want full support for searching inside your Read the Docs
site you will need to install Elasticsearch_.

-Ubuntu users could install this package by following :doc:`/custom_installs/elasticsearch`.
+See the :doc:`/development/search` documentation for more instructions.

.. note::

78 changes: 78 additions & 0 deletions docs/settings.rst
@@ -100,3 +100,81 @@ ALLOW_ADMIN
Default: :djangosetting:`ALLOW_ADMIN`

Whether to include `django.contrib.admin` in the URL's.


ELASTICSEARCH_DSL
-----------------

Default:

.. code-block:: python

    {
        'default': {
            'hosts': '127.0.0.1:9200'
        },
    }

Settings for the Elasticsearch connection.
These settings are then passed to `elasticsearch-dsl-py.connections.configure`_.


ES_INDEXES
----------

Default:

.. code-block:: python

    {
        'project': {
            'name': 'project_index',
            'settings': {
                'number_of_shards': 5,
                'number_of_replicas': 0,
            }
        },
        'page': {
            'name': 'page_index',
            'settings': {
                'number_of_shards': 5,
                'number_of_replicas': 0,
            }
        },
    }

Defines the Elasticsearch name and settings of each index separately.
The key is the type of index, such as ``project`` or ``page``, and the value is another
dictionary containing ``name`` and ``settings``. Here, ``name`` is the index name
and ``settings`` is used for configuring that particular index.
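
A short sketch of how such a mapping might be read: the dictionary below copies the default shown above, while the helper ``get_index_config`` is hypothetical and only illustrates the key/value layout.

```python
ES_INDEXES = {
    "project": {
        "name": "project_index",
        "settings": {"number_of_shards": 5, "number_of_replicas": 0},
    },
    "page": {
        "name": "page_index",
        "settings": {"number_of_shards": 5, "number_of_replicas": 0},
    },
}


def get_index_config(index_type, indexes=ES_INDEXES):
    """Return (index name, index settings) for a given index type."""
    config = indexes[index_type]  # raises KeyError for an unknown type
    return config["name"], config["settings"]


name, index_settings = get_index_config("page")
print(name, index_settings["number_of_shards"])
```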


ES_TASK_CHUNK_SIZE
------------------

Default: :djangosetting:`ES_TASK_CHUNK_SIZE`

The maximum number of objects sent to each Elasticsearch indexing Celery task.
This is used when running the ``reindex_elasticsearch`` management command.
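
The effect of a chunk size can be illustrated with a small generator that splits a list of object IDs into task-sized batches. This is a sketch under assumptions: the ``chunk`` helper is hypothetical, and the value ``3`` is chosen only to keep the example short; it is not the project's default.

```python
def chunk(items, chunk_size):
    """Yield consecutive slices of ``items`` with at most ``chunk_size`` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


ES_TASK_CHUNK_SIZE = 3  # illustrative value only
object_ids = list(range(1, 9))

# Each inner list would be handed to one indexing task.
chunks = list(chunk(object_ids, ES_TASK_CHUNK_SIZE))
print(chunks)
```

A larger chunk size means fewer tasks with more work each; a smaller one spreads the work across more tasks.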


ES_PAGE_IGNORE_SIGNALS
----------------------

Default: ``False``

This setting determines whether each page is indexed separately into Elasticsearch.
If the setting is ``True``, ``HTML`` pages are not indexed one at a time but are
indexed in bulk instead.


ELASTICSEARCH_DSL_AUTOSYNC
--------------------------

Default: ``True``

This setting is used for automatically indexing objects into Elasticsearch.
It is ``False`` by default in development, so it is possible to create
projects and build documentation without having Elasticsearch running.


.. _elasticsearch-dsl-py.connections.configure: https://elasticsearch-dsl.readthedocs.io/en/stable/configuration.html#multiple-clusters
33 changes: 0 additions & 33 deletions readthedocs/core/management/commands/provision_elasticsearch.py

This file was deleted.
