Reapply the Elastic Search upgrade to master #4722

Merged
merged 24 commits into from
Jan 24, 2019
Changes from 18 commits
Commits
6a451ec
Reapply search upgrade to master.
ericholscher Dec 20, 2018
14a1c66
Remove hacked return value for testing
ericholscher Dec 20, 2018
2e36570
index project asynchronously
safwanrahman Dec 20, 2018
d2607e0
Merge pull request #5023 from safwanrahman/async_project
ericholscher Dec 20, 2018
ab023f7
Update docembed
ericholscher Dec 21, 2018
dbbd240
Merge remote-tracking branch 'origin/master' into search-reapply
ericholscher Jan 21, 2019
7308daf
Fix merge syntax
ericholscher Jan 21, 2019
12de58b
Don't do two searches
ericholscher Jan 21, 2019
46c4f20
Merge remote-tracking branch 'origin/master' into search-reapply
ericholscher Jan 22, 2019
60714be
Lint search branch
ericholscher Jan 22, 2019
c8adfb4
Fix env
ericholscher Jan 22, 2019
2d28826
Fix lint issues
ericholscher Jan 22, 2019
bb4db91
Update migration name
ericholscher Jan 22, 2019
61f3d3b
Merge remote-tracking branch 'origin/master' into search-reapply
ericholscher Jan 24, 2019
e289926
Adjust shards & replicas to use less memory
ericholscher Jan 24, 2019
9370e20
Remove test that was testing deleted code
ericholscher Jan 24, 2019
1ebe494
Properly use the HTML encoder on searches.
ericholscher Jan 24, 2019
2063d2d
Add an XSS test
ericholscher Jan 24, 2019
75c4ae3
Index HTMLDir projects properly
ericholscher Jan 24, 2019
46bc58f
HTMLDir excluded files
ericholscher Jan 24, 2019
f6523d4
Don't exclude all pages based on name
ericholscher Jan 24, 2019
02741ba
Remove hack doing only OR
ericholscher Jan 24, 2019
7d0e58c
Fix test
ericholscher Jan 24, 2019
444da14
New object syntax
ericholscher Jan 24, 2019
2 changes: 1 addition & 1 deletion .travis.yml
@@ -4,7 +4,7 @@ python:
 matrix:
   include:
     - python: 3.6
-      env: TOXENV=py36 ES_VERSION=1.3.9 ES_DOWNLOAD_URL=https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz
+      env: TOXENV=py36 ES_VERSION=6.2.4 ES_DOWNLOAD_URL=https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz
     - python: 3.6
       env: TOXENV=docs
     - python: 3.6
6 changes: 6 additions & 0 deletions conftest.py
@@ -1,5 +1,7 @@
# -*- coding: utf-8 -*-
import pytest
from django.conf import settings
from rest_framework.test import APIClient

try:
# TODO: this file is read/executed even when called from ``readthedocsinc``,
@@ -44,3 +46,7 @@ def pytest_configure(config):
@pytest.fixture(autouse=True)
def settings_modification(settings):
    settings.CELERY_ALWAYS_EAGER = True

@pytest.fixture
def api_client():
    return APIClient()
130 changes: 0 additions & 130 deletions docs/custom_installs/elasticsearch.rst

This file was deleted.

110 changes: 110 additions & 0 deletions docs/development/search.rst
@@ -0,0 +1,110 @@
Search
======

Read the Docs uses Elasticsearch_ instead of the built-in Sphinx search to provide better search
results. Documents are indexed in Elasticsearch, and searches are made through the API.
All of the search code is open source and lives in the `GitHub Repository`_.
We currently use `Elasticsearch 6.3`_.

Local Development Configuration
-------------------------------

Installing and running Elasticsearch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You need to install and run Elasticsearch_ version 6.3 on your local development machine.
Installation instructions are available in the
`Elasticsearch installation guide <https://www.elastic.co/guide/en/elasticsearch/reference/6.3/install-elasticsearch.html>`_.
Alternatively, you can start an Elasticsearch Docker container by running the following command::

    docker run -p 9200:9200 -p 9300:9300 \
           -e "discovery.type=single-node" \
           docker.elastic.co/elasticsearch/elasticsearch:6.3.2
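
As a quick way to confirm that a local node is reachable, the sketch below queries the Elasticsearch root endpoint and reads the reported version. The helper names ``parse_es_version`` and ``fetch_es_version`` are hypothetical, and the URL assumes the node listens on ``http://localhost:9200`` as in the Docker command; nothing here is part of the Read the Docs codebase.

```python
import json
from urllib.request import urlopen


def parse_es_version(payload):
    """Extract (major, minor, patch) from the root endpoint's JSON body."""
    number = json.loads(payload)["version"]["number"]
    # Drop any pre-release suffix such as "-beta1" before splitting.
    core = number.split("-")[0]
    return tuple(int(part) for part in core.split(".")[:3])


def fetch_es_version(url="http://localhost:9200"):
    """Ask a running node for its version (assumed local URL).

    Raises urllib.error.URLError if nothing is listening on the port.
    """
    with urlopen(url, timeout=5) as response:
        return parse_es_version(response.read().decode("utf-8"))
```

Calling ``fetch_es_version()`` against the container started with the image tag shown earlier should report a 6.3 release; a connection error usually means the node has not finished starting.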

Indexing into Elasticsearch
^^^^^^^^^^^^^^^^^^^^^^^^^^^
To use search, you need to index data into Elasticsearch. Run the ``reindex_elasticsearch``
management command::

    ./manage.py reindex_elasticsearch

For better performance, we implemented our own management command rather than using
the built-in one provided by the `django-elasticsearch-dsl`_ package.

Auto Indexing
^^^^^^^^^^^^^
By default, auto indexing is turned off in development mode. To turn it on, change the
``ELASTICSEARCH_DSL_AUTOSYNC`` setting to `True` in the `readthedocs/settings/dev.py` file.
After that, whenever documentation builds successfully or a project is added,
the search index is updated automatically.


Architecture
------------
The search architecture is divided into two parts.
One part is responsible for **indexing** the documents and projects, and
the other part is responsible for querying the index to show the proper results to users.
We rely mostly on the `django-elasticsearch-dsl`_ package to keep search working.
`django-elasticsearch-dsl`_ is a wrapper around `elasticsearch-dsl`_ for easy configuration
with Django.

Indexing
^^^^^^^^
All Sphinx documents are indexed into Elasticsearch after a successful build.
Currently, we do not index MkDocs documents into Elasticsearch, but
`any kind of help is welcome <https://github.com/rtfd/readthedocs.org/issues/1088>`_.

How we index documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

After a build finishes successfully, `HTMLFile` objects are created for each of the
``HTML`` files and the old version's `HTMLFile` objects are deleted. By default, the
`django-elasticsearch-dsl`_ package listens to the `post_save`/`post_delete` signals
to index/delete documents, but this has a performance cost because it sends an HTTP request
whenever an `HTMLFile` object is created or deleted. To avoid this, `bulk_post_create`
and `bulk_post_delete` Signals_ are dispatched with a list of `HTMLFile` objects so that
documents can be indexed in Elasticsearch in bulk (`bulk_post_create` is dispatched for created
objects and `bulk_post_delete` for deleted objects). Both signals pass
the list of `HTMLFile` instances in the `instance_list` parameter.

We listen to the `bulk_post_create` and `bulk_post_delete` signals in our `Search` application
and index/delete the documentation content from the `HTMLFile` instances.
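
The bulk signal flow can be illustrated with a small, dependency-free sketch. It is not the project's implementation: ``Signal`` is a tiny stand-in for ``django.dispatch.Signal``, ``FakeIndex`` is a hypothetical in-memory replacement for Elasticsearch that counts requests, and the receiver names are made up. Only the signal names and the ``instance_list`` parameter come from the description above.

```python
class Signal:
    """Minimal stand-in for django.dispatch.Signal (illustration only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Like Django, return a list of (receiver, response) pairs.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


class FakeIndex:
    """Hypothetical in-memory index; each bulk call counts as one request."""

    def __init__(self):
        self.docs = {}
        self.requests = 0

    def bulk_index(self, instances):
        self.requests += 1
        for instance in instances:
            self.docs[instance["id"]] = instance

    def bulk_delete(self, instances):
        self.requests += 1
        for instance in instances:
            self.docs.pop(instance["id"], None)


bulk_post_create = Signal()
bulk_post_delete = Signal()
index = FakeIndex()


def index_html_files(sender, instance_list, **kwargs):
    """Receiver: index every created file in a single request."""
    index.bulk_index(instance_list)


def remove_html_files(sender, instance_list, **kwargs):
    """Receiver: delete every removed file in a single request."""
    index.bulk_delete(instance_list)


bulk_post_create.connect(index_html_files)
bulk_post_delete.connect(remove_html_files)

# Plain dicts stand in for HTMLFile instances here.
files = [{"id": 1, "path": "index.html"}, {"id": 2, "path": "api.html"}]
bulk_post_create.send(sender="HTMLFile", instance_list=files)
indexed_after_create = len(index.docs)
bulk_post_delete.send(sender="HTMLFile", instance_list=files)
```

Each signal dispatch results in one request regardless of how many files the build produced, which is the saving over per-object signals.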


How we index projects
~~~~~~~~~~~~~~~~~~~~~
We also index project information in our search index so that users can search for projects
from the main site. `django-elasticsearch-dsl`_ listens to the `post_save` and `post_delete`
signals of the `Project` model and indexes or deletes the project in Elasticsearch accordingly.


Elasticsearch Document
~~~~~~~~~~~~~~~~~~~~~~

`elasticsearch-dsl`_ provides a model-like wrapper for the `Elasticsearch document`_.
As required by `django-elasticsearch-dsl`_, our documents are defined in the
`readthedocs/search/documents.py` file.

**ProjectDocument:** It is used for indexing projects. The signal listener of
`django-elasticsearch-dsl`_ listens to the `post_save` signal of the `Project` model and
then indexes or deletes the project in Elasticsearch.

**PageDocument**: It is used for indexing the documentation of projects. By default, auto
indexing is turned off by `ignore_signals = settings.ES_PAGE_IGNORE_SIGNALS`.
`settings.ES_PAGE_IGNORE_SIGNALS` is `False` both in development and production.
As mentioned above, our `Search` app listens to the `bulk_post_create` and `bulk_post_delete`
signals and indexes/deletes documentation in Elasticsearch. The signal listeners are in
the `readthedocs/search/signals.py` file. Both signals are dispatched
after a successful documentation build.

The fields and ES datatypes are specified in the `PageDocument`. The indexable data is taken
from the `processed_json` property of `HTMLFile`. This property provides a Python dictionary with
document data such as `title`, `headers`, and `content`.
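
To make the shape of that data concrete, here is a minimal sketch of turning a ``processed_json``-style dictionary into the body of an indexable document. The function name ``build_page_document`` is hypothetical, and the value types (a string title, a list of header strings, a string content) are assumptions; only the three key names come from the text above.

```python
def build_page_document(processed_json):
    """Pick the indexable fields out of a processed_json-style dict.

    Missing keys fall back to empty values so one incomplete page does
    not break a whole indexing batch. Keys other than these three are ignored.
    """
    return {
        "title": processed_json.get("title", ""),
        "headers": list(processed_json.get("headers", [])),
        "content": processed_json.get("content", ""),
    }


example = build_page_document(
    {
        "title": "Install",
        "headers": ["Requirements", "Quickstart"],
        "content": "Install the package with pip.",
        "path": "install",  # not one of the three documented fields
    }
)
```

Here ``example`` contains only ``title``, ``headers`` and ``content``; the extra ``path`` key is dropped in this sketch.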


.. _Elasticsearch: https://www.elastic.co/products/elasticsearch
.. _Elasticsearch 6.3: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/index.html
.. _GitHub Repository: https://github.com/rtfd/readthedocs.org/tree/master/readthedocs/search
.. _Elasticsearch document: https://www.elastic.co/guide/en/elasticsearch/guide/current/document.html
.. _django-elasticsearch-dsl: https://github.com/sabricot/django-elasticsearch-dsl
.. _elasticsearch-dsl: http://elasticsearch-dsl.readthedocs.io/en/latest/
.. _Signals: https://docs.djangoproject.com/en/2.1/topics/signals/
1 change: 1 addition & 0 deletions docs/index.rst
@@ -114,6 +114,7 @@ to help you create fantastic documentation for your project.

changelog
install
development/search
architecture
tests
docs
4 changes: 1 addition & 3 deletions docs/install.rst
@@ -19,7 +19,7 @@ Additionally Read the Docs depends on:
* `Redis`_
* `Elasticsearch`_ (only if you want full support for searching inside the site)

- * Ubuntu users could install this package by following :doc:`/custom_installs/elasticsearch`.
+ * Follow the :doc:`/development/search` documentation for more instructions.

.. note::

@@ -56,8 +56,6 @@ you need these libraries.

.. tab:: CentOS/RHEL 7

Install::

sudo yum install python-devel python-pip libxml2-devel libxslt-devel

.. tab:: Other OS
78 changes: 78 additions & 0 deletions docs/settings.rst
@@ -100,3 +100,81 @@ ALLOW_ADMIN
Default: :djangosetting:`ALLOW_ADMIN`

Whether to include `django.contrib.admin` in the URLs.


ELASTICSEARCH_DSL
-----------------

Default:

.. code-block:: python

    {
        'default': {
            'hosts': '127.0.0.1:9200'
        },
    }

Settings for the Elasticsearch connection.
These settings are passed to `elasticsearch-dsl-py.connections.configure`_.


ES_INDEXES
----------

Default:

.. code-block:: python

    {
        'project': {
            'name': 'project_index',
            'settings': {
                'number_of_shards': 5,
                'number_of_replicas': 0,
            }
        },
        'page': {
            'name': 'page_index',
            'settings': {
                'number_of_shards': 5,
                'number_of_replicas': 0,
            }
        },
    }

Defines the Elasticsearch name and settings of each index separately.
The key is the type of index, such as ``project`` or ``page``, and the value is another
dictionary containing ``name`` and ``settings``. Here ``name`` is the index name
and ``settings`` is used for configuring that particular index.
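
As an illustration of how this structure can be consumed, the sketch below looks up the name and settings for one index type. The helper ``index_config`` is hypothetical and not part of the codebase; the dictionary simply repeats the default shown above.

```python
ES_INDEXES = {
    "project": {
        "name": "project_index",
        "settings": {"number_of_shards": 5, "number_of_replicas": 0},
    },
    "page": {
        "name": "page_index",
        "settings": {"number_of_shards": 5, "number_of_replicas": 0},
    },
}


def index_config(kind, indexes=ES_INDEXES):
    """Return (index name, settings dict) for an index type such as 'page'."""
    try:
        entry = indexes[kind]
    except KeyError:
        raise KeyError("No index configured for %r" % kind) from None
    # Copy the settings so callers cannot mutate the shared configuration.
    return entry["name"], dict(entry.get("settings", {}))


page_name, page_settings = index_config("page")
```

Keeping the lookup in one place means the index name and its shard and replica counts can be changed in settings without touching the code that creates the index.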


ES_TASK_CHUNK_SIZE
------------------

Default: :djangosetting:`ES_TASK_CHUNK_SIZE`

The maximum number of items sent to each Elasticsearch indexing Celery task.
This is used when running the ``reindex_elasticsearch`` management command.
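
The effect of a chunk size can be sketched with a small generator that splits a sequence of object IDs into batches, one batch per task. The name ``chunked`` and the sample values are hypothetical; how the real management command batches its work is not shown here.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Seven object IDs with a chunk size of 3 give three batches,
# so three indexing tasks would be queued in this illustration.
batches = list(chunked(range(1, 8), 3))
```

A smaller chunk size produces more, lighter tasks; a larger one produces fewer tasks that each hold more data in memory.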


ES_PAGE_IGNORE_SIGNALS
----------------------

Default: ``False``

This setting determines whether each page is indexed separately into Elasticsearch.
If the setting is ``True``, each ``HTML`` page will not be indexed separately but will be
indexed by bulk indexing.


ELASTICSEARCH_DSL_AUTOSYNC
--------------------------

Default: ``True``

This setting is used for automatically indexing objects into Elasticsearch.
It is ``False`` by default in development so that it is possible to create
projects and build documentation without having Elasticsearch.


.. _elasticsearch-dsl-py.connections.configure: https://elasticsearch-dsl.readthedocs.io/en/stable/configuration.html#multiple-clusters