Upgrade Elasticsearch to version 6.x #4211

Merged: 16 commits, merged on Jun 19, 2018
Changes from 12 commits
3 changes: 2 additions & 1 deletion .travis.yml
@@ -4,7 +4,7 @@ python:
- 3.6
sudo: false
env:
- ES_VERSION=1.3.9 ES_DOWNLOAD_URL=https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz
- ES_VERSION=6.2.4 ES_DOWNLOAD_URL=https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-${ES_VERSION}.tar.gz
matrix:
include:
- python: 2.7
@@ -42,3 +42,4 @@ notifications:
branches:
only:
- master
- search_upgrade
Member:
Interesting :)

3 changes: 2 additions & 1 deletion readthedocs/projects/admin.py
@@ -15,7 +15,7 @@

from .forms import FeatureForm
from .models import (Project, ImportedFile, Feature,
ProjectRelationship, EmailHook, WebHook, Domain)
ProjectRelationship, EmailHook, WebHook, Domain, HTMLFile)
from .notifications import ResourceUsageNotification
from .tasks import remove_dir

@@ -206,3 +206,4 @@ def project_count(self, feature):
admin.site.register(Feature, FeatureAdmin)
admin.site.register(EmailHook)
admin.site.register(WebHook)
admin.site.register(HTMLFile)
1 change: 1 addition & 0 deletions readthedocs/projects/apps.py
@@ -9,5 +9,6 @@ class ProjectsConfig(AppConfig):
def ready(self):
from readthedocs.projects import tasks
from readthedocs.worker import app

app.tasks.register(tasks.SyncRepositoryTask)
app.tasks.register(tasks.UpdateDocsTask)
7 changes: 7 additions & 0 deletions readthedocs/projects/managers.py
@@ -0,0 +1,7 @@
from django.db import models


class HTMLFileManager(models.Manager):

def get_queryset(self):
return super(HTMLFileManager, self).get_queryset().filter(name__endswith='.html')
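A plain-Python sketch of the narrowing this manager performs: every `HTMLFile` query starts from `ImportedFile` rows and keeps only names ending in `.html`. The file names below are made up for illustration.

```python
def only_html(names):
    # Mirror of HTMLFileManager's name__endswith='.html' filter,
    # applied to plain strings instead of a Django queryset.
    return [name for name in names if name.endswith('.html')]

files = ['index.html', 'search.html', 'logo.png', 'objects.inv']
html_files = only_html(files)
```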
37 changes: 36 additions & 1 deletion readthedocs/projects/models.py
@@ -7,13 +7,14 @@
import fnmatch
import logging
import os
from builtins import object # pylint: disable=redefined-builtin

from builtins import object # pylint: disable=redefined-builtin
from django.conf import settings
from django.contrib.auth.models import User
from django.core.urlresolvers import NoReverseMatch, reverse
from django.db import models
from django.utils.encoding import python_2_unicode_compatible
from django.utils.functional import cached_property
from django.utils.translation import ugettext_lazy as _
from future.backports.urllib.parse import urlparse # noqa
from guardian.shortcuts import assign
@@ -24,14 +25,17 @@
from readthedocs.core.utils import broadcast, slugify
from readthedocs.projects import constants
from readthedocs.projects.exceptions import ProjectConfigurationError
from readthedocs.projects.managers import HTMLFileManager
from readthedocs.projects.querysets import (
ChildRelatedProjectQuerySet, FeatureQuerySet, ProjectQuerySet,
RelatedProjectQuerySet)
from readthedocs.projects.templatetags.projects_tags import sort_version_aware
from readthedocs.projects.utils import find_file
from readthedocs.projects.validators import validate_domain_name, validate_repository_url
from readthedocs.projects.version_handling import (
determine_stable_version, version_windows)
from readthedocs.restapi.client import api
from readthedocs.search.parse_json import process_file
from readthedocs.vcs_support.backends import backend_cls
from readthedocs.vcs_support.utils import Lock, NonBlockingLock

@@ -910,6 +914,37 @@ def __str__(self):
return '%s: %s' % (self.name, self.project)


class HTMLFile(ImportedFile):

"""
Imported HTML file Proxy model.

This tracks only the HTML files for indexing to search.
"""

class Meta(object):
proxy = True

objects = HTMLFileManager()

@cached_property
def json_file_path(self):
basename = os.path.splitext(self.path)[0]
file_path = basename + '.fjson'

full_json_path = self.project.get_production_media_path(type_='json',
version_slug=self.version.slug,
include_file=False)

file_path = os.path.join(full_json_path, file_path)
return file_path

@cached_property
Member:
How long is this cached for? Will we really see value in caching it vs. the tradeoff of keeping a bunch of JSON in memory?

Member Author (@safwanrahman, Jun 14, 2018):
As per the documentation, the cached result persists as long as the instance does. Since it is called multiple times for the same instance while indexing (e.g. path, content, headers), I think the cache helps speed up indexing considerably.

def processed_json(self):
file_path = self.json_file_path
return process_file(file_path)


class Notification(models.Model):
project = models.ForeignKey(Project,
related_name='%(class)s_notifications')
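The path arithmetic in `json_file_path` can be sketched without a Django model: swap the `.html` extension for `.fjson` and root the result under the project's JSON media directory. Here `media_root` stands in for `project.get_production_media_path(...)`, and the concrete paths are hypothetical.

```python
import os

def json_file_path(html_path, media_root):
    # Same two steps as HTMLFile.json_file_path: drop the extension,
    # then join the .fjson name onto the JSON media directory.
    basename = os.path.splitext(html_path)[0]
    return os.path.join(media_root, basename + '.fjson')

path = json_file_path('guides/install.html', '/media/json/pip/latest')
```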
3 changes: 2 additions & 1 deletion readthedocs/projects/signals.py
@@ -2,10 +2,11 @@

from __future__ import absolute_import
import django.dispatch
from django.db.models.signals import pre_save
from django.dispatch import receiver

from readthedocs.oauth.utils import attach_webhook

from .models import HTMLFile

before_vcs = django.dispatch.Signal(providing_args=["version"])
after_vcs = django.dispatch.Signal(providing_args=["version"])
19 changes: 15 additions & 4 deletions readthedocs/projects/tasks.py
@@ -8,6 +8,7 @@
from __future__ import absolute_import

import datetime
import fnmatch
import hashlib
import json
import logging
@@ -29,7 +30,7 @@

from .constants import LOG_TEMPLATE
from .exceptions import RepositoryError
from .models import ImportedFile, Project, Domain
from .models import ImportedFile, Project, Domain, HTMLFile
from .signals import before_vcs, after_vcs, before_build, after_build, files_changed
from readthedocs.builds.constants import (LATEST,
BUILD_STATE_CLONING,
@@ -943,18 +944,23 @@ def _manage_imported_files(version, path, commit):
changed_files = set()
for root, __, filenames in os.walk(path):
for filename in filenames:
if fnmatch.fnmatch(filename, '*.html'):
model_class = HTMLFile
else:
model_class = ImportedFile
Member:
I don't believe this is needed, since it's all the same model in the database.

Member Author:
The problem is actually with the signal manager. I have opened django-es/django-elasticsearch-dsl#111 about this.

Until it has been fixed, we need to have a proxy model for this purpose, I believe.


dirpath = os.path.join(root.replace(path, '').lstrip('/'),
filename.lstrip('/'))
full_path = os.path.join(root, filename)
md5 = hashlib.md5(open(full_path, 'rb').read()).hexdigest()
try:
obj, __ = ImportedFile.objects.get_or_create(
obj, __ = model_class.objects.get_or_create(
project=version.project,
version=version,
path=dirpath,
name=filename,
)
except ImportedFile.MultipleObjectsReturned:
except model_class.MultipleObjectsReturned:
log.warning('Error creating ImportedFile')
continue
if obj.md5 != md5:
@@ -963,6 +969,12 @@ def _manage_imported_files(version, path, commit):
if obj.commit != commit:
obj.commit = commit
obj.save()

# Delete the HTMLFile first from previous versions
HTMLFile.objects.filter(project=version.project,
version=version
).exclude(commit=commit).delete()

# Delete ImportedFiles from previous versions
ImportedFile.objects.filter(project=version.project,
version=version
@@ -1173,7 +1185,6 @@ def sync_callback(_, version_pk, commit, *args, **kwargs):
The first argument is the result from previous tasks, which we discard.
"""
fileify(version_pk, commit=commit)
update_search(version_pk, commit=commit)


@app.task()
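The per-file routing added to `_manage_imported_files` can be distilled standalone: `*.html` files go through the `HTMLFile` proxy (so the search-indexing signals fire) while everything else stays an `ImportedFile`, and each file is fingerprinted with an MD5 of its contents. The helper below is a hypothetical condensation, not code from the PR.

```python
import fnmatch
import hashlib

def classify_and_hash(filename, content):
    # Route *.html to the HTMLFile proxy model, everything else to
    # ImportedFile, and compute the md5 used for change detection.
    if fnmatch.fnmatch(filename, '*.html'):
        model_name = 'HTMLFile'
    else:
        model_name = 'ImportedFile'
    return model_name, hashlib.md5(content).hexdigest()
```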
22 changes: 13 additions & 9 deletions readthedocs/projects/utils.py
@@ -32,18 +32,22 @@ def version_from_slug(slug, version):
return v


def find_file(filename):
def find_file(basename, pattern, path):
"""
Recursively find matching file from the current working path.
Recursively find matching file.

:param file: Filename to match
:returns: A list of matching filenames.
:param basename: Basename of a file to match
:param pattern: Pattern to match
:param path: the directory to search for the file
:returns: path of matching file
"""
matches = []
for root, __, filenames in os.walk('.'):
for match in fnmatch.filter(filenames, filename):
matches.append(os.path.join(root, match))
return matches

for root, _, files in os.walk(path):
for filename in files:
file_basename = os.path.splitext(filename)[0]

if fnmatch.fnmatch(filename, pattern) and file_basename == basename:
return os.path.join(root, filename)
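The rewritten `find_file` is plain stdlib code, so a self-contained check of its behavior fits in a throwaway temp directory; the duplicate definition below exists only so the sketch runs on its own.

```python
import fnmatch
import os
import tempfile

def find_file(basename, pattern, path):
    # Identical logic to readthedocs.projects.utils.find_file above.
    for root, _, files in os.walk(path):
        for filename in files:
            file_basename = os.path.splitext(filename)[0]
            if fnmatch.fnmatch(filename, pattern) and file_basename == basename:
                return os.path.join(root, filename)

with tempfile.TemporaryDirectory() as tmp:
    for name in ('index.fjson', 'index.html', 'search.fjson'):
        open(os.path.join(tmp, name), 'w').close()
    found = find_file('index', '*.fjson', tmp)
```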


def run(*commands):
1 change: 1 addition & 0 deletions readthedocs/search/conf.py
@@ -0,0 +1 @@
SEARCH_EXCLUDED_FILE = ['search.html', 'genindex.html', 'py-modindex.html']
114 changes: 114 additions & 0 deletions readthedocs/search/documents.py
@@ -0,0 +1,114 @@
from django.db import models
from django_elasticsearch_dsl import DocType, Index, fields

from readthedocs.projects.models import Project, HTMLFile
from .conf import SEARCH_EXCLUDED_FILE

from readthedocs.search.faceted_search import ProjectSearch, FileSearch

project_index = Index('project')

project_index.settings(
number_of_shards=1,
number_of_replicas=0
)


@project_index.doc_type
class ProjectDocument(DocType):

class Meta(object):
model = Project
fields = ('name', 'slug', 'description')

url = fields.TextField(attr='get_absolute_url')
users = fields.NestedField(properties={
'username': fields.TextField(),
'id': fields.IntegerField(),
})
language = fields.KeywordField()

@classmethod
def faceted_search(cls, query, language=None, using=None, index=None):
kwargs = {
'using': using or cls._doc_type.using,
'index': index or cls._doc_type.index,
'doc_types': [cls],
'model': cls._doc_type.model,
'query': query
}
Member:
Is this logic required? It seems a bit heavy/complex.

Member Author:
I think we can keep this logic to stay aligned with the search method. Maybe it's not needed now, but it will be useful to keep it aligned.


if language:
kwargs['filters'] = {'language': language}

return ProjectSearch(**kwargs)


page_index = Index('page')

page_index.settings(
number_of_shards=1,
number_of_replicas=0
)


@page_index.doc_type
class PageDocument(DocType):

class Meta(object):
model = HTMLFile
fields = ('commit',)

project = fields.KeywordField(attr='project.slug')
version = fields.KeywordField(attr='version.slug')

title = fields.TextField(attr='processed_json.title')
headers = fields.TextField(attr='processed_json.headers')
content = fields.TextField(attr='processed_json.content')
path = fields.TextField(attr='processed_json.path')

@classmethod
def faceted_search(cls, query, projects_list=None, versions_list=None, using=None, index=None):
kwargs = {
'using': using or cls._doc_type.using,
'index': index or cls._doc_type.index,
'doc_types': [cls],
'model': cls._doc_type.model,
'query': query
}
filters = {}

if projects_list:
filters['project'] = projects_list
if versions_list:
filters['version'] = versions_list

kwargs['filters'] = filters

return FileSearch(**kwargs)

def get_queryset(self):
"""Overwrite the default queryset to filter out certain files from indexing"""
queryset = super(PageDocument, self).get_queryset()

# Do not index files that belong to non-Sphinx projects
# Also do not index certain excluded files
queryset = (queryset.filter(project__documentation_type='sphinx')
.exclude(name__in=SEARCH_EXCLUDED_FILE))
return queryset

def update(self, thing, **kwargs):
Member:
When does this update actually get called? I believe it's a signal attached to the saving of the ImportedFile objects?

Member Author:
It's called every time the object is created, updated, or deleted, and also whenever we run the management command. The call comes from the registry.

"""Overwrite in order to index only certain files"""

# Objects that do not exist in the provided queryset should not be indexed
# TODO: remove this overwrite when the issue has been fixed
# See below link for more information
# https://github.com/sabricot/django-elasticsearch-dsl/issues/111
if isinstance(thing, HTMLFile):
# It's a model instance.
queryset = self.get_queryset()
obj = queryset.filter(pk=thing.pk)
if not obj.exists():
return None

return super(PageDocument, self).update(thing=thing, **kwargs)
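The filter-building in `PageDocument.faceted_search` is simple enough to isolate. This hypothetical helper shows exactly which keys end up in the `filters` kwarg handed to `FileSearch`: only facets that were actually requested become filters.

```python
def build_filters(projects_list=None, versions_list=None):
    # Same conditional dict-building as PageDocument.faceted_search:
    # omitted facets never appear as filter keys.
    filters = {}
    if projects_list:
        filters['project'] = projects_list
    if versions_list:
        filters['version'] = versions_list
    return filters
```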
30 changes: 30 additions & 0 deletions readthedocs/search/faceted_search.py
@@ -0,0 +1,30 @@
from elasticsearch_dsl import FacetedSearch, TermsFacet


class RTDFacetedSearch(FacetedSearch):
"""Overwrite the initialization in order to meet our needs"""

# TODO: Remove the overwrite when elastic/elasticsearch-dsl-py#916 is fixed
# See more: https://github.com/elastic/elasticsearch-dsl-py/issues/916

def __init__(self, using, index, doc_types, model, **kwargs):
self.using = using
self.index = index
self.doc_types = doc_types
self._model = model
super(RTDFacetedSearch, self).__init__(**kwargs)


class ProjectSearch(RTDFacetedSearch):
fields = ['name^5', 'description']
facets = {
'language': TermsFacet(field='language')
}


class FileSearch(RTDFacetedSearch):
fields = ['title^10', 'headers^5', 'content']
facets = {
'project': TermsFacet(field='project'),
'version': TermsFacet(field='version')
}
Member:
Love how simple this is.

Member Author:
Yes, me too! The thing is made so simple by using OOP concepts!

3 changes: 1 addition & 2 deletions readthedocs/search/indexes.py
@@ -19,7 +19,6 @@
import datetime

from elasticsearch import Elasticsearch, exceptions
from elasticsearch.helpers import bulk_index

from django.conf import settings

@@ -143,7 +142,7 @@ def bulk_index(self, data, index=None, chunk_size=500, parent=None,
docs.append(doc)

# TODO: This doesn't work with the new ES setup.
bulk_index(self.es, docs, chunk_size=chunk_size)
# bulk_index(self.es, docs, chunk_size=chunk_size)

def index_document(self, data, index=None, parent=None, routing=None):
doc = self.extract_document(data)
Member:
Guessing this entire file and other related code should be deleted?

Member Author:
Yes, it needs to be deleted. I will delete it once I implement the file searching functionality!
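If this file were kept alive instead of deleted, the removed `bulk_index` helper would need replacing: newer elasticsearch-py exposes `elasticsearch.helpers.bulk`, which consumes an iterable of action dicts. A sketch of shaping documents into those actions follows; the index and doc-type names are placeholders, and actually sending the actions requires a live cluster.

```python
def as_bulk_actions(docs, index, doc_type):
    # Shape plain document dicts into the action format that
    # elasticsearch.helpers.bulk(es, actions, chunk_size=...) expects.
    for doc in docs:
        yield {
            '_index': index,
            '_type': doc_type,
            '_source': doc,
        }

actions = list(as_bulk_actions([{'title': 'Install'}], 'page', 'doc'))
```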
