Design architecture of search #4094

safwanrahman · 2018-05-15T17:47:14Z

The search functionality hopefully will be rewritten in the upcoming months. Before we start, we need to make some architectural decision about it.

Currently the search functionality lives inside the readthedocs main application. But there are 2 kind of search. One resides inside project page and another one is on the readthedocs.org site. Both of the functionality is hosted on the core application.

I would like to propose to separate the search functionality in a different django application. So it will be easier to maintain the code as well as scale the application as per needed! The indexing as well as query will be made to the application through API and it will return the result through API.

Pros

Separate codebase easy to maintain
Scale separately accoding to resource needed
New contributors can jump on easily
Can be extended as per needed
Dont hamper the core application bandwith as per resource

Cons

Separate application means seperate deployment lifecycle
Separate resource is needed
Complex to keep search index in sync
Need to integrate it with the core application
Development setup will be complex to setup

What do you think @ericholscher @agjohnson @humitos @davidfischer @stsewd

davidfischer · 2018-05-15T23:33:52Z

I'll preface my part of the discussion by saying that I'm not extremely familiar with our current search setup.

I believe our search solution:

It should work with other documentation formats (eg. MkDocs but others in the future) in addition to just Sphinx. Perhaps this involves an API for adding to the search index or perhaps we have a crawler that crawls build artifacts.
Search should run in development out of the box even if it runs with limited functionality. I don't really know how to get our current search working in dev. I'm glad you also called out the dev setup but I think this is critical to search being a success. The fact that search doesn't run easily out of the box is part of the reason it isn't worked on much.
There should be a way to get between project-specific search and all-project search. For example, here is a sample project-specific search but there's no link there to search all of RTD. Should this involve theme changes?

humitos · 2018-05-16T14:46:20Z

I want to echo all @davidfischer has said (including the part of not being very familiar with how our current search works :) )

Besides that, some specific comments:

I would like to propose to separate the search functionality in a different django application. So it will be easier to maintain the code as well as scale the application as per needed!

I want to mention a little our experience with readthedocs-build as a separate library (not django app, but library) and the final decision to merge it in the core repository: readthedocs/readthedocs-build#46 because the amount of extra work, the duplicated code and coordinated upgrades. There are some experience there that we should consider before making the decision on splitting it in a new django instance.

Instead of a new instance, isn't it better to have a new django app inside the current project?

There should be a way to get between project-specific search and all-project search

Won't this kill the servers?

Update: I just found this on the readthedocs.org site. https://readthedocs.org/search/?q=memory&type=file . Not sure if it's exactly what you are talking about or not though.

Search should run in development out of the box even if it runs with limited functionality

Totally agree with this. I wasn't able to set up the search in my local instance either. I don't know if you both have access to the readthedocs-ext repo, but if you do, you should check the search app (https://github.com/readthedocs/readthedocs-ext/tree/master/readthedocsext/search) where there are some code related to the search with probably help to understand the missing part of the search chain.

Would be good to have also this extra points:

good documentation of the structure, how the search works from the technical side (which will help folks to become a contributor)
good test suite covering most of the corner cases and normal cases

ericholscher · 2018-05-16T19:05:54Z

Follow up

Won't this kill the servers?

No. I think David was just suggesting a UI element that would let you expand the search, similar to what GitHub does.

This is a separate issue though, and should be raised elsewhere to continue discussion.

Search should run in development out of the box even if it runs with limited functionality

This is hard, because of Elastic Search. We can do some work around docker compose or something that could make this easier, but it will always require having an additional server process running as currently designed.

We could consider switching from ES to Postgres full-text search, but I think that's a larger issue, and I'm not fully convinced the Postgres will give us everything we need. Django does have extensions for Postgres' full-text search though (https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/), and we could use a separate Postgres instance in prod.

If we are still depending on ES, the local setup is always going to be decently complicated. The other option would be do implement again using something like haystack (http://django-haystack.readthedocs.io/en/master/). We moved away from this in the past because it wasn't being actively developed, and it didn't support new versions of ES that I can tell. It also means that we get a lot of support issues asking us to support other search backends, which isn't something we really want to do. It would give us a way to support local dev easier though.

Feedback

Feedback on a few things

In production, running search as a separate process

This is possible to do without having a separate repo for dev. For instance, we run api, web, and celery nodes and scale them separately in prod, all from the same codebase. We could do this similarly with search.

We also run the redirect app from a separate Flask codebase in prod, which is an example of us doing this and it working reasonably well. Having that split off has caused some confusion in the past though.

In development, depending on the RTD codebase

I think we should build the search functionality so that it could be run outside of RTD. This means not importing or depending on the actual RTD models as they exist in this codebase. It could live inside the RTD repo, but not depend on the Django models.

If we built this in another repo, that certainly makes that split very obvious. We could have the RTD codebase depend on a version (similar to readthedocs-build), and just use it like another django reusable app that would be developed by a third party.

Ease of development

The main value that I see in having it split out is allowing people to do development on it without depending on RTD itself. I think that is a good long term goal, but doesn't even really require having it in a separate repo if we design it properly.

Final thoughts

I'm still a bit torn on this. I think having it be it's own repo would be a nice thing. I do worry about the deployment and development overhead though.

I think I'm leaning towards having it included in the RTD repo for now, but designing it so that if we wanted to move it out in the future, we could do it with minimal effort.

safwanrahman · 2018-05-16T19:31:17Z

It should work with other documentation formats (eg. MkDocs but others in the future) in addition to just Sphinx. Perhaps this involves an API for adding to the search index or perhaps we have a crawler that crawls build artifacts.

Yes. I hope to work to index other documentation formats after making some progress with the current Sphinx one.

Search should run in development out of the box even if it runs with limited functionality. I don't really know how to get our current search working in dev. I'm glad you also called out the dev setup but I think this is critical to search being a success. The fact that search doesn't run easily out of the box is part of the reason it isn't worked on much.

I will eventually give a try to dockerize the application so all the stack can be run in separate container seamlessly. Maybe, in future, we need to work heavily in search feature because its much needed for big documentations!

There should be a way to get between project-specific search and all-project search. For example, here is a sample project-specific search but there's no link there to search all of RTD. Should this involve theme changes?

Currently, its possible to search project names in the core application and project search in project page. I think in future we can also add all the project search in our core application.

safwanrahman · 2018-05-18T21:17:04Z

@ericholscher

I think I'm leaning towards having it included in the RTD repo for now, but designing it so that if we wanted to move it out in the future, we could do it with minimal effort.

If we keep it inside the the RTD repo, we can use many built in signals of models like post_save and other in order to index in elasticsearch. Thats a good point also!

benjaoming · 2018-05-29T12:56:29Z

I have referenced this issue in our downstream docs project (see above) -- because it seems related: It has been flagged that search results need some kind of scoring or we need a way to configure exclusion of directories or sets of files from the search results (because they for one reason or the other aren't nice to have in search results).

ericholscher · 2018-05-29T13:45:48Z

Having more project control over search is definitely a good addition that we should think about. We could likely support metadata on the page, or some kind of setting which could support this. Thanks for the idea @benjaoming.

agjohnson · 2018-05-31T21:55:21Z

I think the disadvantages outweigh the advantages with regard to splitting out this code. Namely:

Splitting out code actually makes development and deployment more confusing for core contributions. This has been the downfall of readthedocs-build and readthedocs_sphinx_ext suffers the same neglect.
RTD is the only consumer of this application, we don't want to make something generalized as this complicates development in a different way.

I agree with the points about maintaining this in our main code base -- we can make this code isolated and not dependent on core code, but the code still lives alongside our other code. I'm not sure we have a reason yet to run a separate instance, but if we do, this is absolutely an option.

I vote to plan on keeping this code in our main code base.

agjohnson · 2018-06-08T22:06:18Z

Without any additional feedback here, I think we've determined that splitting this out to a microservice, or out to a separate repository, doesn't have many benefits for us.

pauloxnet · 2019-10-04T08:27:42Z

I know this issue is closed, but I want to add information about the full-text search with Django and PostgreSQL

We could consider switching from ES to Postgres full-text search, but I think that's a larger issue, and I'm not fully convinced the Postgres will give us everything we need. Django does have extensions for Postgres' full-text search though (https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/), and we could use a separate Postgres instance in prod.

The website djangoproject.com has full-text search powered only by Django itself and PostgreSQL.

We removed ES from the stack 2 years ago and now we have a simplified stack, faster indexing and great search performance.
django/djangoproject.com#797

I'm working with RTD and my first thought when I started was: "Is ES really necessary ? There's already PostgreSQL on the stack with all data, why can't we use it to have a full-text search ?"

safwanrahman added Improvement Minor improvement to code Needed: design decision A core team decision is required labels May 15, 2018

benjaoming mentioned this issue May 29, 2018

remove vestigial py_modules code dump learningequality/kolibri#3711

Merged

14 tasks

agjohnson changed the title ~~[Search] Design architecture of search~~ Design architecture of search May 31, 2018

agjohnson added this to the 2.6 milestone May 31, 2018

agjohnson added the Accepted Accepted issue on our roadmap label Jun 8, 2018

agjohnson closed this as completed Jun 8, 2018

agjohnson removed the Accepted Accepted issue on our roadmap label Jun 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design architecture of search #4094

Design architecture of search #4094

safwanrahman commented May 15, 2018

davidfischer commented May 15, 2018

humitos commented May 16, 2018

ericholscher commented May 16, 2018

safwanrahman commented May 16, 2018

safwanrahman commented May 18, 2018

benjaoming commented May 29, 2018

ericholscher commented May 29, 2018

agjohnson commented May 31, 2018

agjohnson commented Jun 8, 2018

pauloxnet commented Oct 4, 2019

Design architecture of search #4094

Design architecture of search #4094

Comments

safwanrahman commented May 15, 2018

davidfischer commented May 15, 2018

humitos commented May 16, 2018

ericholscher commented May 16, 2018

Follow up

Feedback

In production, running search as a separate process

In development, depending on the RTD codebase

Ease of development

Final thoughts

safwanrahman commented May 16, 2018

safwanrahman commented May 18, 2018

benjaoming commented May 29, 2018

ericholscher commented May 29, 2018

agjohnson commented May 31, 2018

agjohnson commented Jun 8, 2018

pauloxnet commented Oct 4, 2019