Skip to content

Record search queries smartly #6088

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Sep 5, 2019
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion readthedocs/search/api.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import itertools
import logging

from django.utils import timezone
from rest_framework import generics, serializers
from rest_framework.exceptions import ValidationError
from rest_framework.pagination import PageNumberPagination
Expand Down Expand Up @@ -160,15 +161,19 @@ def list(self, request, *args, **kwargs):

project_slug = self.request.query_params.get('project', None)
version_slug = self.request.query_params.get('version', None)
query = self.request.query_params.get('q', '')
total_results = response.data.get('count', 0)
time = timezone.now()

query = self.request.query_params.get('q', '')
query = query.lower().strip()

# record the search query with a celery task
tasks.record_search_query.delay(
project_slug,
version_slug,
query,
total_results,
time,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how tasks are serialized, this may not work exactly as planned. I know celery supports Pickle and JSON and I'm not sure which one we are using. Datetimes are not JSON serializable by default although Django does have a way to do it. However, I'm not sure they come back as timezone aware datetimes after they do. Did you verify this?

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During local development, celery uses json serializer.

CELERY_RESULT_SERIALIZER = 'json'

(Docs: https://docs.celeryproject.org/en/latest/userguide/configuration.html?highlight=CELERY_RESULT_SERIALIZER#new-lowercase-settings)

they come back as timezone aware datetimes after they do

I am not sure how to verify this.
I did print the timezone.now() and the SQL query -- that looks pretty similar.

timezone.now(): 2019-08-23 08:45:42.841161+00:00
SQL Query:

SELECT "search_searchquery"."id", "search_searchquery"."created", "search_searchquery"."modified", "search_searchquery"."project_id", "search_searchquery"."version_id", "search_searchquery"."query"
FROM "search_searchquery"
INNER JOIN "projects_project"
ON ("search_searchquery"."project_id" = "projects_project"."id")
INNER JOIN "builds_version"
ON ("search_searchquery"."version_id" = "builds_version"."id")
WHERE ("projects_project"."slug" = rtd AND "builds_version"."slug" = latest AND "search_searchquery"."query" LIKE read% ESCAPE '\' AND "search_searchquery"."created" >= 2019-08-23 08:45:32.841161)
ORDER BY "search_searchquery"."created" DESC

)

return response
50 changes: 43 additions & 7 deletions readthedocs/search/tasks.py
Original file line number Diff line number Diff line change
Expand Up @@ -142,19 +142,45 @@ def delete_old_search_queries_from_db():


@app.task(queue='web')
def record_search_query(project_slug, version_slug, query, total_results):
"""Record search query in database."""
if not project_slug or not version_slug or not query or not total_results:
def record_search_query(project_slug, version_slug, query, total_results, time):
"""Record/update search query in database."""
if not project_slug or not version_slug or not query:
log.debug(
'Not recording the search query. Passed arguments: '
'project_slug: %s, version_slug: %s, query: %s, total_results: %s' % (
project_slug, version_slug, query, total_results
'project_slug: %s, version_slug: %s, query: %s, total_results: %s, time: %s' % (
project_slug, version_slug, query, total_results, time
)
)
return

project_qs = Project.objects.filter(slug=project_slug)
before_10_sec = time - timezone.timedelta(seconds=10)
partial_query_qs = SearchQuery.objects.filter(
project__slug=project_slug,
version__slug=version_slug,
query__startswith=query,
created__gte=before_10_sec,
).order_by('-created')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about the performance characteristics of this. I don't believe created is indexed although project and version are. This might be worth testing on the live system before merging. This might be ok but I'm not 100% confident without just timing a similar query in prod.


# if partial query exists,
# just update it instead of creating new SearchQuery object.
if partial_query_qs.exists():
obj = partial_query_qs.first()
obj.created = time
obj.query = query
obj.save()
return

# don't record query with zero results.
if not total_results:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this block isn't at the top of the function? I guess you could update the time on another query even if this one returns no results. However, that probably warrants a comment or something.

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this block isn't at the top of the function?

Yes. I wanted to update the timing and query even if no results are there. For example: readthed will return no results but I wanted to update the timing and query if there is a previous query of readth.

I guess you could update the time on another query even if this one returns no results. However, that probably warrants a comment or something.

I am not sure if I understood.

log.debug(
'Not recording search query because of zero results. Passed arguments: '
'project_slug: %s, version_slug: %s, query: %s, total_results: %s, time: %s' % (
project_slug, version_slug, query, total_results, time
)
)
return

project_qs = Project.objects.filter(slug=project_slug)
if not project_qs.exists():
log.debug(
'Not recording the search query because project does not exist. '
Expand All @@ -168,11 +194,21 @@ def record_search_query(project_slug, version_slug, query, total_results):
version_qs = Version.objects.filter(project=project, slug=version_slug)

if not version_qs.exists():
log.debug(
'Not recording the search query because version does not exist. '
'project_slug: %s, version_slug: %s' % (
project_slug, version_slug
)
)
return

version = version_qs.first()
SearchQuery.objects.create(

# make a new SearchQuery object.
obj = SearchQuery.objects.create(
project=project,
version=version,
query=query,
)
obj.created = time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be an auto field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... but the actual time is when the query was made.
Celery task are not executed immediately so the automated time will be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that's probably OK to use the auto field for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be querying the modifed field, and using that -- it will automatically update whenever it is saved: https://github.com/django-extensions/django-extensions/blob/master/django_extensions/db/models.py#L19 -- we should likely use that, instead of trying to change the created date.

obj.save()