Fix bug that caused search objects not to delete #5487
Conversation
The comment mostly explains it, but the delete objects code does another query: https://github.com/rtfd/readthedocs.org/blob/3340364802a8cdcc5f4145fbde2a5dd4f7087771/readthedocs/search/tasks.py#L59-L62 This should now fix search deletion.
I'm not sure that this will fix the issue. If the functions listening to bulk_post_delete are not executed immediately, delete_queryset.delete will be executed and we will have the same problem we are trying to fix here, won't we?
Django signals are processed synchronously, so it should work. I agree we should move to having this logic handled all in the same task, instead of through signals, so the ordering is clearer, but this should work. I also wish we had good tests for this, but I've never found a good way to add tests to the build process, and with search too. I think with a refactor we could at least mock things reasonably, but that's a bigger PR I think.
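The claim above — that signal receivers run inline before control returns to the sender — can be illustrated with a minimal plain-Python dispatcher that mimics the pattern (a sketch, not Django's actual implementation; the signal and receiver names are illustrative):

```python
# Minimal sketch of synchronous signal dispatch, mimicking the Django
# pattern discussed above (not Django's actual code).
class Signal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Receivers run inline, in connection order, before send() returns.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


bulk_post_delete = Signal()  # illustrative stand-in for the real signal
events = []


def remove_from_search_index(sender, instance_list, **kwargs):
    # By the time the queryset delete happens below, this has already run.
    events.append(("deindex", len(instance_list)))


bulk_post_delete.connect(remove_from_search_index)

bulk_post_delete.send(sender="HTMLFile", instance_list=[1, 2, 3])
events.append(("delete_queryset",))
# events is now [("deindex", 3), ("delete_queryset",)] -- the receiver
# ran before the delete, which is why the ordering in this PR is safe.
```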
@ericholscher I had tested it and it was working. What's the issue now?
@safwanrahman in prod, pages aren't being deleted from the index.
I think we have disabled the autosync; that may cause the issue!
No, the issue is as noted above: the queryset is being deleted before we query for the objects to remove them from the search index :)
@ericholscher Can you check the autosync settings in production?
Because we were deleting the object first, we were including it in the list
We query it again where I linked in the description of this PR.
@safwanrahman please take the time to read what I'm saying. We query it in the
@ericholscher Sorry for the misunderstanding. Is there any way to remove the DB query from there? Do you have any idea?
You are right. I was thinking of celery tasks. The PR looks correct to me.
@@ -1324,17 +1324,13 @@ def _manage_imported_files(version, path, commit):
        )
        # Keep the objects into memory to send it to signal
        instance_list = list(delete_queryset)
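The list() call in the diff above matters because querysets are lazy: a queryset iterated after its rows are deleted sees nothing, while list() snapshots the objects first. A plain-Python sketch of that behavior (FakeQuerySet is a hypothetical stand-in for a Django QuerySet, not real Django code):

```python
# Sketch of why list(delete_queryset) is needed: a lazy "queryset"
# evaluated after the delete finds no rows, while list() materializes
# the objects beforehand. (FakeQuerySet is illustrative only.)
class FakeQuerySet:
    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        # Lazy: "hits the db" each time it is iterated.
        return iter(self._rows)

    def delete(self):
        self._rows.clear()


qs = FakeQuerySet(["page1.html", "page2.html"])
snapshot = list(qs)   # evaluated now; survives the delete
qs.delete()
lazy_view = list(qs)  # evaluated after the delete: empty
# snapshot still holds both pages; lazy_view is [].
```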
Doesn't doing list() already hit the db?
Yea, we can probably remove this if we want, since the objects should exist now.
@ericholscher I have checked the git log and it seems like we were running. I think the ES objects should be removed after the object has been removed from the DB, so that in case there is an error removing the object from the DB, the object does not get removed from Elasticsearch.
I just copied the logic from the
I have gone through the code, and it seems like the task
Yeah. We can simplify the
I think keeping the 2 tasks similar in structure is probably more important than the order in which we delete objects from the DB or ES. Arguably it's more important to delete them from ES than the DB, since they will be deleted from the DB again on a later run of the code, but ES won't automatically delete a file if it gets missed. We should probably update the ES delete logic to be the same, and depend on commit, so it will properly delete old files if we hit an exception during one run of the code.
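The commit-based cleanup suggested here can be sketched in a few lines: remove every indexed file whose stored commit differs from the current build's commit, so stale entries from a failed earlier run still get cleaned up. (The in-memory dict stands in for Elasticsearch; all names are illustrative, not the project's actual code.)

```python
# Sketch of commit-based ES cleanup, as suggested above: delete any
# indexed document whose commit is not the current build's commit.
# (The "index" dict is an illustrative stand-in for Elasticsearch.)
index = {
    "index.html": {"commit": "abc123"},
    "old.html": {"commit": "zzz999"},  # left over from a failed run
}
current_commit = "abc123"

stale = [path for path, doc in index.items() if doc["commit"] != current_commit]
for path in stale:
    del index[path]
# Only documents from the current commit remain in the index.
```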
readthedocs/projects/tasks.py (Outdated)
        # Delete ImportedFiles from previous versions
        (
All HTMLFile are ImportedFile, but not all ImportedFile are HTMLFile. So we should delete the ImportedFile also. The HTMLFile was deleted above for dispatching signals, as HTMLFile is a proxy model.
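The subset relationship described in this comment is why a single delete isn't enough: removing only the HTMLFile rows (to dispatch the search signals) leaves the non-HTML ImportedFile rows behind, so a second delete is needed. A plain-Python sketch with illustrative names, not the actual readthedocs models:

```python
# Sketch of the HTMLFile / ImportedFile relationship described above:
# every HTMLFile is an ImportedFile, but not vice versa, so deleting
# only HTMLFile rows would leave other ImportedFile rows behind.
# (File names and the set-based "tables" are illustrative only.)
imported_files = {"index.html", "style.css", "script.js"}
html_files = {f for f in imported_files if f.endswith(".html")}

# Step 1: delete the HTMLFile queryset (this is where the search
# signals get dispatched for the HTML pages).
imported_files -= html_files

# Step 2: the remaining ImportedFile rows must be deleted as well.
leftover = set(imported_files)
imported_files.clear()
```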
If you would like to have both tasks in the same structure, I think it's better to pass the list in
Yeah. That may be a better idea in the long run. But we were trying to map
But right now there is no way to handle the case where we delete an HTMLFile object but the ES index doesn't get updated, which feels like a much bigger issue.
We can run a cronjob with celery-beat which will check every night whether any object got missed from ES, or anything is unavailable in the DB but still available in ES, and fix it accordingly. Or we can automatically run a reindex every night, which would be much better, since we have zero-downtime indexing, so no user will be impacted and the data will be consistent. Which one do you prefer?
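The nightly consistency check proposed here boils down to a two-way set difference between what the DB has and what the index has. A plain-Python sketch, with sets of IDs standing in for the real DB and ES queries (all names are illustrative):

```python
# Sketch of the nightly consistency check described above: find objects
# missing from the index, and index entries with no backing DB row,
# then "fix accordingly". (db_ids / es_ids are illustrative stand-ins
# for real DB and Elasticsearch queries.)
db_ids = {"page1", "page2", "page3"}
es_ids = {"page2", "page3", "stale-page"}

missing_from_es = db_ids - es_ids   # need to be (re)indexed
orphaned_in_es = es_ids - db_ids    # need to be deleted from ES

# Fix accordingly: drop the orphans, reindex the missing documents.
fixed_es = (es_ids - orphaned_in_es) | missing_from_es
# fixed_es now matches the DB exactly.
```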
I think we should make the code do the correct logic, not add a bunch of operational overhead to our servers :)
I agree! Our current code does the correct logic most of the time. But it's not always possible to guarantee it from the code end if ES gets degraded and we get timed out from Elasticsearch.
This reverts commit d4a4b8e.
I've made a bunch of improvements to this logic in #5290, so I will merge this for a quick fix, and that will be the larger fix around removing commits via ES.