Fix bug that caused search objects not to delete #5487
Conversation
The comment mostly explains it, but the delete objects code does another query: https://github.com/rtfd/readthedocs.org/blob/3340364802a8cdcc5f4145fbde2a5dd4f7087771/readthedocs/search/tasks.py#L59-L62 This should now fix search deletion.
I'm not sure that this will fix the issue. If the functions listening to bulk_post_delete are not executed immediately, delete_queryset.delete will be executed and we will have the same problem we are trying to fix here, won't we?
Django signals are processed synchronously, so it should work. I agree we should move to having this logic handled all in the same task, instead of through signals, so the ordering is clearer, but this should work. I also wish we had good tests for this, but I've never found a good way to add tests to the build process, and with search too. I think with a refactor we could at least mock things reasonably, but that's a bigger PR I think.
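The claim above — that signal receivers run inline before control returns to the sender — can be illustrated with a minimal plain-Python dispatcher that mimics the pattern (a sketch, not Django's actual implementation; the signal and receiver names are illustrative):

```python
# Minimal sketch of synchronous signal dispatch, mimicking the Django
# pattern discussed above (not Django's actual code).
class Signal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Receivers run inline, in connection order, before send() returns.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


bulk_post_delete = Signal()  # illustrative stand-in for the real signal
events = []


def remove_from_search_index(sender, instance_list, **kwargs):
    # By the time the queryset delete happens below, this has already run.
    events.append(("deindex", len(instance_list)))


bulk_post_delete.connect(remove_from_search_index)

bulk_post_delete.send(sender="HTMLFile", instance_list=[1, 2, 3])
events.append(("delete_queryset",))
# events is now [("deindex", 3), ("delete_queryset",)] -- the receiver
# ran before the delete, which is why the ordering in this PR is safe.
```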
@ericholscher I had tested it and it was working. What's the issue now?
@safwanrahman in prod, pages aren't being deleted from the index.
I think we have disabled the autosync; that may cause the issue!
No, the issue is as noted above: the queryset is being deleted before we query for the objects to remove them from the search index :)
@ericholscher Can you check the autosync settings in production?
Because we were deleting the object first, we were including it in the list
We query it again where I linked in the description of this PR.
@safwanrahman please take the time to read what I'm saying. We query it in the
@ericholscher Sorry for the misunderstanding. Is there any way to remove the DB query from there? Do you have any idea?
You are right. I was thinking of celery tasks. The PR looks correct to me.
@@ -1324,17 +1324,13 @@ def _manage_imported_files(version, path, commit):
        )
        # Keep the objects into memory to send it to signal
        instance_list = list(delete_queryset)
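The list() call in the diff above matters because querysets are lazy: a queryset iterated after its rows are deleted sees nothing, while list() snapshots the objects first. A plain-Python sketch of that behavior (FakeQuerySet is a hypothetical stand-in for a Django QuerySet, not real Django code):

```python
# Sketch of why list(delete_queryset) is needed: a lazy "queryset"
# evaluated after the delete finds no rows, while list() materializes
# the objects beforehand. (FakeQuerySet is illustrative only.)
class FakeQuerySet:
    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        # Lazy: "hits the db" each time it is iterated.
        return iter(self._rows)

    def delete(self):
        self._rows.clear()


qs = FakeQuerySet(["page1.html", "page2.html"])
snapshot = list(qs)   # evaluated now; survives the delete
qs.delete()
lazy_view = list(qs)  # evaluated after the delete: empty
# snapshot still holds both pages; lazy_view is [].
```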
Doesn't doing list() already hit the db?
Yea, we can probably remove this if we want, since the objects should exist now.
@ericholscher I have checked the git log and it seems like we were running. I think the ES objects should be removed after the object has been removed from the DB, so that in case there is an error removing the object from the DB, the object does not get removed from Elasticsearch.
I just copied the logic from the
I have gone through the code, and it seems like the task
Yeah. We can simplify the
I think keeping the 2 tasks similar in structure is probably more important than the order in which we delete objects from the DB or ES. Arguably it's more important to delete them from ES than the DB, since they will be deleted from the DB again on a later run of the code, but ES won't automatically delete a file if it gets missed. We should probably update the ES delete logic to be the same, and depend on commit, so it will properly delete old files if we hit an exception during one run of the code.
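The commit-based cleanup suggested here can be sketched in a few lines: remove every indexed file whose stored commit differs from the current build's commit, so stale entries from a failed earlier run still get cleaned up. (The in-memory dict stands in for Elasticsearch; all names are illustrative, not the project's actual code.)

```python
# Sketch of commit-based ES cleanup, as suggested above: delete any
# indexed document whose commit is not the current build's commit.
# (The "index" dict is an illustrative stand-in for Elasticsearch.)
index = {
    "index.html": {"commit": "abc123"},
    "old.html": {"commit": "zzz999"},  # left over from a failed run
}
current_commit = "abc123"

stale = [path for path, doc in index.items() if doc["commit"] != current_commit]
for path in stale:
    del index[path]
# Only documents from the current commit remain in the index.
```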
readthedocs/projects/tasks.py (Outdated)
        # Delete ImportedFiles from previous versions
        (
All HTMLFile are ImportedFile, but not all ImportedFile are HTMLFile. So we should delete the ImportedFile also. The HTMLFile was deleted above for dispatching signals, as HTMLFile is a proxy model.
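The subset relationship described in this comment is why a single delete isn't enough: removing only the HTMLFile rows (to dispatch the search signals) leaves the non-HTML ImportedFile rows behind, so a second delete is needed. A plain-Python sketch with illustrative names, not the actual readthedocs models:

```python
# Sketch of the HTMLFile / ImportedFile relationship described above:
# every HTMLFile is an ImportedFile, but not vice versa, so deleting
# only HTMLFile rows would leave other ImportedFile rows behind.
# (File names and the set-based "tables" are illustrative only.)
imported_files = {"index.html", "style.css", "script.js"}
html_files = {f for f in imported_files if f.endswith(".html")}

# Step 1: delete the HTMLFile queryset (this is where the search
# signals get dispatched for the HTML pages).
imported_files -= html_files

# Step 2: the remaining ImportedFile rows must be deleted as well.
leftover = set(imported_files)
imported_files.clear()
```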
If you would like to have both tasks in the same structure, I think it's better to pass the list in
Yeah. That may be a better idea in the long run. But we were trying to map
But right now there is no way to handle the case where we delete an HTMLFile object but the ES index doesn't get updated, which feels like a much bigger issue.
We can run a cronjob with celery-beat which will check every night whether any object got missed from ES, or anything is unavailable in the DB but still available in ES, and fix it accordingly. Or we can automatically run a reindex every night, which would be much better, since we have zero-downtime indexing, so no user will be impacted and the data will be consistent. Which one do you prefer?
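The nightly consistency check proposed here boils down to a two-way set difference between what the DB has and what the index has. A plain-Python sketch, with sets of IDs standing in for the real DB and ES queries (all names are illustrative):

```python
# Sketch of the nightly consistency check described above: find objects
# missing from the index, and index entries with no backing DB row,
# then "fix accordingly". (db_ids / es_ids are illustrative stand-ins
# for real DB and Elasticsearch queries.)
db_ids = {"page1", "page2", "page3"}
es_ids = {"page2", "page3", "stale-page"}

missing_from_es = db_ids - es_ids   # need to be (re)indexed
orphaned_in_es = es_ids - db_ids    # need to be deleted from ES

# Fix accordingly: drop the orphans, reindex the missing documents.
fixed_es = (es_ids - orphaned_in_es) | missing_from_es
# fixed_es now matches the DB exactly.
```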
I think we should make the code do the correct logic, not add a bunch of operational overhead to our servers :)
I agree! Our current code does the correct logic most of the time. But it's not always possible to guarantee it from the code end if ES gets degraded and we get timed out from Elasticsearch.
This reverts commit d4a4b8e.
I've made a bunch of improvements to this logic in #5290, so I will merge this for a quick fix, and that will be the larger fix around removing commits via ES.