Skip to content

Proxito: Don't hit storage for 404s #10617

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Aug 23, 2023
Merged

Proxito: Don't hit storage for 404s #10617

merged 6 commits into from
Aug 23, 2023

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Aug 9, 2023

We have all the information we need in the DB already.

There is one call to storage left, done by the robots.txt view, but I'm thinking of leaving that one out while we test this new way of handling files? I can also include it here, isn't a big change.

I also did a quick test on production checking connection.queries and .explain(), queries seem to be fast (0.001)

Closes #10512

We have all the information we need in the DB already.

Closes #10512
@stsewd stsewd force-pushed the use-db-instead-of-storage branch from c4e0d06 to 9a1bd6c Compare August 9, 2023 18:37
@stsewd stsewd marked this pull request as ready for review August 10, 2023 16:58
@stsewd stsewd requested a review from a team as a code owner August 10, 2023 16:58
@stsewd stsewd requested a review from humitos August 10, 2023 16:58
@stsewd
Copy link
Member Author

stsewd commented Aug 10, 2023

There is one call to storage left, done by the robots.txt view, but I'm thinking of leaving that one out while we test this new way of handling files? I can also include it here, isn't a big change.

Oh, we can't use the DB for that yet, we aren't tracking .txt files, only .html files.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. However, I wasn't sure if this PR was ready to review or not since it seems there are some work to do here still.

Oh, we can't use the DB for that yet, we aren't tracking .txt files, only .html files.

If you are not going to implement this on this PR, please create an issue referencing to this one to track this work.

Comment on lines +680 to +684
available_404_files = list(
HTMLFile.objects.filter(
version__in=versions_404, path__in=tryfiles
).values_list("version__slug", "path")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know how performant is this query currently? I understand this table is pretty huge right now and we are thinking about shrinking it. After that, I suppose it will improve this time a lot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we have an issue to track this work? If not, we should create one so we reduce the data we save in this table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [22]: p = Project.objects.get(slug='docs')

In [23]: stable = p.versions.get(slug='stable')

In [24]: latest = p.versions.get(slug='latest')

In [25]: tryfiles = ['404.html', '404/index.html', 'intro/getting-started-with-mkdocs.html', 'intro/getting-started-foo/index.html']

In [26]: HTMLFile.objects.filter(version__in=[stable, latest], path__in=tryfiles).values_list('version__slug', 'path')

{'sql': 'SELECT "builds_version"."slug", "projects_importedfile"."path" FROM "projects_importedfile" INNER JOIN "builds
_version" ON ("projects_importedfile"."version_id" = "builds_version"."id") WHERE ("projects_importedfile"."name"::text
 LIKE \'%.html\' AND "projects_importedfile"."path" IN (\'404.html\', \'404/index.html\', \'intro/getting-started-with-
mkdocs.html\', \'intro/getting-started-foo/index.html\') AND "projects_importedfile"."version_id" IN (2604018, 2603893)
) LIMIT 21',
 'time': '0.005'}

This is a query with 4 files and two versions, it takes between 0.001 and 0.005. My only worry is that we will have an increase in queries to the DB, but probably isn't that bad, since these requests will be cached at the CDN level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we have an issue to track this work? If not, we should create one so we reduce the data we save in this table.

#10623

If you are not going to implement this on this PR, please create an issue referencing to this one to track this work.

#10659

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query can be faster based on a DB cache if it was run multiple times, so we'll have to check how fast it averages our in production, but it'll definitely be faster than hitting storage!

@humitos
Copy link
Member

humitos commented Aug 23, 2023

I'm happy to move forward with this 👍🏼

@stsewd stsewd merged commit dcfc862 into main Aug 23, 2023
@stsewd stsewd deleted the use-db-instead-of-storage branch August 23, 2023 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proxito 404: check for files in the DB instead of storage
3 participants