Design doc: Better handling of docs URLs #9425

stsewd · 2022-07-11T21:16:05Z

Preview at https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html

ericholscher

This is a great PR. Thanks for putting so much work into it @stsewd! I need to process a bit more here, but I'd love to hear a bit more about how you're imagining the implementation here. Would it be the same approach of having a project.urlconf, but just the backend is different?

ericholscher · 2022-07-11T22:47:13Z

docs/dev/design/better-doc-urls-handling.rst

+  - Health check
+  - Proxied APIs
+  - robots and sitemap
+  - The ``page`` redirect


Maybe... The main use case here would be prefixing I think, as well as users proxying to us. We have hit issues in the past where we assume that we own the entire domain's hosting (docs.example.org/_/), but when there's a prefix, we likely want to serve it at {prefix}/_/, in case we aren't hosting the root domain in some configuration.

Ohh, yes, then people that proxy us are the main target of the custom prefix feature I'd say, and others will probably use it to change the subprojects prefix and use the normal serving.

ericholscher · 2022-07-11T22:49:05Z

docs/dev/design/better-doc-urls-handling.rst

+- Should we use the urlconf from the subproject when processing it?
+  This is an URL like ``/projects/subproject/custom/prefix/en/latest/index.html``.
+
+  I don't think that's useful, but it should be easy to support if needed.


We need to think more about this. I can see users who might want to have eg. just a verison and not a translation exposed on the subproject:

/projects/subproject/1.0/index.html which would serve subproject/$lang/1.0/index.html or similar. Would we take the custom URLconf for the subproject from the canonical project? I don't know..

This also raises the questions about what language we will use, or would this be a new type of projects? "Single translation projects".

Hrm, perhaps. I guess it would work similarly to single version projects, and just remove it from the URL?

But them, we need to restrict those projects from having translations, otherwise we will try to serve them from the domain of the main translation, and that link would be impossible to generate.

Instead of adding extra attributes to the project and telling people to enable/disable them, we can define the type of project via its URL pattern.

/<lang>/<version>/<filename>/

multilanguage

multiversion

/<lang>/<filename>

multilanguage

single version

/<version>/<filename>/

multiversion

single language

If the user tells us what the URLs will look like, we know exactly what type of project it is and how to parse and generate the URLs for them.

ericholscher · 2022-07-11T22:50:00Z

docs/dev/design/better-doc-urls-handling.rst

+  I don't think that's useful, but it should be easy to support if needed.
+
+- Should we support the page redirect when using a custom subproject prefix?
+  This is ``/{prefix}/subproject/page/index.html``.


Not a requirement, but would be useful if it isn't difficult. Would also be cool if users could customize it :)

/{prefix}/subproject/yolo-latest-version/index.html 🚀

docs/dev/design/better-doc-urls-handling.rst

humitos

This is good 💯

I need to think a little more about the whole document because it was hard to digest. I like the idea of catching all the URL at a single entry point. However, immediately after thinking about that scenario, I thought that we could invert the logic about how we are handling URL. Instead of guessing what type of project it is from the URL, we could get that data from the db using the project's slug and with that information apply the specific regex to the URL for that type of project. If it does not match, it's a 404

Then, if we decide to continue moving forward processing the request because the URL matches, we will have all the data about that URL into the regex's group (similar to the work that the unresolver does). With that information, we know exactly what file we have to serve.

Customers using custom URLs, only needs to define the pattern (regex) their URLs will have and proxito will use that custom regex instead of the pre-defined ones.

I also added some comments about building an extensible pattern where customers can modify the way their URLs look like, without being tied to the pattern /language/version/. Allowing them to go "multi-language and single-version" or "single-version and multi-language", etc.

docs/dev/design/better-doc-urls-handling.rst

humitos · 2022-07-13T14:37:29Z

docs/dev/design/better-doc-urls-handling.rst

+Instead of trying to map a URL to a view,
+we first analyze the canonical project (given from the subdomain),
+and based on that we map each part of the URL (parts are the result of splitting the URL on ``/``)
+to the *current* project.


I think we can still base our logic on regexs for simplicity. However, instead of applying the regex first and then deciding what type of project it's, we should invert the logic:

decide what type of project it is (single version, multi versions, etc) based on its slug (e.g. getting it from the subdomain, HTTP header, domain itself, or any other method that we support) and its canonical project

apply the correct regex based on that

Available regex to match to the URL once we know what type of project it is:

single version

multi versions

custom URL pattern

etc

if the specific regex does not match for the project type, we are sure it's a 404

This approach will still require you to test with several regexes, if a project has subprojects it doesn't necessarily mean that it will always serve a subproject, this will also have the same problem of reserved paths, like a single version project with subprojects won't be able to have a projects path, since the regex will match that path as a subproject path instead of a projects directory.

Also, we already have all the information, we don't want to keep guessing the current project (with this proposal we are still guessing, but just the requested file, the current project is already inferred from the canonical project and the URL).

problem of reserved paths, like a single version project with subprojects won't be able to have a projects path

I think this is a limitation of the design, not the implementation.

With any possible implementation, how you determine if the URL starting with a reserved path (e.g. /projects/section/page.html) is for the single version project or from a subproject of it?

Also, we already have all the information, we don't want to keep guessing the current project

I don't follow you on this. Why are you saying that we will be guessing the current project when using regex but we won't when parsing the URL manually by splitting it by /?

humitos · 2022-07-13T14:59:02Z

docs/dev/design/better-doc-urls-handling.rst

+We always keep the lang/version/filename order,
+do we need/want to support changing this order?
+Doesn't seem useful to do so.
+
+So, what we need is have a way to specify a prefix only.


I think we can support this if we keep the regex approach. This looks like a good Enterprise feature to me.

Customers could define:

subprojects custom URL as /<subproject alias>/

matching URLs like docs.example.com/project1/ and docs.example.com/another/en/latest/

note there is no /projects/ in the URL at all

note project1 is single version and another is multilanguage and multiversion

another could also have another custom URL defined on it to be only multiversion, but no multilanguage, matching URLs like docs.example.com/another/latest/

regular URLs to remove multilanguage while keeping multiversion

invert language and versions for some reason

remove multiversion, but keep multilanguage

subprojects custom URL as //

This case is already supported with this proposal, there is an example at https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#project-with-custom-subproject-prefix-empty

regular URLs to remove multilanguage while keeping multiversion

I think this raises a question if we should have a different type of project, "Single translation project" #9425 (comment).

stsewd · 2022-07-13T19:12:13Z

but I'd love to hear a bit more about how you're imagining the implementation here. Would it be the same approach of having a project.urlconf, but just the backend is different?

Yeah, but we will have an extra attribute for the subproject prefix.

humitos

I insist on considering the approach for custom regex per project that will have less impact on the database (the proposed approach in this design doc adds 2 extra queries compared with the current implementation). Also, I think the logic I'm proposing here is easier to read/follow than the "Look up process" proposed in this design doc.

The logic would be in the following order:

get the canonical project
re.match("^/page/<filename>", url) -> it's a page redirect
re.match(project.url_pattern_subprojects, url) -> it's a valid subproject URL
re.match(project.url_pattern_docs, url) -> it's a valid docs URL

if none of the 3 matches, it's an invalid URL for that project and it will return 404 immediately.

As example, the projects patterns could take the following values:

project.url_pattern_subprojects_url is "^/subprojects/<subproject url>"
project.url_pattern_docs_url is "^/<language>/<version>/<filename>"

In the case of subprojects, we will then re.match(subproject.url_pattern_docs_url, "<subproject url>") allowing project and subproject to have different URL patterns (e.g. superproject could be single version and the subproject multiversion)

Marking a project as a single version will only change the url_pattern_docs to "^/<filename>/" and the logic does not need to care about that. It will just apply the regex to know if the URL is valid or not for that particular project.

Besides, matching a URL to a regex will immediately give us all the exact attributes we want to extract from it (using regex groups), without parsing it manually to find out if the en part of it is the version or the language, for example.

The only limitation that I'm currently seeing here is that the reserved path used to decide if the URL is for a subproject or not, can't be used as a URL in the canonical project. However, I think this is a limitation of the design, not of this implementation itself. Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p/<subproject url>/"

Also also, with this approach we could even allow the users to define the "/page/" redirect to something different. Example: "^/redirect/<filename>"

Custom URLs only needs to modify the url_pattern_docs to something like "^/deeplearning/nemo/user-guide/docs/<language>/<version>/<filename>" without requiring any change in the logic. Besides, it supports prefix, sufix, changing the order of the language/version or something completely customized by the user.

In general, I'm also trying to reduce the "possible options" we have to handle and reduce them all to custom regexs. This way, we don't need to think if the project has translations, subprojects or if it's single version or has a custom prefix. If the URL matches the pattern defined, it's valid. That's all. If it's valid, we do have already all the required elements to point to the right filename on S3.

humitos · 2022-07-14T08:55:45Z

docs/dev/design/better-doc-urls-handling.rst

+- Check if the canonical project has translations
+  (the project itself is a translation if isn't a single version project),
+  and the first part is a language code and the second is a version.
+
+  - If the lang code doesn't match, we continue.
+  - If the lang code matches, but the version doesn't, we return 404.
+
+- Check if it has subprojects and the first part of the URL matches the subprojects prefix (``projects``),


Here we are adding two extra DB queries per request, compared to the current implementation:

project has translations

project has subprojects

This will for sure increase the response time in some miliseconds.

humitos · 2022-07-14T09:36:48Z

docs/dev/design/better-doc-urls-handling.rst

+     - project
+     -
+   * - /en/1.0/404
+     - 404


I guess the requested file here should be /1.0/404.html.

docs/dev/design/better-doc-urls-handling.rst

ericholscher · 2022-07-14T21:32:22Z

Just a reminder @stsewd to throw some basic pseudocode of your implementation here, so that we can discuss it a bit more clearly.

humitos · 2022-07-18T08:57:55Z

docs/dev/design/better-doc-urls-handling.rst

+      # Subprojects.
+      prefix = path_parts[0]
+      project_slug = path_parts[1]
+      if check_subprojects and prefix == "projects":


I'm confused with this line and the goal just added: "Removing reserved paths and ambiguities from URLs". I understand that we are not removing the reserved path /projects/. Am I misunderstanding the meaning of reserved path? Can you expand on the specifics of that goal?

projects isn't reserved with this implementation, single version projects can use that path to serve files, see the last two examples from https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#single-version-project-with-subprojects.

Considering the "Single version with subproject" example.

project is single version

it's build with HTMLDir URLs

it has a subproject called something

the something subproject is also single version

Now, let's say that we receive the URL /projects/something/:

file /projects/something/index.html exists in the parent project

subproject called something also has an index.html file at its root

What file will we serve? Whatever the answer be, these are the cases why I'm saying that we are not removing the "reserved path" and the design has these restrictions. It does not seem we can remove this ambiguity while keep using a "reserved path" to give URLs a different meaning. No matter the implementation.

Yea, I am a little confused here too. I thought the goal of this project was to improve our handling to allow us to be smarter about these cases. It seemed in the design doc that custom prefixes were supported, is that not shown in this code example?

@humitos to your point, if we're serving subprojects there will always be some kind of overlap. You can't serve 2 different directories from the same path and not have an order that they are evaluated. I think that's fine, as long as we let users decide that the prefix is, instead of hard-coding projects.

@humitos to your point, if we're serving subprojects there will always be some kind of overlap. You can't serve 2 different directories from the same path and not have an order that they are evaluated. I think that's fine, as long as we let users decide that the prefix is, instead of hard-coding projects.

Exactly. That's what I'm saying.

The proposed implementation in this document is hard-coding /projects/ and that's why I raised this here.

It seemed in the design doc that custom prefixes were supported, is that not shown in this code example?

I thought that was part of the custom URLs part that we wanted to not support yet, I can update the code example to integrate the suffix.

humitos · 2022-07-18T09:00:50Z

docs/dev/design/better-doc-urls-handling.rst

+               return resolve(
+                  canonical_project=project,
+                  path_parts=path_parts[2:],
+                  check_subprojects=False,
+               )


Is this calling itself in a recursive way or it's calling the current resolve function we have now?

this is recursive, yes, guess this mimics more what we currently call unresolve, and what we call resolve is more like a reverse (from django.urls.reverse).

docs/dev/design/better-doc-urls-handling.rst

stsewd · 2022-07-18T20:07:20Z

I have updated the document to:

Scope more the goals, this is to make it simpler to move on and avoid having to decide what to do with custom prefix, custom order, single translations, etc Design doc: Better handling of docs URLs #9425 (comment).
Add an example of how the implementation looks like.

Answering some questions

I insist on considering the approach for custom regex per project that will have less impact on the database (the proposed approach in this design doc adds 2 extra queries compared with the current implementation).

There is no major performance impact with this implementation, we only execute the required db lookups for each case, see https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#performance.

The current implementation is scattered around, so I can't get that precise number of db lookups of it, but we can't execute less than 3 queries to get the current subproject, the translation, and the version for a subproject with translations, for example.

Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p//"

I'll be -1 on suggesting people to change the subprojects prefix, just so they can serve a projects directory.

humitos · 2022-07-19T09:14:44Z

Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p//"

I'll be -1 on suggesting people to change the subprojects prefix, just so they can serve a projects directory.

It's not just to allow serving projects directory --which is a valid use case already reported by packaging.python.org (see #2292). It's about allowing Pro/Enterprise customers to have nicer URLs for their documentation without imposed limitations dictated by the implementation. Providing this flexibility to higher plans without introducing any extra costs on our side, is 💯 , in my opinion.

ericholscher · 2022-07-19T19:35:40Z

docs/dev/design/better-doc-urls-handling.rst

+      project_slug = path_parts[1]
+      if check_subprojects and prefix == "projects":
+         project = canonical_project.subprojects.filter(alias=project_slug).first()
+         if project:


So this does only serve subprojects from matching directory if there is a matching subproject. That seems 💯

ericholscher · 2022-07-19T21:24:18Z

I feel like we're getting close here, but we should probably have a short call to discuss. Let's plan to chat for 15 mins after the Sprint planning tomorrow on this?

ericholscher · 2022-07-20T18:24:05Z

After our call, I think I have come to the following understanding:

Original goals

The original goal of this PR is a few major things:

Refactor the code to simplify URL handling. Have a single view as an entry-point, and do the parsing of URL's there, and route to the proper code paths
Be smart about URL routing, so the subproject prefix (/projects/ by default) is only a reserved path for projects that actually have subprojects on them, instead of any project.
Improve the URL handling to support custom prefixes or null prefixes for subprojects. This allows projects to have:
- /api/ route to an project-api subproject with an alias of api when subproject_prefix = None
- /p/api/ route to project-api subproject with an alias of api when subproject_prefix = 'p'

Feedback

As part of the simplification, we introduced the idea of parsing URLs completely in Python rather than URL's. This is a good approach if we look at simple URL's that are standard in our platform. However, one of the additional goals that we discussed was better support for Custom URLs for Pro+ plans on commercial, as well as community projects:

Additional goal: Better handling of Custom URLs for projects, allowing them to serve, for example, version-only projects /1.0/.

Proposed approach

I believe that we can achieve all of these goals by using regex-based parsing and the current approach. An instructive example is again the API subproject:

default_subproject_urls = `/projects/$subproject_slug/$subproject_language/$subproject_version/`
subproject_urls = default_subproject_urls
if project.has_subprojects() and project.subproject_urlconf:
    subproject_urls = project.subproject_urlconf # Example value: `/$subproject_slug/$subproject_version/`
    parse_with_regex(url, subproject_urls)

I think that this allows:

Single Entry-point: we have Python code managing parsing in a smart way, but with regex doing the parsing
Conditional URL parsing based on project knowledge (if project.has_subprojects())
Custom prefix & URL handling that's customizable by projects (project.subproject_urlconf)

And the following URL to route:

/api/1.0/ would route to the project-api subproject with an alias of api, and a 1.0 version.

Proposed implementation

I'd vote that we hold off on the actual implementation of designing the subproject_urlconf string, and how to represent that in a smart way. We can consider that a v2 feature and not block this initial PR on it, but know that we have an entry point for injecting that logic in a clean way going forward.

humitos · 2022-07-21T10:11:35Z

@ericholscher I'm on board with this implementation 👍. It's exactly what I was proposing in #9425 (review) with the addition of subproject_prefix = None that I wasn't aware we wanted that.

stsewd · 2022-07-21T14:41:42Z

using regex-based parsing and the current approach

Do you mean the current approach documented in the design doc?

Because in your example, you are matching a full URL instead of checking if the subproject is a single version project.

But if you mean instead of extracting the URL parts manually, we use regex, but if we keep the same logic, I'm fine with that.

ericholscher · 2022-07-21T17:32:46Z

But if you mean instead of extracting the URL parts manually, we use regex, but if we keep the same logic, I'm fine with that.

Yes, that's what I'm proposing. I think we're all in agreement on "smartly" parsing the URL with knowledge of the project. I think that is the path forward 👍

ericholscher · 2022-08-17T21:14:30Z

@stsewd I think we can go ahead and call this done 👍

stsewd · 2022-08-17T23:53:15Z

@stsewd I think we can go ahead and call this done +1

Should I update the doc and merge this?

- Move unresolve_domain to unresolver - Use the new implementation from #9425 for the unresolver - Remove settings override from the unresolver, we don't need it. This doesn't use this new implementation to handle the URLs from proxito nor has support for a custom urlconf, those things will be implemented in another PR.

ericholscher · 2022-08-31T15:51:11Z

Yep 👍

Design doc: Better handling of docs URLs

cb100a8

Ref #8327

stsewd requested a review from a team as a code owner July 11, 2022 21:16

stsewd requested a review from agjohnson July 11, 2022 21:16

auto-assign bot assigned stsewd Jul 11, 2022

stsewd requested a review from ericholscher July 11, 2022 21:16

ericholscher reviewed Jul 11, 2022

View reviewed changes

docs/dev/design/better-doc-urls-handling.rst Show resolved Hide resolved

humitos reviewed Jul 13, 2022

View reviewed changes

humitos reviewed Jul 14, 2022

View reviewed changes

ericholscher reviewed Jul 14, 2022

View reviewed changes

docs/dev/design/better-doc-urls-handling.rst Show resolved Hide resolved

Updates from review

8d50412

humitos reviewed Jul 18, 2022

View reviewed changes

stsewd added 2 commits July 18, 2022 14:05

Small updates

24e9fef

Merge branch 'main' into desing-doc-proxito-v2

3055c6c

ericholscher reviewed Jul 19, 2022

View reviewed changes

stsewd added 3 commits July 20, 2022 11:10

Update code example

30bbcca

Another small update

ed29129

Update goals

0b21391

stsewd mentioned this pull request Aug 15, 2022

New unresolver implementation #9500

Merged

stsewd added 2 commits August 31, 2022 12:27

Updates

2658ec6

Merge branch 'main' into desing-doc-proxito-v2

015dd31

stsewd enabled auto-merge (squash) August 31, 2022 17:42

stsewd disabled auto-merge August 31, 2022 18:00

stsewd enabled auto-merge (squash) August 31, 2022 18:00

stsewd disabled auto-merge August 31, 2022 18:00

stsewd merged commit 91b670d into main Aug 31, 2022

stsewd deleted the desing-doc-proxito-v2 branch August 31, 2022 18:00

stsewd mentioned this pull request Sep 14, 2022

Unresolver: strict validation for external versions and other fixes #9534

Merged

Design doc: Better handling of docs URLs #9425

Design doc: Better handling of docs URLs #9425

Conversation

stsewd commented Jul 11, 2022 • edited Loading

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos left a comment

Choose a reason for hiding this comment

humitos Jul 13, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Jul 13, 2022

humitos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericholscher commented Jul 14, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericholscher Jul 19, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Jul 18, 2022 • edited Loading

humitos commented Jul 19, 2022

Choose a reason for hiding this comment

ericholscher commented Jul 19, 2022

ericholscher commented Jul 20, 2022 • edited Loading

Original goals

Feedback

Proposed approach

Proposed implementation

humitos commented Jul 21, 2022

stsewd commented Jul 21, 2022

ericholscher commented Jul 21, 2022

ericholscher commented Aug 17, 2022

stsewd commented Aug 17, 2022

ericholscher commented Aug 31, 2022

stsewd commented Jul 11, 2022 •

edited

Loading

humitos Jul 13, 2022 •

edited

Loading

ericholscher Jul 19, 2022 •

edited

Loading

stsewd commented Jul 18, 2022 •

edited

Loading

ericholscher commented Jul 20, 2022 •

edited

Loading