Skip to content

Design doc: Better handling of docs URLs #9425

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Aug 31, 2022
Merged

Design doc: Better handling of docs URLs #9425

merged 9 commits into from
Aug 31, 2022

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Jul 11, 2022

@stsewd stsewd requested a review from a team as a code owner July 11, 2022 21:16
@stsewd stsewd requested a review from agjohnson July 11, 2022 21:16
@stsewd stsewd requested a review from ericholscher July 11, 2022 21:16
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR. Thanks for putting so much work into it @stsewd! I need to process a bit more here, but I'd love to hear a bit more about how you're imagining the implementation here. Would it be the same approach of having a project.urlconf, but just the backend is different?

- Health check
- Proxied APIs
- robots and sitemap
- The ``page`` redirect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe... The main use case here would be prefixing I think, as well as users proxying to us. We have hit issues in the past where we assume that we own the entire domain's hosting (docs.example.org/_/), but when there's a prefix, we likely want to serve it at {prefix}/_/, in case we aren't hosting the root domain in some configuration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, yes, then people that proxy us are the main target of the custom prefix feature I'd say, and others will probably use it to change the subprojects prefix and use the normal serving.

- Should we use the urlconf from the subproject when processing it?
This is an URL like ``/projects/subproject/custom/prefix/en/latest/index.html``.

I don't think that's useful, but it should be easy to support if needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to think more about this. I can see users who might want to have eg. just a verison and not a translation exposed on the subproject:

/projects/subproject/1.0/index.html which would serve subproject/$lang/1.0/index.html or similar. Would we take the custom URLconf for the subproject from the canonical project? I don't know..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also raises the questions about what language we will use, or would this be a new type of projects? "Single translation projects".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, perhaps. I guess it would work similarly to single version projects, and just remove it from the URL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But them, we need to restrict those projects from having translations, otherwise we will try to serve them from the domain of the main translation, and that link would be impossible to generate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding extra attributes to the project and telling people to enable/disable them, we can define the type of project via its URL pattern.

  • /<lang>/<version>/<filename>/
    • multilanguage
    • multiversion
  • /<lang>/<filename>
    • multilanguage
    • single version
  • /<version>/<filename>/
    • multiversion
    • single language

If the user tells us what the URLs will look like, we know exactly what type of project it is and how to parse and generate the URLs for them.

I don't think that's useful, but it should be easy to support if needed.

- Should we support the page redirect when using a custom subproject prefix?
This is ``/{prefix}/subproject/page/index.html``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a requirement, but would be useful if it isn't difficult. Would also be cool if users could customize it :)

/{prefix}/subproject/yolo-latest-version/index.html 🚀

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good 💯

I need to think a little more about the whole document because it was hard to digest. I like the idea of catching all the URL at a single entry point. However, immediately after thinking about that scenario, I thought that we could invert the logic about how we are handling URL. Instead of guessing what type of project it is from the URL, we could get that data from the db using the project's slug and with that information apply the specific regex to the URL for that type of project. If it does not match, it's a 404

Then, if we decide to continue moving forward processing the request because the URL matches, we will have all the data about that URL into the regex's group (similar to the work that the unresolver does). With that information, we know exactly what file we have to serve.

Customers using custom URLs, only needs to define the pattern (regex) their URLs will have and proxito will use that custom regex instead of the pre-defined ones.

I also added some comments about building an extensible pattern where customers can modify the way their URLs look like, without being tied to the pattern /language/version/. Allowing them to go "multi-language and single-version" or "single-version and multi-language", etc.

Comment on lines 28 to 31
Instead of trying to map a URL to a view,
we first analyze the canonical project (given from the subdomain),
and based on that we map each part of the URL (parts are the result of splitting the URL on ``/``)
to the *current* project.
Copy link
Member

@humitos humitos Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still base our logic on regexs for simplicity. However, instead of applying the regex first and then deciding what type of project it's, we should invert the logic:

  • decide what type of project it is (single version, multi versions, etc) based on its slug (e.g. getting it from the subdomain, HTTP header, domain itself, or any other method that we support) and its canonical project
  • apply the correct regex based on that

Available regex to match to the URL once we know what type of project it is:

  • single version
  • multi versions
  • custom URL pattern
  • etc

if the specific regex does not match for the project type, we are sure it's a 404

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach will still require you to test with several regexes, if a project has subprojects it doesn't necessarily mean that it will always serve a subproject, this will also have the same problem of reserved paths, like a single version project with subprojects won't be able to have a projects path, since the regex will match that path as a subproject path instead of a projects directory.

Also, we already have all the information, we don't want to keep guessing the current project (with this proposal we are still guessing, but just the requested file, the current project is already inferred from the canonical project and the URL).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

problem of reserved paths, like a single version project with subprojects won't be able to have a projects path

I think this is a limitation of the design, not the implementation.

With any possible implementation, how you determine if the URL starting with a reserved path (e.g. /projects/section/page.html) is for the single version project or from a subproject of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we already have all the information, we don't want to keep guessing the current project

I don't follow you on this. Why are you saying that we will be guessing the current project when using regex but we won't when parsing the URL manually by splitting it by /?

Comment on lines +93 to +97
We always keep the lang/version/filename order,
do we need/want to support changing this order?
Doesn't seem useful to do so.

So, what we need is have a way to specify a prefix only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can support this if we keep the regex approach. This looks like a good Enterprise feature to me.

Customers could define:

  • subprojects custom URL as /<subproject alias>/
    • matching URLs like docs.example.com/project1/ and docs.example.com/another/en/latest/
    • note there is no /projects/ in the URL at all
    • note project1 is single version and another is multilanguage and multiversion
    • another could also have another custom URL defined on it to be only multiversion, but no multilanguage, matching URLs like docs.example.com/another/latest/
  • regular URLs to remove multilanguage while keeping multiversion
  • invert language and versions for some reason
  • remove multiversion, but keep multilanguage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subprojects custom URL as //

This case is already supported with this proposal, there is an example at https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#project-with-custom-subproject-prefix-empty

regular URLs to remove multilanguage while keeping multiversion

I think this raises a question if we should have a different type of project, "Single translation project" #9425 (comment).

@stsewd
Copy link
Member Author

stsewd commented Jul 13, 2022

but I'd love to hear a bit more about how you're imagining the implementation here. Would it be the same approach of having a project.urlconf, but just the backend is different?

Yeah, but we will have an extra attribute for the subproject prefix.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I insist on considering the approach for custom regex per project that will have less impact on the database (the proposed approach in this design doc adds 2 extra queries compared with the current implementation). Also, I think the logic I'm proposing here is easier to read/follow than the "Look up process" proposed in this design doc.

The logic would be in the following order:

  1. get the canonical project
  2. re.match("^/page/<filename>", url) -> it's a page redirect
  3. re.match(project.url_pattern_subprojects, url) -> it's a valid subproject URL
  4. re.match(project.url_pattern_docs, url) -> it's a valid docs URL

if none of the 3 matches, it's an invalid URL for that project and it will return 404 immediately.

As example, the projects patterns could take the following values:

  • project.url_pattern_subprojects_url is "^/subprojects/<subproject url>"
  • project.url_pattern_docs_url is "^/<language>/<version>/<filename>"

In the case of subprojects, we will then re.match(subproject.url_pattern_docs_url, "<subproject url>") allowing project and subproject to have different URL patterns (e.g. superproject could be single version and the subproject multiversion)


Marking a project as a single version will only change the url_pattern_docs to "^/<filename>/" and the logic does not need to care about that. It will just apply the regex to know if the URL is valid or not for that particular project.

Besides, matching a URL to a regex will immediately give us all the exact attributes we want to extract from it (using regex groups), without parsing it manually to find out if the en part of it is the version or the language, for example.

The only limitation that I'm currently seeing here is that the reserved path used to decide if the URL is for a subproject or not, can't be used as a URL in the canonical project. However, I think this is a limitation of the design, not of this implementation itself. Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p/<subproject url>/"

Also also, with this approach we could even allow the users to define the "/page/" redirect to something different. Example: "^/redirect/<filename>"

Custom URLs only needs to modify the url_pattern_docs to something like "^/deeplearning/nemo/user-guide/docs/<language>/<version>/<filename>" without requiring any change in the logic. Besides, it supports prefix, sufix, changing the order of the language/version or something completely customized by the user.


In general, I'm also trying to reduce the "possible options" we have to handle and reduce them all to custom regexs. This way, we don't need to think if the project has translations, subprojects or if it's single version or has a custom prefix. If the URL matches the pattern defined, it's valid. That's all. If it's valid, we do have already all the required elements to point to the right filename on S3.

Comment on lines 58 to 65
- Check if the canonical project has translations
(the project itself is a translation if isn't a single version project),
and the first part is a language code and the second is a version.

- If the lang code doesn't match, we continue.
- If the lang code matches, but the version doesn't, we return 404.

- Check if it has subprojects and the first part of the URL matches the subprojects prefix (``projects``),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are adding two extra DB queries per request, compared to the current implementation:

  • project has translations
  • project has subprojects

This will for sure increase the response time in some miliseconds.

- project
-
* - /en/1.0/404
- 404
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the requested file here should be /1.0/404.html.

@ericholscher
Copy link
Member

Just a reminder @stsewd to throw some basic pseudocode of your implementation here, so that we can discuss it a bit more clearly.

# Subprojects.
prefix = path_parts[0]
project_slug = path_parts[1]
if check_subprojects and prefix == "projects":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused with this line and the goal just added: "Removing reserved paths and ambiguities from URLs". I understand that we are not removing the reserved path /projects/. Am I misunderstanding the meaning of reserved path? Can you expand on the specifics of that goal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projects isn't reserved with this implementation, single version projects can use that path to serve files, see the last two examples from https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#single-version-project-with-subprojects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the "Single version with subproject" example.

  • project is single version
  • it's build with HTMLDir URLs
  • it has a subproject called something
  • the something subproject is also single version

Now, let's say that we receive the URL /projects/something/:

  • file /projects/something/index.html exists in the parent project
  • subproject called something also has an index.html file at its root

What file will we serve? Whatever the answer be, these are the cases why I'm saying that we are not removing the "reserved path" and the design has these restrictions. It does not seem we can remove this ambiguity while keep using a "reserved path" to give URLs a different meaning. No matter the implementation.

Copy link
Member

@ericholscher ericholscher Jul 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I am a little confused here too. I thought the goal of this project was to improve our handling to allow us to be smarter about these cases. It seemed in the design doc that custom prefixes were supported, is that not shown in this code example?

@humitos to your point, if we're serving subprojects there will always be some kind of overlap. You can't serve 2 different directories from the same path and not have an order that they are evaluated. I think that's fine, as long as we let users decide that the prefix is, instead of hard-coding projects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humitos to your point, if we're serving subprojects there will always be some kind of overlap. You can't serve 2 different directories from the same path and not have an order that they are evaluated. I think that's fine, as long as we let users decide that the prefix is, instead of hard-coding projects.

Exactly. That's what I'm saying.

The proposed implementation in this document is hard-coding /projects/ and that's why I raised this here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed in the design doc that custom prefixes were supported, is that not shown in this code example?

I thought that was part of the custom URLs part that we wanted to not support yet, I can update the code example to integrate the suffix.

Comment on lines 444 to 448
return resolve(
canonical_project=project,
path_parts=path_parts[2:],
check_subprojects=False,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this calling itself in a recursive way or it's calling the current resolve function we have now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is recursive, yes, guess this mimics more what we currently call unresolve, and what we call resolve is more like a reverse (from django.urls.reverse).

@stsewd
Copy link
Member Author

stsewd commented Jul 18, 2022

I have updated the document to:

Answering some questions

I insist on considering the approach for custom regex per project that will have less impact on the database (the proposed approach in this design doc adds 2 extra queries compared with the current implementation).

There is no major performance impact with this implementation, we only execute the required db lookups for each case, see https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#performance.

The current implementation is scattered around, so I can't get that precise number of db lookups of it, but we can't execute less than 3 queries to get the current subproject, the translation, and the version for a subproject with translations, for example.

Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p//"

I'll be -1 on suggesting people to change the subprojects prefix, just so they can serve a projects directory.

@humitos
Copy link
Member

humitos commented Jul 19, 2022

Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p//"

I'll be -1 on suggesting people to change the subprojects prefix, just so they can serve a projects directory.

It's not just to allow serving projects directory --which is a valid use case already reported by packaging.python.org (see #2292). It's about allowing Pro/Enterprise customers to have nicer URLs for their documentation without imposed limitations dictated by the implementation. Providing this flexibility to higher plans without introducing any extra costs on our side, is 💯 , in my opinion.

project_slug = path_parts[1]
if check_subprojects and prefix == "projects":
project = canonical_project.subprojects.filter(alias=project_slug).first()
if project:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this does only serve subprojects from matching directory if there is a matching subproject. That seems 💯

@ericholscher
Copy link
Member

I feel like we're getting close here, but we should probably have a short call to discuss. Let's plan to chat for 15 mins after the Sprint planning tomorrow on this?

@ericholscher
Copy link
Member

ericholscher commented Jul 20, 2022

After our call, I think I have come to the following understanding:

Original goals

The original goal of this PR is a few major things:

  • Refactor the code to simplify URL handling. Have a single view as an entry-point, and do the parsing of URL's there, and route to the proper code paths
  • Be smart about URL routing, so the subproject prefix (/projects/ by default) is only a reserved path for projects that actually have subprojects on them, instead of any project.
  • Improve the URL handling to support custom prefixes or null prefixes for subprojects. This allows projects to have:
    • /api/ route to an project-api subproject with an alias of api when subproject_prefix = None
    • /p/api/ route to project-api subproject with an alias of api when subproject_prefix = 'p'

Feedback

As part of the simplification, we introduced the idea of parsing URLs completely in Python rather than URL's. This is a good approach if we look at simple URL's that are standard in our platform. However, one of the additional goals that we discussed was better support for Custom URLs for Pro+ plans on commercial, as well as community projects:

  • Additional goal: Better handling of Custom URLs for projects, allowing them to serve, for example, version-only projects /1.0/.

Proposed approach

I believe that we can achieve all of these goals by using regex-based parsing and the current approach. An instructive example is again the API subproject:

default_subproject_urls = `/projects/$subproject_slug/$subproject_language/$subproject_version/`
subproject_urls = default_subproject_urls
if project.has_subprojects() and project.subproject_urlconf:
    subproject_urls = project.subproject_urlconf # Example value: `/$subproject_slug/$subproject_version/`
    parse_with_regex(url, subproject_urls)

I think that this allows:

  • Single Entry-point: we have Python code managing parsing in a smart way, but with regex doing the parsing
  • Conditional URL parsing based on project knowledge (if project.has_subprojects())
  • Custom prefix & URL handling that's customizable by projects (project.subproject_urlconf)

And the following URL to route:

  • /api/1.0/ would route to the project-api subproject with an alias of api, and a 1.0 version.

Proposed implementation

I'd vote that we hold off on the actual implementation of designing the subproject_urlconf string, and how to represent that in a smart way. We can consider that a v2 feature and not block this initial PR on it, but know that we have an entry point for injecting that logic in a clean way going forward.

@humitos
Copy link
Member

humitos commented Jul 21, 2022

@ericholscher I'm on board with this implementation 👍. It's exactly what I was proposing in #9425 (review) with the addition of subproject_prefix = None that I wasn't aware we wanted that.

@stsewd
Copy link
Member Author

stsewd commented Jul 21, 2022

using regex-based parsing and the current approach

Do you mean the current approach documented in the design doc?

Because in your example, you are matching a full URL instead of checking if the subproject is a single version project.

But if you mean instead of extracting the URL parts manually, we use regex, but if we keep the same logic, I'm fine with that.

@ericholscher
Copy link
Member

But if you mean instead of extracting the URL parts manually, we use regex, but if we keep the same logic, I'm fine with that.

Yes, that's what I'm proposing. I think we're all in agreement on "smartly" parsing the URL with knowledge of the project. I think that is the path forward 👍

@ericholscher
Copy link
Member

@stsewd I think we can go ahead and call this done 👍

@stsewd
Copy link
Member Author

stsewd commented Aug 17, 2022

@stsewd I think we can go ahead and call this done +1

Should I update the doc and merge this?

stsewd added a commit that referenced this pull request Aug 23, 2022
- Move unresolve_domain to unresolver
- Use the new implementation from #9425  for the unresolver
- Remove settings override from the unresolver, we don't need it.

This doesn't use this new implementation to handle the URLs from proxito nor has support for a custom urlconf, those things will be implemented in another PR.
@ericholscher
Copy link
Member

Yep 👍

@stsewd stsewd enabled auto-merge (squash) August 31, 2022 17:42
@stsewd stsewd disabled auto-merge August 31, 2022 18:00
@stsewd stsewd enabled auto-merge (squash) August 31, 2022 18:00
@stsewd stsewd disabled auto-merge August 31, 2022 18:00
@stsewd stsewd merged commit 91b670d into main Aug 31, 2022
@stsewd stsewd deleted the desing-doc-proxito-v2 branch August 31, 2022 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants