-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Design doc: Better handling of docs URLs #9425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great PR. Thanks for putting so much work into it @stsewd! I need to process a bit more here, but I'd love to hear a bit more about how you're imagining the implementation here. Would it be the same approach of having a project.urlconf
, but just the backend is different?
- Health check | ||
- Proxied APIs | ||
- robots and sitemap | ||
- The ``page`` redirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe... The main use case here would be prefixing I think, as well as users proxying to us. We have hit issues in the past where we assume that we own the entire domain's hosting (docs.example.org/_/
), but when there's a prefix, we likely want to serve it at {prefix}/_/
, in case we aren't hosting the root domain in some configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, yes, then people that proxy us are the main target of the custom prefix feature I'd say, and others will probably use it to change the subprojects prefix and use the normal serving.
- Should we use the urlconf from the subproject when processing it? | ||
This is an URL like ``/projects/subproject/custom/prefix/en/latest/index.html``. | ||
|
||
I don't think that's useful, but it should be easy to support if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to think more about this. I can see users who might want to have eg. just a verison and not a translation exposed on the subproject:
/projects/subproject/1.0/index.html
which would serve subproject/$lang/1.0/index.html
or similar. Would we take the custom URLconf for the subproject from the canonical project? I don't know..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also raises the questions about what language we will use, or would this be a new type of projects? "Single translation projects".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, perhaps. I guess it would work similarly to single version projects, and just remove it from the URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But them, we need to restrict those projects from having translations, otherwise we will try to serve them from the domain of the main translation, and that link would be impossible to generate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding extra attributes to the project and telling people to enable/disable them, we can define the type of project via its URL pattern.
/<lang>/<version>/<filename>/
- multilanguage
- multiversion
/<lang>/<filename>
- multilanguage
- single version
/<version>/<filename>/
- multiversion
- single language
If the user tells us what the URLs will look like, we know exactly what type of project it is and how to parse and generate the URLs for them.
I don't think that's useful, but it should be easy to support if needed. | ||
|
||
- Should we support the page redirect when using a custom subproject prefix? | ||
This is ``/{prefix}/subproject/page/index.html``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a requirement, but would be useful if it isn't difficult. Would also be cool if users could customize it :)
/{prefix}/subproject/yolo-latest-version/index.html
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good 💯
I need to think a little more about the whole document because it was hard to digest. I like the idea of catching all the URL at a single entry point. However, immediately after thinking about that scenario, I thought that we could invert the logic about how we are handling URL. Instead of guessing what type of project it is from the URL, we could get that data from the db using the project's slug and with that information apply the specific regex to the URL for that type of project. If it does not match, it's a 404
Then, if we decide to continue moving forward processing the request because the URL matches, we will have all the data about that URL into the regex's group (similar to the work that the unresolver
does). With that information, we know exactly what file we have to serve.
Customers using custom URLs, only needs to define the pattern (regex) their URLs will have and proxito will use that custom regex instead of the pre-defined ones.
I also added some comments about building an extensible pattern where customers can modify the way their URLs look like, without being tied to the pattern /language/version/
. Allowing them to go "multi-language and single-version" or "single-version and multi-language", etc.
Instead of trying to map a URL to a view, | ||
we first analyze the canonical project (given from the subdomain), | ||
and based on that we map each part of the URL (parts are the result of splitting the URL on ``/``) | ||
to the *current* project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can still base our logic on regexs for simplicity. However, instead of applying the regex first and then deciding what type of project it's, we should invert the logic:
- decide what type of project it is (single version, multi versions, etc) based on its slug (e.g. getting it from the subdomain, HTTP header, domain itself, or any other method that we support) and its canonical project
- apply the correct regex based on that
Available regex to match to the URL once we know what type of project it is:
- single version
- multi versions
- custom URL pattern
- etc
if the specific regex does not match for the project type, we are sure it's a 404
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach will still require you to test with several regexes, if a project has subprojects it doesn't necessarily mean that it will always serve a subproject, this will also have the same problem of reserved paths, like a single version project with subprojects won't be able to have a projects
path, since the regex will match that path as a subproject path instead of a projects
directory.
Also, we already have all the information, we don't want to keep guessing the current project (with this proposal we are still guessing, but just the requested file, the current project is already inferred from the canonical project and the URL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
problem of reserved paths, like a single version project with subprojects won't be able to have a projects path
I think this is a limitation of the design, not the implementation.
With any possible implementation, how you determine if the URL starting with a reserved path (e.g. /projects/section/page.html
) is for the single version project or from a subproject of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we already have all the information, we don't want to keep guessing the current project
I don't follow you on this. Why are you saying that we will be guessing the current project when using regex but we won't when parsing the URL manually by splitting it by /
?
We always keep the lang/version/filename order, | ||
do we need/want to support changing this order? | ||
Doesn't seem useful to do so. | ||
|
||
So, what we need is have a way to specify a prefix only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can support this if we keep the regex approach. This looks like a good Enterprise feature to me.
Customers could define:
- subprojects custom URL as
/<subproject alias>/
- matching URLs like
docs.example.com/project1/
anddocs.example.com/another/en/latest/
- note there is no
/projects/
in the URL at all - note
project1
is single version andanother
is multilanguage and multiversion another
could also have another custom URL defined on it to be only multiversion, but no multilanguage, matching URLs likedocs.example.com/another/latest/
- matching URLs like
- regular URLs to remove multilanguage while keeping multiversion
- invert language and versions for some reason
- remove multiversion, but keep multilanguage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subprojects custom URL as //
This case is already supported with this proposal, there is an example at https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#project-with-custom-subproject-prefix-empty
regular URLs to remove multilanguage while keeping multiversion
I think this raises a question if we should have a different type of project, "Single translation project" #9425 (comment).
Yeah, but we will have an extra attribute for the subproject prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I insist on considering the approach for custom regex per project that will have less impact on the database (the proposed approach in this design doc adds 2 extra queries compared with the current implementation). Also, I think the logic I'm proposing here is easier to read/follow than the "Look up process" proposed in this design doc.
The logic would be in the following order:
- get the canonical project
re.match("^/page/<filename>", url)
-> it's a page redirectre.match(project.url_pattern_subprojects, url)
-> it's a valid subproject URLre.match(project.url_pattern_docs, url)
-> it's a valid docs URL
if none of the 3 matches, it's an invalid URL for that project and it will return 404 immediately.
As example, the projects patterns could take the following values:
project.url_pattern_subprojects_url
is"^/subprojects/<subproject url>"
project.url_pattern_docs_url
is"^/<language>/<version>/<filename>"
In the case of subprojects, we will then re.match(subproject.url_pattern_docs_url, "<subproject url>")
allowing project and subproject to have different URL patterns (e.g. superproject could be single version and the subproject multiversion)
Marking a project as a single version will only change the url_pattern_docs
to "^/<filename>/"
and the logic does not need to care about that. It will just apply the regex to know if the URL is valid or not for that particular project.
Besides, matching a URL to a regex will immediately give us all the exact attributes we want to extract from it (using regex groups), without parsing it manually to find out if the en
part of it is the version or the language, for example.
The only limitation that I'm currently seeing here is that the reserved path used to decide if the URL is for a subproject or not, can't be used as a URL in the canonical project. However, I think this is a limitation of the design, not of this implementation itself. Considering that this approach allows users to define that reserved path, I don't see it as a big issue, since they could define whatever they want for this. Example: "^/p/<subproject url>/"
Also also, with this approach we could even allow the users to define the "/page/"
redirect to something different. Example: "^/redirect/<filename>"
Custom URLs only needs to modify the url_pattern_docs
to something like "^/deeplearning/nemo/user-guide/docs/<language>/<version>/<filename>"
without requiring any change in the logic. Besides, it supports prefix, sufix, changing the order of the language/version or something completely customized by the user.
In general, I'm also trying to reduce the "possible options" we have to handle and reduce them all to custom regexs. This way, we don't need to think if the project has translations, subprojects or if it's single version or has a custom prefix. If the URL matches the pattern defined, it's valid. That's all. If it's valid, we do have already all the required elements to point to the right filename on S3.
- Check if the canonical project has translations | ||
(the project itself is a translation if isn't a single version project), | ||
and the first part is a language code and the second is a version. | ||
|
||
- If the lang code doesn't match, we continue. | ||
- If the lang code matches, but the version doesn't, we return 404. | ||
|
||
- Check if it has subprojects and the first part of the URL matches the subprojects prefix (``projects``), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are adding two extra DB queries per request, compared to the current implementation:
- project has translations
- project has subprojects
This will for sure increase the response time in some miliseconds.
- project | ||
- | ||
* - /en/1.0/404 | ||
- 404 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the requested file here should be /1.0/404.html
.
Just a reminder @stsewd to throw some basic pseudocode of your implementation here, so that we can discuss it a bit more clearly. |
# Subprojects. | ||
prefix = path_parts[0] | ||
project_slug = path_parts[1] | ||
if check_subprojects and prefix == "projects": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused with this line and the goal just added: "Removing reserved paths and ambiguities from URLs". I understand that we are not removing the reserved path /projects/
. Am I misunderstanding the meaning of reserved path? Can you expand on the specifics of that goal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
projects
isn't reserved with this implementation, single version projects can use that path to serve files, see the last two examples from https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#single-version-project-with-subprojects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the "Single version with subproject" example.
- project is single version
- it's build with HTMLDir URLs
- it has a subproject called
something
- the
something
subproject is also single version
Now, let's say that we receive the URL /projects/something/
:
- file
/projects/something/index.html
exists in the parent project - subproject called
something
also has anindex.html
file at its root
What file will we serve? Whatever the answer be, these are the cases why I'm saying that we are not removing the "reserved path" and the design has these restrictions. It does not seem we can remove this ambiguity while keep using a "reserved path" to give URLs a different meaning. No matter the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I am a little confused here too. I thought the goal of this project was to improve our handling to allow us to be smarter about these cases. It seemed in the design doc that custom prefixes were supported, is that not shown in this code example?
@humitos to your point, if we're serving subprojects there will always be some kind of overlap. You can't serve 2 different directories from the same path and not have an order that they are evaluated. I think that's fine, as long as we let users decide that the prefix is, instead of hard-coding projects
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@humitos to your point, if we're serving subprojects there will always be some kind of overlap. You can't serve 2 different directories from the same path and not have an order that they are evaluated. I think that's fine, as long as we let users decide that the prefix is, instead of hard-coding projects.
Exactly. That's what I'm saying.
The proposed implementation in this document is hard-coding /projects/
and that's why I raised this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed in the design doc that custom prefixes were supported, is that not shown in this code example?
I thought that was part of the custom URLs part that we wanted to not support yet, I can update the code example to integrate the suffix.
return resolve( | ||
canonical_project=project, | ||
path_parts=path_parts[2:], | ||
check_subprojects=False, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this calling itself in a recursive way or it's calling the current resolve
function we have now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is recursive, yes, guess this mimics more what we currently call unresolve
, and what we call resolve
is more like a reverse
(from django.urls.reverse).
I have updated the document to:
Answering some questions
There is no major performance impact with this implementation, we only execute the required db lookups for each case, see https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html#performance. The current implementation is scattered around, so I can't get that precise number of db lookups of it, but we can't execute less than 3 queries to get the current subproject, the translation, and the version for a subproject with translations, for example.
I'll be -1 on suggesting people to change the subprojects prefix, just so they can serve a |
It's not just to allow serving |
project_slug = path_parts[1] | ||
if check_subprojects and prefix == "projects": | ||
project = canonical_project.subprojects.filter(alias=project_slug).first() | ||
if project: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this does only serve subprojects from matching directory if there is a matching subproject. That seems 💯
I feel like we're getting close here, but we should probably have a short call to discuss. Let's plan to chat for 15 mins after the Sprint planning tomorrow on this? |
After our call, I think I have come to the following understanding: Original goalsThe original goal of this PR is a few major things:
FeedbackAs part of the simplification, we introduced the idea of parsing URLs completely in Python rather than URL's. This is a good approach if we look at simple URL's that are standard in our platform. However, one of the additional goals that we discussed was better support for Custom URLs for Pro+ plans on commercial, as well as community projects:
Proposed approachI believe that we can achieve all of these goals by using regex-based parsing and the current approach. An instructive example is again the API subproject:
I think that this allows:
And the following URL to route:
Proposed implementationI'd vote that we hold off on the actual implementation of designing the |
@ericholscher I'm on board with this implementation 👍. It's exactly what I was proposing in #9425 (review) with the addition of |
Do you mean the current approach documented in the design doc? Because in your example, you are matching a full URL instead of checking if the subproject is a single version project. But if you mean instead of extracting the URL parts manually, we use regex, but if we keep the same logic, I'm fine with that. |
Yes, that's what I'm proposing. I think we're all in agreement on "smartly" parsing the URL with knowledge of the project. I think that is the path forward 👍 |
@stsewd I think we can go ahead and call this done 👍 |
Should I update the doc and merge this? |
- Move unresolve_domain to unresolver - Use the new implementation from #9425 for the unresolver - Remove settings override from the unresolver, we don't need it. This doesn't use this new implementation to handle the URLs from proxito nor has support for a custom urlconf, those things will be implemented in another PR.
Yep 👍 |
Ref #8327
Preview at https://dev--9425.org.readthedocs.build/en/9425/design/better-doc-urls-handling.html