-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Use unicode-slugify to generate Version.slug #5186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8991505
to
514bfc3
Compare
Uses ``unicode-slugify`` to generate the slug. | ||
""" | ||
|
||
ok_chars = '-._' # dash, dot, underscore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the rfc https://tools.ietf.org/html/rfc1035
The following syntax will result in fewer problems with many
applications that use domain names (e.g., mail, TELNET).
<domain> ::= <subdomain> | " "
<subdomain> ::= <label> | <subdomain> "." <label>
<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
<let-dig> ::= <letter> | <digit>
_
underscore isn't allowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it should start with a letter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I didn't change this behavior. We could consider.
Our current regex was working like that:
# Regex breakdown:
# [a-z0-9] -- start with alphanumeric value
# [-._a-z0-9] -- allow dash, dot, underscore, digit, lowercase ascii
# *? -- allow multiple of those, but be not greedy about the matching
# (?: ... ) -- wrap everything so that the pattern cannot escape when used in
# regexes.
VERSION_SLUG_REGEX = '(?:[a-z0-9A-Z][-._a-z0-9A-Z]*?)'
Basically, it does not allow .
, -
and _
as starting character, but it does allow a letter (lower and upper case) and a number.
If we want to be more strict and follow those directions, I suppose we should not allow the .
(dot) either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, this only affects the version's slug, not the project's slug. We are safe here, sorry for the noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we use the same logic for project slugs if it's not explicitly defined by the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we use this for project's slug https://github.com/stsewd/readthedocs.org/blob/f1fd0c9827c10834277d63aa35cb4511ee3ae5ed/readthedocs/core/utils/__init__.py#L209-L219
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean removing the _
and .
. I mean, "why we don't use the all same logic for generating the slug for a Version than the slug for a Project?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because Project's are used in DNS, but Versions aren't. They are different things, so we don't need the exact same restrictions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I understand that.
I'm suggesting to use the same logic (unicode-slugify to generate the slug) but apply different restrictions.
Anyway, if so, it's for a different PR.
While I don't have a very strong opinion here, the 3rd party slugify method seems to be almost the same as Django's built-in one. Unless there's a good reason, I don't think we need to add a dependency here. |
@davidfischer if you take a look at the examples in the README, it works way better with unicode chars (Chinese, for example) I put the same example at #1410 (comment)
another example in Japanese:
Currently, the Chinese and Japanese example are broken. I reported this at mozilla/unicode-slugify#34 but I think that we should generate proper slug from the these names instead of an empty string. I'm blocking this PR because of the bug in the 3rd party library for now, but I'm open to more discussion on this PR. |
The trick is at this line: https://github.com/mozilla/unicode-slugify/blob/master/slugify/__init__.py#L80 that uses
Django version doesn't do that. |
Sorry, the broken examples are broken only in version 0.1.3 (which is the latest release). Under 0.1.5 they work properly and generate valid slugs:
|
I see what you're saying. I take back what I said and I think I'm good with this. Version slugs should have the decimal point and a few other characters. |
It's configured for |
514bfc3
to
2cc7d59
Compare
Already discussed that: project slug =! version slug since project slug is used as a subdomain and it has bigger restrictions. If we can use the same logic to do "the best" conversion from a Unicode char to its ASCII representation, we should do that in another PR. My idea here is to use "the same logic" but after that "apply different restrictions" to version slug than project slug to fulfill each needing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good change. This will only effect versions going forward right? Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?
readthedocs/builds/version_slug.py
Outdated
@@ -26,6 +26,7 @@ | |||
|
|||
from django.db import models | |||
from django.utils.encoding import force_text | |||
from slugify import slugify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should have a different import for this:
from slugify import slugify | |
from slugify import slugify as unicode_slugify |
readthedocs/builds/version_slug.py
Outdated
@@ -78,13 +81,37 @@ def get_queryset(self, model_cls, slug_field): | |||
return model._default_manager.all() | |||
return model_cls._default_manager.all() | |||
|
|||
def _normalize(self, content): | |||
""" | |||
Normalize some invalid characters to become a dash (``-``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a better docstring -- this tells me what not _why. Is there a reason we only do this for a subset of characters, but not others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the why: which is basically to keep compatibility.
@@ -68,6 +68,9 @@ django-kombu==0.9.4 | |||
mock==2.0.0 | |||
stripe==2.19.0 | |||
|
|||
# unicode-slugify==0.1.5 is not released on PyPI yet | |||
git+https://github.com/mozilla/unicode-slugify@b696c37#egg=unicode-slugify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙁
I tested this locally and it works 👍 We only compare the version name and id, not the slug. |
Yes. No migration or anything like that is applied.
Yes. I tested triggering a build of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'm a little hesitant to depend on a prerelease version, but I guess it's fine.
Our regex were replacing "all invalid chars by
-
" which I think it's not a very good pattern because we ended up with slugs liked-----branch---a
(multiple dashes together).This implementation only replaces
/
,%
,!
and?
with dashes to keep the behavior written in tests. I think that/
is a good one to replace because it's a common way to separate words, but I'm not sold with the rest ones.Besides, this PR uses
unicode-slugify
which work better with unicode project names using it's "best" ASCII representation.Closes #1410