UTF8 characters on version slugging -- or slugging in general #1410

agjohnson · 2015-07-06T17:33:13Z

This ticket came up as part of #1407. We should make sure version slugging is handling UTF8 characters in a sane way.

gregmuellegger · 2015-07-06T18:30:30Z

Is there any algorithm you can suggest? Currently, if an all-non-ASCII name is provided, the algorithm returns unknown. Partly non-ASCII words like Straße come out "chunked", like stra-e.
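
To make the behaviour described above concrete, here is a minimal sketch of an ASCII-only slugifier. It is not the project's actual code: the function name `naive_ascii_slug` and the regular expression are assumptions chosen only to reproduce the two symptoms mentioned (the "unknown" fallback and the "chunked" output).

```python
import re


def naive_ascii_slug(name: str, fallback: str = "unknown") -> str:
    """Hypothetical ASCII-only slugifier (illustration, not the real code)."""
    # Replace every run of characters outside a-z and 0-9 with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    # Remove hyphens left at the edges by stripped characters.
    slug = slug.strip("-")
    # An all-non-ASCII name leaves nothing behind, so use the fallback.
    return slug or fallback


print(naive_ascii_slug("Straße"))  # stra-e
print(naive_ascii_slug("北京"))  # unknown
```

Any approach that simply discards non-ASCII characters will show both symptoms; avoiding them requires either transliteration or allowing Unicode in the slug.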

ericholscher · 2015-07-06T18:32:44Z

This might be a good solution: https://pypi.python.org/pypi/unicode-slugify
It converts UTF-8 characters to their ASCII "equivalents". We could also just
generate UTF-8 slugs. I don't know a lot about this. In general I think
we support UTF-8, so that is perhaps the best solution, but we might run
into issues with nginx etc. downstream.


gregmuellegger · 2015-07-20T15:46:32Z

I think using UTF-8 in the URL is no good, as there are just too many tools that do not properly handle UTF-8. However, Wikipedia is doing it. The problem I see is with subdomains that contain project slugs. They will definitely cause problems when they contain non-ASCII characters.

humitos · 2017-06-05T02:54:09Z

We've been using the standard library to handle this (converting all the Unicode to its ASCII representation):

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'camión').encode('ascii', 'ignore')
'camion'
>>>
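
For reference, a Python 3 sketch of the same standard-library approach, wrapped in a helper whose name (`ascii_fold`) is made up for illustration. It also shows where the approach falls short: NFKD only helps for characters that decompose into an ASCII base letter plus combining marks, so ß and CJK characters are dropped rather than transliterated.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Hypothetical helper: strip accents via NFKD, drop the rest."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards every non-ASCII code point;
    # decode again so the result is str rather than bytes.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("camión"))  # camion
print(ascii_fold("Straße"))  # Strae (ß has no decomposition, so it is lost)
print(repr(ascii_fold("北京")))  # '' (nothing survives)
```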

humitos · 2018-12-05T11:47:04Z

> I think using UTF-8 in the URL is no good as there are just too many tools that do not properly handle UTF-8. However Wikipedia is doing it.

I don't think using Unicode characters in the URL is a problem.

Actually, we do support this for filenames: https://test-builds.readthedocs.io/en/unicode-filename/

but we do replace the unicode chars when they are in the version's name/identifier/slug: https://test-builds.readthedocs.io/en/d--branch/

> The problem I see is with subdomains that contain project slugs. They definitely will make problems when they contain non-ascii characters.

We should probably keep the project's slug as ASCII.
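
As a rough illustration of why subdomains are the awkward case: DNS labels must be ASCII, so a non-ASCII project slug would have to be served under its punycode ("xn--") form. The sketch below uses Python's built-in "idna" codec, which implements the older IDNA 2003 rules; the helper name `to_dns_label` is hypothetical and nothing here reflects how the project actually builds subdomains.

```python
def to_dns_label(label: str) -> str:
    """Hypothetical helper: return the ASCII form of one DNS label."""
    # The built-in "idna" codec returns ASCII bytes; non-ASCII labels
    # come back in punycode with an "xn--" prefix.
    return label.encode("idna").decode("ascii")


for slug in ("camion", "北京"):
    print(slug, "->", to_dns_label(slug))
```

An ASCII slug passes through unchanged, while a non-ASCII one becomes an opaque "xn--" string in the hostname, which is one more reason to keep project slugs ASCII.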

stale · 2019-01-19T12:20:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

humitos · 2019-01-28T13:39:06Z

I want to reopen it and apply the solution proposed by Eric: use unicode-slugify.

I also considered django.utils.text.slugify, but it does not work as well as unicode-slugify:

django.utils.text.slugify:

In [1]: from django.utils.text import slugify

In [2]: slugify('北京 (capital of China)')
Out[2]: 'capital-of-china'

unicode-slugify:

from slugify import slugify
slugify(u'北京 (capital of China)', only_ascii=True)
# u'bei-jing-capital-of-china'

(There are more examples in its docs.)

dojutsu-user · 2019-01-28T14:08:15Z

@humitos
Just a question.
I think that, to solve this, these lines
https://github.com/rtfd/readthedocs.org/blob/05cc76b89a72cc963e091854b38adf28ab19ae3e/readthedocs/builds/version_slug.py#L85-L87
should be replaced with a call to unicode-slugify.

Something like:

from slugify import slugify
slugified = slugify(content)

humitos · 2019-01-28T14:30:31Z

@dojutsu-user yes, it is that line. However, I've already created a PR for this issue: #5186

dojutsu-user · 2019-01-28T15:10:30Z

@humitos
Okay. 👍

agjohnson mentioned this issue Jul 6, 2015

Allow single char version slugs. #1407

Merged

gregmuellegger added the Improvement Minor improvement to code label Jul 20, 2015

agjohnson mentioned this issue Aug 25, 2015

error by the branch name with Unicode chars #1594

Closed

agjohnson mentioned this issue Sep 1, 2015

list of branches isn't updated #1617

Closed

chaudum mentioned this issue Nov 18, 2015

Project "versions page" returns Internal Server Error #1822

Closed

adamnovak mentioned this issue Feb 7, 2017

How to restore a project to functionality after pushing a branch with non-ASCII UTF-8 characters #2632

Closed

RichardLitt mentioned this issue Jan 18, 2018

Build failed without details #3531

Closed

humitos mentioned this issue Jul 26, 2018

Support git unicode branches #4433

Merged

humitos mentioned this issue Aug 2, 2018

Unicode in branch names gives unexpected error #3060

Closed

stale bot added the Status: stale Issue will be considered inactive soon label Jan 19, 2019

stale bot closed this as completed Jan 26, 2019

humitos reopened this Jan 28, 2019

stale bot removed the Status: stale Issue will be considered inactive soon label Jan 28, 2019

humitos added Accepted Accepted issue on our roadmap Status: stale Issue will be considered inactive soon labels Jan 28, 2019

stale bot removed the Status: stale Issue will be considered inactive soon label Jan 28, 2019

humitos mentioned this issue Jan 28, 2019

Use unicode-slugify to generate Version.slug #5186

Merged

humitos self-assigned this Feb 18, 2019

humitos closed this as completed in #5186 Feb 19, 2019
