Unicode column misalignment #2612

Closed
wesm opened this issue Dec 29, 2012 · 9 comments
Labels
Bug Output-Formatting __repr__ of pandas objects, to_string Unicode Unicode strings

@wesm
Member

wesm commented Dec 29, 2012

In [17]: open('/home/wesm/tmp/foo.csv', 'rb').read()
Out[17]: '\xe6\xb8\xac\xe8\xa9\xa6\xe4\xb8\x80,\xe6\xb8\xac\xe8\xa9\xa6\xe4\xb8\x89\r\nabc@example.com,\xe6\xb8\xac\xe8\xa9\xa6\xe4\xb8\x80\r\ndef@example.com,\xe6\xb8\xac\xe8\xa9\xa6\xe4\xba\x8c\r\nghi@example.com,\xe6\xb8\xac\xe8\xa9\xa6\xe4\xb8\x89\r\n'

In [18]: read_csv('/home/wesm/tmp/foo.csv', encoding='utf-8')
Out[18]: 
               測試一  測試三
0  abc@example.com  測試一
1  def@example.com  測試二
2  ghi@example.com  測試三

In [24]: df
Out[24]: 
               測試一  測試三
0  abc@example.com  測試一
1  def@example.com  測試二
2  ghi@example.com  測試三

In [25]: df.columns[0]
Out[25]: u'\u6e2c\u8a66\u4e00'

In [26]: df.columns[1]
Out[26]: u'\u6e2c\u8a66\u4e09'
@wesm
Member Author

wesm commented Dec 29, 2012

Actually, this may just be because monospace alignment is not possible with Chinese characters.

@changhiskhan
Contributor

It would significantly impact performance, but we could use unicodedata.east_asian_width to check whether the characters are double-width. Maybe do this after we include a .pandas file, so that if you work with East Asian fonts you can have it on by default?
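
A minimal sketch of that check, assuming Python 3 and a terminal font that draws wide and fullwidth characters in exactly two columns. The helper name `display_width` is hypothetical and is not part of pandas; combining characters and other zero-width cases are ignored here.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns `text` occupies.

    Hypothetical helper: counts 'W' (wide) and 'F' (fullwidth)
    characters as two columns and everything else as one.
    """
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


# len() counts characters, which is what misaligns the column header.
print(len("測試一"), display_width("測試一"))
print(len("abc@example.com"), display_width("abc@example.com"))
```

The per-character lookup is where the performance cost mentioned above would come from, since every cell has to be scanned instead of calling `len()` once.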

@ariddell

Did some work on this. Turns out the monospace Chinese characters in question are exactly 2 monospace ASCII characters wide.

commit 1002a365fb81291403ec43d253a5e97fdf3234f4 closes #2612

>>> df
               測試一  測試三
0  abc@example.com  測試一
1  def@example.com  測試二
2  ghi@example.com  測試三

Now:

>>> df
               測試一           測試三
0  abc@example.com        測試一   
1  def@example.com        測試二   
2  ghi@example.com        測試三   

The fix correctly calculates the width of the three (Chinese) character data as six display characters. Not sure why it's not also fixing the header display.

@jreback
Contributor

jreback commented Sep 22, 2013

Let's push to 0.14. Once .pandasrc is in place, this is easy to allow as an option.

@jreback jreback modified the milestones: Someday, 0.14.0 Mar 9, 2014
@sinhrks
Member

sinhrks commented Aug 9, 2014

I looked into this a little; colwidth should also handle the other 4 East Asian width categories ('Na', 'N', 'H', 'A'). Also, I think common.adjoin and format._make_fixed_width should be fixed to change the number of padding spaces.
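
To illustrate the padding point, here is a rough sketch (not the code in the branch linked below) of justification based on display width rather than `len()`. All function names are hypothetical. It assumes 'Na', 'N' and 'H' characters take one column, 'W' and 'F' take two, and 'A' (ambiguous) takes one unless the caller says otherwise.

```python
import unicodedata


def display_width(text, ambiguous_as_wide=False):
    """Estimate terminal columns; 'A' counts as wide only on request."""
    wide = {"W", "F"}
    if ambiguous_as_wide:
        wide.add("A")
    return sum(2 if unicodedata.east_asian_width(ch) in wide else 1
               for ch in text)


def rjust_display(text, width, ambiguous_as_wide=False):
    """Right-justify `text` to `width` columns, padding with spaces."""
    pad = max(width - display_width(text, ambiguous_as_wide), 0)
    return " " * pad + text


def format_column(header, values):
    """Right-justify a header and its values to a shared display width."""
    cells = [header] + list(values)
    width = max(display_width(cell) for cell in cells)
    return [rjust_display(cell, width) for cell in cells]


for line in format_column("測試一", ["abc@example.com", "def@example.com"]):
    print(line)
```

The number of padding spaces now depends on the content of each cell, which is why both the width calculation and the justification step need to change together.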

My current result

https://github.com/sinhrks/pandas/tree/unicode_justify

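import numpy as np
import pandas as pd

# Two columns of random floats with wide (Japanese) column names
df = pd.DataFrame(np.random.randn(3, 2), columns=[u'パンダ子パンダ孫パンダ', u'もう笹飽きた'])
print(df)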

[Screenshot of the resulting output: 2014-08-09 23 02 22]

@ayapi

ayapi commented Sep 12, 2015

Hello, this is a very important issue for Japanese, Chinese, and Korean users.
Please take action.

@jreback
Contributor

jreback commented Sep 12, 2015

@sinhrks I think you have a branch with a possible fix?

Can you revive it when you have a chance? Thanks.

@jreback jreback modified the milestones: 0.17.1, Someday Sep 12, 2015
@sinhrks
Member

sinhrks commented Sep 12, 2015

Sure. The blocker was how to write a test that works on both py2 and py3 (can't use escaped unicode because it changes the East Asian width). Now we can use u'' literals.

@jreback
Contributor

jreback commented Sep 12, 2015

Right. Further, if we need an option (OK by me), I would use display.unicode.* (i.e. create a new namespace).
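
For reference, a short sketch of how such an option is used. This thread does not name the final option; to my knowledge, later pandas releases (0.17.0 onward) expose it as `display.unicode.east_asian_width`, with a companion `display.unicode.ambiguous_as_wide`. The example assumes a reasonably recent pandas on Python 3 and reuses the data from the original report.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "測試一": ["abc@example.com", "def@example.com", "ghi@example.com"],
        "測試三": ["測試一", "測試二", "測試三"],
    }
)

# Default behaviour: padding is based on character count, so cells
# containing wide characters can drift out of alignment in a terminal.
default = df.to_string()
print(default)

# Opt in to width-aware padding for this block only.
with pd.option_context("display.unicode.east_asian_width", True):
    aligned = df.to_string()
    print(aligned)
```

Keeping the behaviour behind an option leaves the default rendering path free of the per-character width lookup.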
