Commit 2a96074

ENH: Data formatting with unicode length
1 parent 5049b5e commit 2a96074

11 files changed

+1049
-81
lines changed

doc/source/options.rst

+53
Original file line numberDiff line numberDiff line change
@@ -440,3 +440,56 @@ For instance:
440440
pd.reset_option('^display\.')
441441
442442
To round floats on a case-by-case basis, you can also use :meth:`~pandas.Series.round` and :meth:`~pandas.DataFrame.round`.
443+
444+
.. _options.east_asian_width:
445+
446+
Unicode Formatting
447+
------------------
448+
449+
.. warning::
450+
451+
Enabling this option slows down printing of DataFrame and Series (about 2 times slower).
452+
Use only when it is actually required.
453+
454+
Some East Asian countries use Unicode characters whose display width corresponds to 2 Latin characters.
455+
If a DataFrame or Series contains these characters, the default output is not aligned properly.
456+
457+
.. ipython:: python
458+
459+
df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})
460+
df
461+
462+
Enabling ``display.unicode.east_asian_width`` allows pandas to check each character's "East Asian Width" property.
463+
These characters can be aligned properly by checking this property, but doing so takes longer than the standard ``len`` function.
464+
465+
.. ipython:: python
466+
467+
pd.set_option('display.unicode.east_asian_width', True)
468+
df
469+
470+
In addition, Unicode contains characters whose width is "Ambiguous". The width of these characters is either 1 or 2, depending on the terminal setting or encoding. Because this cannot be determined from Python, the ``display.unicode.ambiguous_as_wide`` option is added to handle it.
471+
472+
By default, the width of an "Ambiguous" character, "¡" (inverted exclamation mark) in the example below, is regarded as 1.
473+
474+
.. note::
475+
476+
This should be aligned properly in a terminal which uses a monospaced font.
477+
478+
.. ipython:: python
479+
480+
df = pd.DataFrame({'a': ['xxx', u'¡¡'], 'b': ['yyy', u'¡¡']})
481+
df
482+
483+
Enabling ``display.unicode.ambiguous_as_wide`` lets pandas regard the width of these characters as 2. Note that this option takes effect only when ``display.unicode.east_asian_width`` is enabled. In the output below, the starting position has changed, but the columns are not aligned properly because the setting does not match this environment.
484+
485+
.. ipython:: python
486+
487+
pd.set_option('display.unicode.ambiguous_as_wide', True)
488+
df
489+
490+
.. ipython:: python
491+
:suppress:
492+
493+
pd.set_option('display.unicode.east_asian_width', False)
494+
pd.set_option('display.unicode.ambiguous_as_wide', False)
495+
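The "East Asian Width" property that the documentation above refers to can be inspected with the standard library. The following is a minimal sketch, not part of the commit; the sample characters and the `categories` name are chosen here for illustration only.

```python
import unicodedata

# Illustrative sample: ASCII letter, CJK ideograph, hiragana,
# halfwidth katakana, fullwidth Latin letter, inverted exclamation mark.
samples = ["A", "国", "し", "ｱ", "Ａ", "¡"]

# Look up the East Asian Width property value of each character.
categories = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

Characters reported as `W` or `F` occupy two columns in a typical monospaced terminal, while `A` marks the "Ambiguous" case that ``display.unicode.ambiguous_as_wide`` is meant to resolve.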

doc/source/whatsnew/v0.17.0.txt

+31
@@ -49,6 +49,7 @@ Highlights include:
4949
- Support for reading SAS xport files, see :ref:`here <whatsnew_0170.enhancements.sas_xport>`
5050
- Documentation comparing SAS to *pandas*, see :ref:`here <compare_with_sas>`
5151
- Removal of the automatic TimeSeries broadcasting, deprecated since 0.8.0, see :ref:`here <whatsnew_0170.prior_deprecations>`
52+
- Display format with plain text can optionally align with Unicode East Asian Width, see :ref:`here <whatsnew_0170.east_asian_width>`
5253
- Compatibility with Python 3.5 (:issue:`11097`)
5354
- Compatibility with matplotlib 1.5.0 (:issue:`11111`)
5455

@@ -334,6 +335,36 @@ Google BigQuery Enhancements
334335
- The ``generate_bq_schema()`` function is now deprecated and will be removed in a future version (:issue:`11121`)
335336
- Update the gbq module to support Python 3 (:issue:`11094`).
336337

338+
.. _whatsnew_0170.east_asian_width:
339+
340+
Display Alignment with Unicode East Asian Width
341+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
342+
343+
.. warning::
344+
345+
Enabling this option slows down printing of DataFrame and Series (about 2 times slower).
346+
Use only when it is actually required.
347+
348+
Some East Asian countries use Unicode characters whose display width corresponds to 2 Latin characters. If a DataFrame or Series contains these characters, the default output is not aligned properly. The following options are added to enable precise handling of these characters.
349+
350+
- ``display.unicode.east_asian_width``: Whether to use the Unicode East Asian Width to calculate the display text width. (:issue:`2612`)
351+
- ``display.unicode.ambiguous_as_wide``: Whether to treat Unicode characters belonging to the Ambiguous category as Wide. (:issue:`11102`)
352+
353+
.. ipython:: python
354+
355+
df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})
356+
df
357+
358+
pd.set_option('display.unicode.east_asian_width', True)
359+
df
360+
361+
For further details, see :ref:`here <options.east_asian_width>`
362+
363+
.. ipython:: python
364+
:suppress:
365+
366+
pd.set_option('display.unicode.east_asian_width', False)
367+
337368
.. _whatsnew_0170.enhancements.other:
338369

339370
Other enhancements

pandas/compat/__init__.py

+40
@@ -35,6 +35,7 @@
3535
from itertools import product
3636
import sys
3737
import types
38+
from unicodedata import east_asian_width
3839

3940
PY2 = sys.version_info[0] == 2
4041
PY3 = (sys.version_info[0] >= 3)
@@ -90,6 +91,7 @@ def lmap(*args, **kwargs):
9091

9192
def lfilter(*args, **kwargs):
9293
return list(filter(*args, **kwargs))
94+
9395
else:
9496
# Python 2
9597
import re
@@ -176,6 +178,11 @@ class to receive bound method
176178
# The license for this library can be found in LICENSES/SIX and the code can be
177179
# found at https://bitbucket.org/gutworth/six
178180

181+
# Definition of East Asian Width
182+
# http://unicode.org/reports/tr11/
183+
# Ambiguous width can be changed by option
184+
_EAW_MAP = {'Na': 1, 'N': 1, 'W': 2, 'F': 2, 'H': 1}
185+
179186
if PY3:
180187
string_types = str,
181188
integer_types = int,
@@ -188,6 +195,20 @@ def u(s):
188195

189196
def u_safe(s):
190197
return s
198+
199+
def strlen(data, encoding=None):
200+
# encoding is for compat with PY2
201+
return len(data)
202+
203+
def east_asian_len(data, encoding=None, ambiguous_width=1):
204+
"""
205+
Calculate display width considering unicode East Asian Width
206+
"""
207+
if isinstance(data, text_type):
208+
return sum([_EAW_MAP.get(east_asian_width(c), ambiguous_width) for c in data])
209+
else:
210+
return len(data)
211+
191212
else:
192213
string_types = basestring,
193214
integer_types = (int, long)
@@ -204,6 +225,25 @@ def u_safe(s):
204225
except:
205226
return s
206227

228+
def strlen(data, encoding=None):
229+
try:
230+
data = data.decode(encoding)
231+
except UnicodeError:
232+
pass
233+
return len(data)
234+
235+
def east_asian_len(data, encoding=None, ambiguous_width=1):
236+
"""
237+
Calculate display width considering unicode East Asian Width
238+
"""
239+
if isinstance(data, text_type):
240+
try:
241+
data = data.decode(encoding)
242+
except UnicodeError:
243+
pass
244+
return sum([_EAW_MAP.get(east_asian_width(c), ambiguous_width) for c in data])
245+
else:
246+
return len(data)
207247

208248
string_and_binary_types = string_types + (binary_type,)
209249
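The width calculation added in this file relies on `dict.get` with a fallback: every category listed in the map has a fixed width, and the only remaining category, `A` (Ambiguous), takes whatever `ambiguous_width` is passed. A minimal Python 3 sketch of that idea follows; `display_width` and `EAW_MAP` are names chosen here for illustration, not pandas API.

```python
from unicodedata import east_asian_width

# Same mapping as in the diff; "A" (Ambiguous) is deliberately absent
# so that it falls through to the caller-supplied default.
EAW_MAP = {"Na": 1, "N": 1, "W": 2, "F": 2, "H": 1}


def display_width(text, ambiguous_width=1):
    """Sum per-character column widths of a str."""
    return sum(EAW_MAP.get(east_asian_width(c), ambiguous_width) for c in text)


for s in ["Alice", "しのぶ", "日本", "¡¡"]:
    # Compare the code-point count with the estimated column count.
    print(repr(s), len(s), display_width(s), display_width(s, ambiguous_width=2))
```

Only the last string changes between the two calls, because it is the only one made of Ambiguous characters.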

pandas/core/common.py

+28-6
@@ -2149,28 +2149,50 @@ def _count_not_none(*args):
21492149

21502150

21512151

2152-
def adjoin(space, *lists):
2152+
def adjoin(space, *lists, **kwargs):
21532153
"""
21542154
Glues together two sets of strings using the amount of space requested.
21552155
The idea is to prettify.
2156-
"""
2156+
2157+
----------
2158+
space : int
2159+
number of spaces for padding
2160+
lists : str
2161+
list of str to be joined
2162+
strlen : callable
2163+
function used to calculate the length of each str. Needed for unicode
2164+
handling.
2165+
justfunc : callable
2166+
function used to justify str. Needed for unicode handling.
2167+
"""
2168+
strlen = kwargs.pop('strlen', len)
2169+
justfunc = kwargs.pop('justfunc', _justify)
2170+
21572171
out_lines = []
21582172
newLists = []
2159-
lengths = [max(map(len, x)) + space for x in lists[:-1]]
2160-
2173+
lengths = [max(map(strlen, x)) + space for x in lists[:-1]]
21612174
# not the last one
21622175
lengths.append(max(map(len, lists[-1])))
2163-
21642176
maxLen = max(map(len, lists))
21652177
for i, lst in enumerate(lists):
2166-
nl = [x.ljust(lengths[i]) for x in lst]
2178+
nl = justfunc(lst, lengths[i], mode='left')
21672179
nl.extend([' ' * lengths[i]] * (maxLen - len(lst)))
21682180
newLists.append(nl)
21692181
toJoin = zip(*newLists)
21702182
for lines in toJoin:
21712183
out_lines.append(_join_unicode(lines))
21722184
return _join_unicode(out_lines, sep='\n')
21732185

2186+
def _justify(texts, max_len, mode='right'):
2187+
"""
2188+
Perform ljust, center, rjust against string or list-like
2189+
"""
2190+
if mode == 'left':
2191+
return [x.ljust(max_len) for x in texts]
2192+
elif mode == 'center':
2193+
return [x.center(max_len) for x in texts]
2194+
else:
2195+
return [x.rjust(max_len) for x in texts]
21742196

21752197
def _join_unicode(lines, sep=''):
21762198
try:
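The new `justfunc` hook in `adjoin` exists because `str.ljust`, `str.center` and `str.rjust` count code points rather than display columns. One way a width-aware replacement with the same `(texts, max_len, mode)` signature as `_justify` could look is sketched below. `justify_east_asian` and `display_width` are hypothetical helpers written for this illustration and are not claimed to match the pandas implementation.

```python
from unicodedata import east_asian_width

EAW_MAP = {"Na": 1, "N": 1, "W": 2, "F": 2, "H": 1}


def display_width(text, ambiguous_width=1):
    """Estimated number of terminal columns occupied by text."""
    return sum(EAW_MAP.get(east_asian_width(c), ambiguous_width) for c in text)


def justify_east_asian(texts, max_len, mode="right"):
    """Pad each str in texts to max_len display columns."""

    def target(x):
        # The str padding methods count code points, so shrink the
        # requested length by the extra columns wide characters take.
        return max_len - (display_width(x) - len(x))

    if mode == "left":
        return [x.ljust(target(x)) for x in texts]
    elif mode == "center":
        return [x.center(target(x)) for x in texts]
    else:
        return [x.rjust(target(x)) for x in texts]


cells = justify_east_asian(["UK", "日本"], 6, mode="left")
for cell in cells:
    # The bars make the trailing padding visible.
    print("|" + cell + "|")
```

Under these assumptions, a caller would pass the width function as `strlen` and the justify function as `justfunc` when calling `adjoin`, so that both the column lengths and the padding use display columns.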

pandas/core/config_init.py

+15
@@ -144,6 +144,17 @@
144144
Deprecated.
145145
"""
146146

147+
pc_east_asian_width_doc = """
148+
: boolean
149+
Whether to use the Unicode East Asian Width to calculate the display text width.
150+
Enabling this may affect performance (default: False)
151+
"""
152+
pc_ambiguous_as_wide_doc = """
153+
: boolean
154+
Whether to treat Unicode characters belonging to the Ambiguous category as Wide (width=2)
155+
(default: False)
156+
"""
157+
147158
pc_line_width_deprecation_warning = """\
148159
line_width has been deprecated, use display.width instead (currently both are
149160
identical)
@@ -282,6 +293,10 @@ def mpl_style_cb(key):
282293
pc_line_width_doc)
283294
cf.register_option('memory_usage', True, pc_memory_usage_doc,
284295
validator=is_instance_factory([type(None), bool]))
296+
cf.register_option('unicode.east_asian_width', False,
297+
pc_east_asian_width_doc, validator=is_bool)
298+
cf.register_option('unicode.ambiguous_as_wide', False,
299+
pc_ambiguous_as_wide_doc, validator=is_bool)
285300

286301
cf.deprecate_option('display.line_width',
287302
msg=pc_line_width_deprecation_warning,
