ENH: Data formatting with unicode length #11102

sinhrks · 2015-09-15T12:38:20Z

Closes #2612. Added the display.unicode.east_asian_width option, which calculates text width taking East Asian Width into account. Enabling this option affects performance, because the width must be calculated per character.

Current results (captured)

Basic impl and test
Series / DataFrame truncation
Perf test
Doc / Release note

shoyer · 2015-09-16T17:42:08Z

How much slower is this than the default behavior? My guess is that anyone who uses these characters will want this.

Usually we're not printing enough data to the screen for performance on this sort of stuff to matter.

kawochen · 2015-09-16T18:37:22Z

pandas/compat/__init__.py

+
+    def east_asian_len(data, encoding=None):
+        """
+        Calcurate display width considering unicode East Asian Width


typo - calculate

kawochen · 2015-09-16T20:09:17Z

east_asian_len calculation can be reduced to a 3-element set membership testing. Ever so slightly faster but probably too micro to matter. I think ambiguous characters need special handling (?).
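A minimal sketch of the two approaches being compared, using only the standard library. The choice of `W`, `F` and `A` as the three "wide" categories is an assumption about which set is meant; the function names `east_asian_len_set` and `east_asian_len_dict` are hypothetical and are not the PR's code.

```python
import unicodedata

# Assumption: these three East Asian Width categories occupy two columns.
_WIDE = frozenset(["W", "F", "A"])

# Dict-based variant: one entry per category, defaulting to width 1.
_WIDTHS = {"Na": 1, "N": 1, "H": 1, "W": 2, "F": 2, "A": 2}


def east_asian_len_set(text):
    """Display width using a 3-element set membership test."""
    return sum(2 if unicodedata.east_asian_width(ch) in _WIDE else 1
               for ch in text)


def east_asian_len_dict(text):
    """Display width using a per-category dict lookup."""
    return sum(_WIDTHS.get(unicodedata.east_asian_width(ch), 1)
               for ch in text)
```

Both variants do constant-time work per character, so any difference between them would be a small constant factor rather than a change in complexity.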

sinhrks · 2015-09-27T02:04:31Z

OK, this PR should work for all cases that I'm aware of. I'd appreciate it if anyone who has concerns could provide further test cases.

@shoyer Yes, East Asian users would prefer this to default to True. But it is almost 2 times slower in the case below.

The DataFrame contains 10,000 items (100 rows * 100 columns); each item contains 10 Unicode characters.
Display options are default:
- pd.options.display.max_rows: 60
- pd.options.display.max_columns: 20

import numpy as np
import pandas as pd

chars = list(u'あいうえおかきくけこさしすせそたちつてとなにぬねの')

def rand_jp(x):
    return ''.join(np.random.choice(chars) for _ in range(x))

df = pd.DataFrame(np.empty((100, 100)))
df = df.applymap(lambda x: rand_jp(10))

%timeit unicode(df)
# 10 loops, best of 3: 177 ms per loop

# Enable Unicode handling
pd.options.display.unicode.east_asian_width = True

%timeit unicode(df)
# 1 loops, best of 3: 381 ms per loop

The effect is almost the same for all-ASCII data (same conditions, except that all characters are ASCII):

Default: 10 loops, best of 3: 131 ms per loop
Enable Unicode handling: 1 loops, best of 3: 302 ms per loop

@kawochen I may not understand properly, but reducing the number of East Asian Width categories will not affect performance, because a dict lookup is O(1).

I think ambiguous characters need special handling (?).

Do you have any idea which characters are affected and what the special handling logic should be? My concern is that these characters cannot be aligned properly even if we try.

CC: @ayapi

shoyer · 2015-09-27T02:44:33Z

If I understand correctly, because pandas does not actually print every element of large DataFrames, printing a larger DataFrame would take about the same time?

How does this patch affect the speed of printing Unicode text if it only contains ASCII characters?

sinhrks · 2015-09-27T03:16:09Z

@shoyer Correct. Larger data can be printed at almost the same speed. Performance is not affected by data size once it exceeds max_columns and max_rows.

Results for all-ASCII data are described above. It takes almost 2 times longer, because the logic is the same. It could be short-circuited if we could tell whether the input contains any 2-byte characters (including symbols). Is that possible using any built-in?

shoyer · 2015-09-27T05:47:39Z

I would probably try tweaking the string length check and seeing if a simple variant can give you a speed up. But generally this is pretty reasonable already. Few dataframes will show this many strings.
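One possible variant of the string length check, sketched under assumptions: it tries an ASCII fast path first and only falls back to the per-character width lookup when the text contains non-ASCII characters. The function name `display_len` and the set of wide categories are hypothetical, and whether this is actually faster was not measured here.

```python
import unicodedata

# Assumption: categories treated as two columns wide.
_WIDE = frozenset(["W", "F", "A"])


def display_len(text):
    """Display width with a fast path for pure-ASCII text."""
    try:
        # Encoding fails as soon as a non-ASCII character is present.
        text.encode("ascii")
    except UnicodeEncodeError:
        pass
    else:
        # Pure ASCII: every character is one column wide.
        return len(text)
    # Slow path: look up the width category of each character.
    return sum(2 if unicodedata.east_asian_width(ch) in _WIDE else 1
               for ch in text)
```

On Python 3.7 and later, `text.isascii()` could replace the try/except block.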

kawochen · 2015-09-27T14:15:38Z

@sinhrks Yes the optimization I mentioned is so minor I don't know why I brought it up. It makes east_asian_len take about 30~40% less time to run, but I don't think this is where time is spent anyways. Regarding ambiguous characters I am not sure what can be done (perhaps config/options)? Also wanted to mention in passing that printing ⟼ (neutral) would still not print as one would hope.

jreback · 2015-09-27T14:23:37Z

pandas/core/common.py

+    else:
+        name = strlen.__name__
+
+    if name == 'east_asian_len':


pass a different pad function if you are in east_asian_len mode; then you don't need the if/then block that duplicates code

sinhrks · 2015-10-01T14:40:07Z

@jreback I've refactored a little based on your comments. I've created a TextAdjustment class, which holds the set of East Asian dependent functions, rather than defining a separate pad function, because separate definitions may cause unexpected results in the future (e.g. using the East Asian pad with the normal len). If it looks OK, I'll squash.
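The design idea can be illustrated with a simplified sketch: the length function and the padding function live on the same object, so they cannot be mixed up. This is not the PR's code. Only the name `TextAdjustment` comes from the comment above; the subclass name, the method names and the wide-category set are assumptions made for illustration.

```python
import unicodedata


class TextAdjustment:
    """Pads text using the built-in len() as the width measure."""

    def len(self, text):
        return len(text)

    def justify(self, texts, max_len, mode="left"):
        # Padding always uses self.len, so width and padding stay consistent.
        out = []
        for t in texts:
            fill = " " * max(max_len - self.len(t), 0)
            out.append(t + fill if mode == "left" else fill + t)
        return out


class EastAsianTextAdjustment(TextAdjustment):
    """Same padding logic, but width follows East Asian Width."""

    # Assumption: categories treated as two columns wide.
    _WIDE = frozenset(["W", "F", "A"])

    def len(self, text):
        return sum(2 if unicodedata.east_asian_width(c) in self._WIDE else 1
                   for c in text)
```

A caller would pick one adjustment object based on the display option and use it for both measuring and padding.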

@kawochen OK, I'll leave the current east_asian_len as is.

Also, please provide the list of ambiguous characters and their widths. As far as I understand, the widths of these ambiguous characters are not integer multiples, so they cannot be aligned by padding with white space. I don't think we can fix it, because it comes from the Unicode spec.

jreback · 2015-10-01T14:46:28Z

looks nice!

can u run a perf check on the relevant benchmarks to assert it's about the same as current?

add a release note and can do for 0.17.0

kawochen · 2015-10-01T16:36:19Z

@sinhrks ambiguous characters are either wide or narrow. see here for list http://unicode.org/reports/tr11-2/

sinhrks · 2015-10-01T22:08:56Z

@kawochen OK. Going back to the first discussion, this can't be supported without clarifying the logic. I can't find per-character logic in your link. Maybe I misunderstood something? Please provide actual code that works as you expect.

kawochen · 2015-10-01T23:05:47Z

@sinhrks There is no per character logic. All ambiguous characters are narrow on my terminal, but that's just my settings. Users should know whether it should be treated as wide or narrow, which depends on where the characters are being printed, so making it configurable might make sense. I don't think that can be figured out from within Python. You can try printing the characters in the list.

sinhrks · 2015-10-01T23:50:27Z

Thanks. I think I understand a little better now. From the link:

When mapping Unicode to East Asian legacy character encodings:

- Wide Unicode characters always map to fullwidth characters.
- Narrow (and neutral) Unicode characters always map to halfwidth characters.
- Halfwidth Unicode characters always map to halfwidth characters.
- Ambiguous Unicode characters always map to fullwidth characters.

When mapping Unicode to non-East Asian legacy character encodings:

- Wide Unicode characters do not map to non-East Asian legacy character encodings.
- Narrow (and neutral) Unicode characters always map to regular (narrow) characters.
- Halfwidth Unicode characters do not map.
- Ambiguous Unicode characters always map to regular (narrow) characters.

When mapping Unicode to East Asian legacy character encodings: Ambiguous characters should be handled as full-width (length=2). This is covered by the implementation added in this PR (display.unicode.east_asian_width=True).

When mapping Unicode to non-East Asian legacy character encodings: Because full-width characters are not mapped (they cannot appear), all characters, including ambiguous ones, can be regarded as half-width (length=1). This corresponds to the current default, display.unicode.east_asian_width=False.

What situation requires a separate option only for Ambiguous characters?

kawochen · 2015-10-02T00:29:44Z

@sinhrks when we are not mapping to legacy encodings. For example in my terminal Chinese characters are twice as wide as ambiguous characters. Whether it should be 1 or 2 depends on where and how the data will be displayed. In the unlikely case when we wanted it to look pretty in Arial then we'd print tabs to align stuff. In mono space fonts it's easier so we can use spaces. In modern terminals ambiguous characters are usually narrow.

sinhrks · 2015-10-02T00:52:59Z

@kawochen Can you describe with screenshots and code which I can confirm on my terminal?

kawochen · 2015-10-02T00:59:50Z

In [2]: print('中文\n\u00A1\u00A1ab')
In [4]: east_asian_width('\u00A1')
Out[4]: 'A'

So if I have all of those in a DataFrame, I might have too much padding.

sinhrks · 2015-10-02T03:13:06Z

Understood. How about the name unicode.ambiguous_as_wide, defaulting to False (length=1) since narrow is the common behavior in current terminals? If set to True, the length is regarded as 2.
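The proposed behavior can be sketched with the standard library alone. This is an illustration of the idea, not the PR's implementation: the function name `display_width` is hypothetical, and the keyword argument simply mirrors the proposed option name.

```python
import unicodedata


def display_width(text, ambiguous_as_wide=False):
    """Column width of text; Ambiguous ('A') characters are configurable."""
    ambiguous = 2 if ambiguous_as_wide else 1
    total = 0
    for ch in text:
        category = unicodedata.east_asian_width(ch)
        if category in ("W", "F"):
            total += 2          # Wide and Fullwidth: two columns
        elif category == "A":
            total += ambiguous  # Ambiguous: depends on the setting
        else:
            total += 1          # Narrow, Halfwidth, Neutral: one column
    return total


# The two strings from the example above: U+00A1 is category 'A'.
print(display_width("\u4e2d\u6587"))                                 # 4
print(display_width("\u00a1\u00a1ab"))                               # 4
print(display_width("\u00a1\u00a1ab", ambiguous_as_wide=True))       # 6
```

With the default setting both strings measure four columns, which matches a terminal that renders ambiguous characters as narrow; with `ambiguous_as_wide=True` the second string would receive two fewer padding spaces.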

sinhrks · 2015-10-02T15:19:28Z

Updated to add release note and unicode.ambiguous_as_wide. "Ambiguous" characters cannot be aligned properly on Sphinx (Jupyter).

terminal

The latter case cannot be aligned because of the mismatch between the terminal and the pandas option.

Sphinx/Jupyter

Going to add a note saying "This should be aligned properly in a terminal which uses a monospaced font."

I'll update perf comparison tomorrow.

jreback · 2015-10-02T21:58:15Z

doc/source/whatsnew/v0.17.0.txt

@@ -304,7 +305,7 @@ See the :ref:`documentation <io.excel>` for more details.
   :suppress:

   import os
-   os.remove('test.xlsx')
+   # os.remove('test.xlsx')


jreback · 2015-10-02T22:01:48Z

@sinhrks ok some minor doc fixes. merge when ready.

sinhrks · 2015-10-03T03:00:27Z

Thanks, I updated the doc and attached the asv results. The results look random.

All benchmarks:

    before     after       ratio
  [5049b5  ] [53ac28  ]
    19.50ms    22.42ms      1.15  frame_methods.frame_repr_tall.time_frame_repr_talld
     1.30ms     1.43ms      1.10  groupby.series_value_counts.time_value_counts_int64
     3.41μs     3.74μs      1.10  indexing.indexing_frame_get_value.time_indexing_frame_get_value

sinhrks · 2015-10-03T03:02:42Z

@jreback Could you merge this when you prepare RC?

All: I'm willing to fix if anything is pointed out during RC.

jreback · 2015-10-03T14:41:25Z

@sinhrks (and @kawochen ) this is fantastic. quite a bit of work and lots of tests yeh!

sinhrks · 2015-10-03T22:24:44Z

I found that the doc layout can differ by environment. I'll update the notes to describe it.

1st looks OK, 2nd NG

1st looks NG, 2nd OK

sinhrks added Output-Formatting __repr__ of pandas objects, to_string Unicode Unicode strings labels Sep 15, 2015

sinhrks added this to the 0.17.1 milestone Sep 15, 2015

kawochen reviewed Sep 16, 2015

sinhrks force-pushed the unicode_justify branch 3 times, most recently from 1db82da to 5c84786 Compare September 27, 2015 01:47

jreback reviewed Sep 27, 2015

sinhrks force-pushed the unicode_justify branch 4 times, most recently from 7a03703 to 7de4215 Compare October 1, 2015 13:01

sinhrks force-pushed the unicode_justify branch 2 times, most recently from 75763af to 8a442b7 Compare October 2, 2015 15:09

sinhrks force-pushed the unicode_justify branch from 8a442b7 to 9ad73fd Compare October 2, 2015 15:21

sinhrks modified the milestones: 0.17.0, 0.17.1 Oct 2, 2015

sinhrks force-pushed the unicode_justify branch from 9ad73fd to 53ac289 Compare October 2, 2015 15:57

jreback reviewed Oct 2, 2015

sinhrks force-pushed the unicode_justify branch from 53ac289 to f2a9880 Compare October 3, 2015 01:28

ENH: Data formatting with unicode length

2a96074

sinhrks force-pushed the unicode_justify branch from f2a9880 to 2a96074 Compare October 3, 2015 02:10

sinhrks changed the title ~~(WIP) ENH: Data formatting with unicode length~~ ENH: Data formatting with unicode length Oct 3, 2015

jreback added a commit that referenced this pull request Oct 3, 2015

Merge pull request #11102 from sinhrks/unicode_justify

75cd3e8

ENH: Data formatting with unicode length

jreback merged commit 75cd3e8 into pandas-dev:master Oct 3, 2015

sinhrks deleted the unicode_justify branch October 3, 2015 17:37

sinhrks mentioned this pull request Oct 3, 2015

DOC: Add note about unicode layout #11231

Merged
