-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Data formatting with unicode length #11102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
How much slower is this than the default behavior? My guess is that anyone who uses these characters will want this. Usually we're not printing enough data to the screen for performance on this sort of stuff to matter. |
|
||
def east_asian_len(data, encoding=None): | ||
""" | ||
Calcurate display width considering unicode East Asian Width |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo - calculate
|
1db82da
to
5c84786
Compare
OK, this PR should work all cases which I'm aware of. Appreciated if anyone provide further test cases if any concerns. @shoyer Yes, east-asian prefer this to be default
The affect is almost the same as all ascii (same condition except for characters are all ascii):
@kawochen I may not properly understand, but reducing East Asian Width category will not affect to the performance because dict lookup is O(1).
Do you have any idea about affected characters and special handling logic? My concern is these characters can not be aligned properly even if we tried so. CC: @ayapi |
If I understand correctly, because pandas does not actually print every element of large dataframes, so printing a larger DataFrame would be the same speed? How does this patch effect the speed of printing Unicode text if it only contains ASCII characters? |
@shoyer Correct. Larger data can be printed almost the same speed. Perf is not affected by data size once it exceeds Result for all ascii are described in above. Almost 2 times longer, because the logic is the same. It can be short-passed if we can distinguish whether input has 2 bytes char or not including symbols. Is it possible using any built-in?. |
I would probably try tweaking the string length check and seeing if a simple variant can give you a speed up. But generally this is pretty reasonable already. Few dataframes will show this many strings. On Sat, Sep 26, 2015 at 8:16 PM, Sinhrks [email protected] wrote:
|
@sinhrks Yes the optimization I mentioned is so minor I don't know why I brought it up. It makes |
else: | ||
name = strlen.__name__ | ||
|
||
if name == 'east_asian_len': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass a different pad function if you are in east_asian_len
mode, then then you don't need the if/then block that duplicates code
7a03703
to
7de4215
Compare
@jreback I've refactored a little based on your comments. I've created @kawochen OK, I leave current Also, please provide the list of ambiguous characters and its width. As long as I understand, these ambiguous character widths are not integral multiple, thus these cannot be aligned by padding with white spaces. I don't think we can fix it because it is unicode spec. |
looks nice! can u run a perf check on relevant benchmarks to assert its about the same as current? add a release note and can do for 0.17.0 |
@sinhrks ambiguous characters are either wide or narrow. see here for list http://unicode.org/reports/tr11-2/ |
@kawochen Ok. Backed to first discussion, it can't be support it without clarifying the logic. I can't find the logic per characters from your link.... Maybe I misunderstood sonething? Pls provide actual code works as your expectation. |
@sinhrks There is no per character logic. All ambiguous characters are narrow on my terminal, but that's just my settings. Users should know whether it should be treated as wide or narrow, which depends on where the characters are being printed, so making it configurable might make sense. I don't think that can be figured out from within Python. You can try printing the characters in the list. |
Thanks. Maybe I could understand a little. From the link:
When mapping Unicode to East Asian legacy character encodings: Ambiguous should be handled ad full-width (length=2). It can be covered by the impl added by the PR ( When mapping Unicode to non-East Asian legacy character encodings: Because full width are not mapped (cannot appear), all characters including ambiguous can be regarded as half-width (length=1). It should be corresponding to the current default What the situation requires a separate option only for Ambiguous? |
@sinhrks when we are not mapping to legacy encodings. For example in my terminal Chinese characters are twice as wide as ambiguous characters. Whether it should be 1 or 2 depends on where and how the data will be displayed. In the unlikely case when we wanted it to look pretty in Arial then we'd print tabs to align stuff. In mono space fonts it's easier so we can use spaces. In modern terminals ambiguous characters are usually narrow. |
@kawochen Can you describe with screenshots and code which I can confirm on my terminal? |
Understood, how about the name |
75763af
to
8a442b7
Compare
Updated to add release note and terminalLatter case cannot be aligned because of the mismatch between terminal and pandas option. Sphinx/JupyterGoing to add note to say "This should be aligned properly in terminal which uses monospaced font." I'll update perf comparison tomorrow. |
8a442b7
to
9ad73fd
Compare
9ad73fd
to
53ac289
Compare
@@ -304,7 +305,7 @@ See the :ref:`documentation <io.excel>` for more details. | |||
:suppress: | |||
|
|||
import os | |||
os.remove('test.xlsx') | |||
# os.remove('test.xlsx') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
@sinhrks ok some minor doc fixes. merge when ready. |
53ac289
to
f2a9880
Compare
f2a9880
to
2a96074
Compare
Thanks, updated doc and asv result attached. Results looks random.
|
@jreback Could you merge this when you prepare RC? All: I'm willing to fix if anything is pointed out during RC. |
ENH: Data formatting with unicode length
Closes #2612. Added
display.unicode.east_asian_width
options, which calculate text width considering East Asian Width. Enabling this option affects to a performance as width must be calculated per characters.Current results (captured)