More Unicode, factor out pprinting of labels and names #2005

ghost · 2012-10-02T00:39:30Z

addresses the same issue as 72ed09d, applied to other occurrences in the tree,
and factors out the pprint code into a helper.

now supports pprinting of heterogeneous nested-nested-nested tuples with unicode strings,
FTW.

ghost · 2012-10-04T21:27:26Z

Ok, tests pass on 2.x and 3.x. I'd appreciate some feedback.

next step is to replace all stringify, strify and variants throughout the
tree with calls to the new helpers defined at the bottom of format.py.

ghost · 2012-10-06T05:47:56Z

@jseabold - 2f013de removes one of your tests, do you agree?

ghost · 2012-10-09T19:16:35Z

@wesm, can you give a general thumbs up/down on this PR?

it's getting a little big, and I'm hesitant to put in more effort unless
it has a good chance of being merged.

thanks

jseabold · 2012-10-09T20:34:07Z

Can you describe at a high-level what you're doing, how it changes the API, workflow etc.? Or point me to where you've already done that? Would help me to think about the impact of the changes without worrying (yet) about implementation details.

ghost · 2012-10-09T22:15:15Z

sure.

what's this for
The general idea is to make pandas Unicode-friendly.

Unicode problems have been fixed many times before, but there are
still issues throughout the code base, and perhaps more importantly,
I think it's worthwhile to consolidate the previous fixes and general handling of
Unicode into a small, well-defined and documented set of functions to be used
everywhere.

PR #1994 already picked some low-hanging fruit, mainly having to do with
isinstance(str) checks which should have been isinstance(basestring).
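
To illustrate the kind of check involved, here is a minimal sketch that runs on Python 3, where `basestring` no longer exists. The names `string_types` and `is_text` are hypothetical and are not taken from pandas or from PR #1994.

```python
# Hypothetical compatibility shim; not the pandas implementation.
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)  # Python 3: str is already Unicode text


def is_text(obj):
    """Return True for any string type accepted on the running interpreter."""
    return isinstance(obj, string_types)


print(is_text(u"\u03bb"))  # True
print(is_text(42))         # False
```

On Python 2 a plain `isinstance(obj, str)` check would reject `unicode` objects, which is the kind of bug described above.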

This PR focuses on consolidating the code for the general task of pprinting
objects throughout pandas. 9e45f252d9 defines two workhorse functions which
tackle the pprinting of objects, and the rest of the commits in the series modify
existing functions throughout the tree to use them and add tests to document
previously broken behavior. the next series would get rid of the stringify*/strify
functions to use this (to-be) sanctioned API for pprinting as well.
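
As a rough illustration of what a single pprinting helper for nested containers might look like, here is a sketch written for Python 3. The name `pprint_sketch` and its behavior are assumptions made for this example; it is not the code added in 9e45f252d9.

```python
def pprint_sketch(thing):
    """Render nested tuples and lists as text without escaping non-ASCII."""
    if isinstance(thing, bytes):
        # Assumption: bytes are treated as utf-8, with undecodable bytes replaced.
        return thing.decode("utf-8", "replace")
    if isinstance(thing, (tuple, list)):
        inner = ", ".join(pprint_sketch(item) for item in thing)
        if isinstance(thing, tuple):
            # Keep the trailing comma so 1-tuples stay recognizable.
            return "(%s,)" % inner if len(thing) == 1 else "(%s)" % inner
        return "[%s]" % inner
    return str(thing)


nested = (u"\u03bb", (1, (u"\u00e9", 2.5)), [u"x"])
print(pprint_sketch(nested))        # (λ, (1, (é, 2.5)), [x])
print(pprint_sketch((u"\u03bb",)))  # (λ,)
```

The point of routing everything through one function is that the recursion and the text/bytes decision live in exactly one place.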

These changes have a few merits:

- They localize (somewhat) the code which needs to do the "right"
  thing as far as Unicode is concerned. Currently there are a lot of ad-hoc
  fixes which have built up to fix previous bugs; this gave rise to code
  duplication and corner cases which are not handled properly.
- They make clearer the distinction between text and Unicode/str.
  The conflation of these two gives rise to most of the Unicode bugs
  I've seen so far. It makes it easier to reason about py3 compat as well.
  The former convention of using functions named str* is unfortunate
  because the names are ambiguous. It would be an improvement to designate
  new functions which emphasize the distinction between text and encoded
  strings/bytes.
- So far, nothing should break (unless it's "plain wrong").

regarding the test removed by 65044e844ee88d98, I can see why it's useful
behavior for someone of that locale, but it means exceptions for other people.
if a default encoding is assumed, I can't see how it could be anything other
than utf-8, but currently the default encoding scattered in the code is "None".
chardet might give universal encoding detection (it seems
to require longish input to work, but I might be wrong), but until that's
implemented, or 'utf-8' officially becomes the default, IMO 65044e84 just
verifies that the wrong thing happens. (btw, I'm all for a utf-8 default encoding,
if it's done consistently, see PR #2006).
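
A small sketch of why the choice of default encoding matters, written for Python 3. The sample bytes are made up for illustration and have nothing to do with the removed test.

```python
# Bytes as they might arrive from a file written by a utf-8 program.
raw = u"caf\u00e9".encode("utf-8")

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Non-ASCII input cannot be decoded without knowing its encoding.
    ascii_ok = False

print(ascii_ok)                                 # False
print(raw.decode("utf-8") == u"caf\u00e9")      # True
print(raw.decode("latin-1") == u"caf\u00e9")    # False: decodes, but to the wrong text
```

Guessing a locale-specific encoding succeeds silently with the wrong result, while requiring an encoding for non-ASCII input fails loudly.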

The new pprint functions handle more cases gracefully, so that
pandas feels friendlier to users outside the latin-1 world, or just
people who use strange code-points in their labels.

an example of the problem:

In [130]: repr((u'\u03bb',))
Out[130]: "(u'\\u03bb',)"

but a user would rather see something like:

In [130]: repr((u'\u03bb',))
Out[130]: (λ,)

...at least on consoles that support utf-8 or some suitable encoding,
most notably ipython qtconsole/notebook.
the code in this PR does that, and relies on print_config.encoding
(with improved detection borrowed from ipython) as the encoding for
things sent to the console.
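
A minimal sketch of encoding text for the console, written for Python 3. The function name `console_encode_sketch` and the fallback to utf-8 are assumptions for this example; the detection logic behind print_config.encoding is not reproduced here.

```python
import sys


def console_encode_sketch(text, encoding=None):
    """Encode text for output, replacing characters the target cannot represent."""
    if encoding is None:
        # Assumption: fall back to utf-8 when the stream reports no encoding.
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, "replace")


print(console_encode_sketch(u"\u03bb", "utf-8"))  # b'\xce\xbb'
print(console_encode_sketch(u"\u03bb", "ascii"))  # b'?'
```

Using the "replace" error handler means a console with a limited encoding shows a placeholder instead of raising an exception.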

some future PR might require API changes, which can be discussed then.
mainly around the question of mandatory encoding vs default encoding
vs. guessing an encoding.

there are still more things to do. if one of the committers gives an ack on this,
I can go ahead knowing that I'm not wasting effort down this path.

wesm · 2012-10-10T15:10:20Z

pandas/core/format.py

+            # fix for IPython zmq frontends
+            try:
+                get_ipython() # ths will succeed under IPython
+                encoding = 'utf=8'


ghost · 2012-10-10

strange, unicode.encode accepts that for some bizarre reason.
will fix.
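
For reference, the behavior can be reproduced with the standard library. As far as I can tell, the codec lookup normalizes punctuation in encoding names, so the misspelling resolves to the utf-8 codec instead of raising `LookupError`. A short sketch, run on Python 3:

```python
import codecs

# The misspelled name resolves to the same codec as the correct one.
info = codecs.lookup("utf=8")
print(info.name)  # utf-8

# Encoding with either spelling gives identical bytes.
print(u"\u03bb".encode("utf=8") == u"\u03bb".encode("utf-8"))  # True
```

The typo is therefore harmless at runtime, but still worth fixing for readability.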

wesm · 2012-10-10T15:14:29Z

This all looks to be going in exactly the right direction; as you can see, I never mustered the nerve to do this cleanup myself. Carry on.

ghost · 2012-10-12T13:12:01Z

As this series is self-contained and makes no breaking changes,
I suggest that this be merged as-is, and that further work take
place in a separate PR.

wesm · 2012-10-12T23:36:15Z

Good to go; merged into master

ghost mentioned this pull request Oct 2, 2012

always use UnicodeWriter for csv, default to utf-8 #2006

Closed

wesm reviewed Oct 10, 2012

ghost mentioned this pull request Oct 11, 2012

BUG: fix Series repr when name is tuple holding non string-type #2059

Closed

y-p and others added 20 commits October 11, 2012 22:47

BUG: index.format should accept unicode index names

03401c2

CLN: Move test_console_encode out of wrong test class

7fdeccb

TST: unless a file is pure ascii, you must specify an encoding

2c086e4

TST: remove fmt.test_to_string_force_unicode

5d03a6b

I think this is the wrong behaviour, and it breaks some future unicode fixes. the constructor should complain that no encoding was specified when the input is not ascii.

CLN: Move _is_sequence() from pd.frame to pd.common, with other is_*

d3c062b

TST: add test for _is_sequence()

4c337f5

ENH: rework console encoding detection in fmt.print_config

cbeff93

ENH: Add helpers to pd.common: pprint_thing/_encoded(),console_encode()

d859d15

TST: add test_pprint_thing()

fb3e8e2

TST: Series repr fails when name is tuple holding non string-type #2051

11dff0d

ENH: SeriesFormatter footer repr now uses pprint_thing()

17d9c12

fixes #2051

ENH: explicitly encode retval of SeriesFormatter.to_string() with con…

15a78cf

…sole_encode()

ENH: Index summary() and format() now delegate to pprint_thing()

55b4631

ENH: tseries.Index.summary() now delegates to pprint_thing()

f24f772

BUG: TextReader._explicit_index_names() should allow for unicode inde…

00f2a97

…x_name

BUG: parsers._concat_date_cols should accept unicode

0e11730

TST: test dataframe to_csv() with unicode index and columns

6a197ce

TST: test series to_csv() with unicode index

a9896a6

BUG: csvwriter writerow() now delegates to pprint_thing() for non-tex…

c9c0f95

…t objects

TST: add test for UnicodeWriter with csv.QUOTE_NONNUMERIC

c907f6f

y-p added 7 commits October 11, 2012 22:47

ENH: add is_number() helper to pd.core.common

2e1001d

ENH: UnicodeWriter (CSV) now supports quoting=csv.QUOTE_NONNUMERIC

d115c86

CLN: Expunge stringify_seq() in favor of pprint_thing() in Index.form…

6677d39

…at()

BUG: Add checks to df,series repr() to handle python3

7567b65

TST: repr() should return type str() on py2 and py3

d968508

ENH: Index.__repr__ now uses pprint_thing/_encoded().

95678eb

result should not change, unless unicode is present.

CLN: Abolish stringify and _strify in favor of pprint_thing()

5fa2ae4

ghost mentioned this pull request Oct 11, 2012

Configurability of unicode/console encoding #1654

Closed

wesm merged commit 5fa2ae4 into pandas-dev:master Oct 12, 2012

ghost mentioned this pull request Oct 22, 2012

Unicode III : revenge of the character planes #2104

Closed
