Skip to content

More Unicode, factor out pprinting of labels and names #2005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from Oct 12, 2012
Merged

More Unicode, factor out pprinting of labels and names #2005

merged 27 commits into from Oct 12, 2012

Conversation

ghost
Copy link

@ghost ghost commented Oct 2, 2012

addresses the same issue in 72ed09d applied to other occurences in the tree,
factored out the pprint code to a helper.

now supports pprinting of heterogeneous nested-nested-nested tuples with unicode strings,
FTW.

@ghost
Copy link
Author

ghost commented Oct 4, 2012

Ok, tests pass on 2.x and 3.x. I'd appreciate some feedback.

next step is to replace all stringify, strify and variants hroughout the
tree with calls to the new helpers defined at the bottom of format.py.

@ghost
Copy link
Author

ghost commented Oct 6, 2012

@jseabold - 2f013de removes one of your test, do you agree?

@ghost
Copy link
Author

ghost commented Oct 9, 2012

@wesm , can you give a general thumbs up/down on this PR?

it's getting a little big, and I'm hesitant to put in more effort unless
it has a good chance of being merged.

thanks

@jseabold
Copy link
Contributor

jseabold commented Oct 9, 2012

Can you describe at a high-level what you're doing, how it changes the API, workflow etc.? Or point me to where you've already done that? Would help me to think about the impact of the changes without worrying (yet) about implementation details.

@ghost
Copy link
Author

ghost commented Oct 9, 2012

sure.

what's this for
The general idea is to make pandas Unicode-friendly.

Unicode problems have been fixed many times before,but there are
still issues throughout the code base, and perhaps more importantly,
I think it's worthwhile to consolidate the previous fixes and general handling of
Unicode into a small, well-defined and documented set of functions to be used
everywhere.

PR #1994 already picked some low-hanging fruit, mainly having to do with
isinstance(str) checks which should have been isinstance(basestring).

This PR focuses on consolidating the code for the general task of pprinting
objects throughout pandas. 9e45f252d9 defines two workhorse functions which
tackle the pprinting of objects, and the rest of the commits in the series modify
existing functions throughout the tree to use them and add tests to document
previously broken behavior. the next series would get rid of the stringify*/strify
functions to use this (to-be) sanctioned API for pprinting as well.

These changes have a few merits:

  1. They localize (somewhat) the code which needs to do the "right"
    thing as far as Unicode is concerned. currently there's a lot of ad-hoc
    fixes which have built-up to fix previous bugs, this gave rise to code
    duplication and corner-cases which are not handled properly.

  2. They makes clearer the distinction between text and Unicode/str.
    the conflation of these two gives rise to most of the Unicode bugs
    I've seen so far. it makes it easier to reason about py3 compat as well.
    the former convention of using functions named str* is unfortunate
    because they are ambiguous. it would be an improvement to designate
    new function which emphasize the distinction between text and encoded
    strings/bytes

  3. so far, nothing should break (unless it's "plain wrong").

regarding the test removed by 65044e844ee88d98, I can see why it's useful
behavior for someone of that locale, but it means exceptions for other people.
if a default encoding is assumed, I can't see how it could be anything other
then utf-8, but currently the default encoding scattered in the code is "None".
chardet might give universal encoding detection ( it seems
to require longish input to work, but I might be wrong ), but until that's
implemented, or 'utf-8' becomes officially the default, IMO 65044e84 just
verifies that the wrong thing happens . (btw, I'm all for a utf-8 default encoding,
if it's done consistently, see PR #2006).

  1. The new pprint functions handle more cases gracefully, so that
    pandas feels friendlier to users outside the latin-1 world, or just
    people who use strange code-points in their labels.

an example of the problem:

In [130]: repr(u'\u03bb',)
Out[130]: "(u'\u03bb',)"

but a user would rather see something like:

In [130]: repr(u'\u03bb',)
Out[130]: (λ,)

...at least on consoles that support utf-8 or some suitable encoding,
most notably ipython qtconsole/notebook.
the code in this PR does that, and relies on print_config.encoding
(with improved detection burrowed from ipython) as the encoding for
things sent to the console.

some future PR might require API changes, which can be discussed then.
mainly around the question of mandatory encoding vs default encoding
vs. guessing an encoding.

there are still more things to do, if one of the commiters gives an ack on this,
I can go ahead knowing that i'm not wasting effort down this path.

# fix for IPython zmq frontends
try:
get_ipython() # ths will succeed under IPython
encoding = 'utf=8'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'utf-8'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange,unicode.encode accepts that for some bizzare reason.
will fix.

@wesm
Copy link
Member

wesm commented Oct 10, 2012

This all looks perfectly in the right direction-- as you can see I never mustered the nerve to do this cleanup myself. Carry on

y-p and others added 20 commits October 11, 2012 22:47
I think this is the wrong behaviour, and it breaks some
future unicode fixes. the constructor should  should complain
that no encoding was specified when the input is not ascii.
@ghost
Copy link
Author

ghost commented Oct 12, 2012

As this series is self-contained and makes no breaking changes,
I suggest that this be merged as-is, and further work will take
place in a seperate PR.

@wesm wesm merged commit 5fa2ae4 into pandas-dev:master Oct 12, 2012
@wesm
Copy link
Member

wesm commented Oct 12, 2012

Good to go; merged into master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants