
Unicode III : revenge of the character planes #2104

Closed
wants to merge 9 commits into from

Conversation

@ghost commented Oct 22, 2012

This is the 3rd installment in the unicode saga, and once again there are
a lot of issues to point out. Fair warning given.
#1994 cleaned things up to use basestring and unicode() over str() where needed.
#2005 consolidated and implemented unicode-friendly pretty-printing.

The attempt here is to ensure that the internal representation of strings is
always a unicode object, as is the case by default in python3.

It's not finished, but I'd like to get feedback.

A. Implementation

  1. core.encoding implements the functionality for traversing sequences and decoding
    strings into unicode objects in place. This includes keys/values of dicts (one
    of the many supported input structures for constructing pandas data objects).
    A standalone sketch of the idea follows this list.
  2. Pure ascii strings are left untouched.
  3. For strings containing unicode characters, we do actually change the input,
    but since getters are handled symmetrically, the result should be seamless (if you
    set a key using an encoded bytestring, invoking a getter with the same key should
    get you back the data, even though internally the data is stored under the
    equivalent unicode key).
  4. This decoding step is placed at "choke-points" such as constructors, getters and
    setters. I've probably missed a few at this point.
  5. The encoding to be used for decoding is specified in a configurable (which will
    be settable by the user, possibly via the mechanism in #2097, adding a core.config
    module to hold package-wide configurables).
  6. If decoding fails, the user gets a message explaining what happened. If the chardet
    package is available, the code tries to suggest an encoding to the user based
    on the data. chardet has enough false positives to make automatic detection
    a bad idea.
  7. As part of the unicode-or-bust theme, the csv reader now uses utf-8 by default,
    which forces the use of UnicodeReader, so you either get unicode or a decode error,
    forcing the user to specify an encoding.
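
To make the traversal concrete, here is a rough standalone sketch of the idea; the
name and details are illustrative, not the actual core.encoding code (in particular,
it returns converted objects rather than modifying them in place):

def decode_nested(obj, encoding='utf-8'):
    # sketch only: recursively convert byte strings to unicode,
    # leaving pure ascii byte strings untouched
    if isinstance(obj, str):                  # a byte string on python 2
        try:
            obj.decode('ascii')               # pure ascii stays a byte string
            return obj
        except UnicodeDecodeError:
            return obj.decode(encoding)       # non-ascii: decode to unicode
    if isinstance(obj, dict):
        # keys and values are both decoded, so a getter called with an encoded
        # bytestring key still finds the (now unicode) key once it is decoded too
        return dict((decode_nested(k, encoding), decode_nested(v, encoding))
                    for k, v in obj.iteritems())
    if isinstance(obj, list):
        return [decode_nested(x, encoding) for x in obj]
    return obj                                # numbers, unicode, etc. pass through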

B. Benefits

  1. Making internal representations unicode should help reduce the whack-a-mole
    which has been going on.
  2. Once things are guaranteed unicode, lib code can stop checking for and handling
    corner-cases.
  3. Mixing unicode and encoded byte-strings, or byte-strings encoded with
    different encodings, can create insoluble situations.
    A related example is fmt._put_lines(), which was altered to fix DataFrame.to_html
    encoding (#891) so it would handle the case of mixing pure ascii with unicode.
    However, it now fails when unicode is mixed with encoded byte-strings (which makes
    repr_html() fail, yielding sometimes-html-sometimes-text repr confusion). This PR
    fixes that problem amongst other things. A minimal failing example follows this list.
  4. Since python encodes input from the console using the console encoding, we can
    convert things to unicode transparently for the user, so the behaviour becomes
    closer to that of python3 - enter a string at the console, and you get unicode by
    default (no need for a unicode literal).
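
As a minimal illustration of the problem described in item 3 (a Python 2 session;
the implicit ascii decode fails as soon as a non-ascii byte string meets unicode):

>>> lines = [u'unicode repr: \u00e9', 'latin-1 bytes: \xe9']
>>> '\n'.join(lines)   # python 2 falls back to an implicit ascii decode of the byte string
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 15: ordinal not in range(128)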

C. Disadvantages

  1. It's hackish.
  2. There's a performance hit (@wesm, what benchmarks would you like to see?)
  3. Immutable sequences are currently not handled, not just because of copying, but
    also because an experiment converting tuples raised all sorts of mysterious
    problems, in cython code amongst other things.
  4. Probably difficult to cover every possible entry-point for byte-strings.
  5. Not sure about c-parser compatibility issues.

D. Points to consider (feedback welcome)

  1. There will be a configurable to turn all of this on or off. Not sure what the
    default should be though.
  2. Should the encoding used for decoding be specifiable per-object, or just
    a global configurable available to the user?
  3. What's the best way to deal with immutable sequences?

The configurables will be sorted out later (Pending #2097?).

y-p added 9 commits October 21, 2012 15:39
`Index` inherits from np.ndarray, yet implements an immutable datatype.
Pythonic duck-typing would suggest that the presence of __setitem__
signals a mutable datatype, while the overridden implementation
just raises an exception - a bait and switch.
Python does not offer a really clean way to eliminate an attribute
inherited from a superclass, but overriding __getattribute__ gives
us the same end result.
squash with 218fe0a
…re mixed

This case can cause repr_html to fail.
Now that internal representation is unicode, we can
make things more streamlined, the code here which
assumes all bytestrings are utf-8 encoded can be trimmed.
Incorporate into getters, setters and constructors
@wesm (Member) commented Oct 22, 2012

I haven't looked carefully yet, but it looks like this is decoding all strings (e.g. those contained in a Series or DataFrame index) to Unicode? The cost of this in Python 2.x would be pretty extraordinary if so. Have to look some more when I have some more time.

@ghost (Author) commented Oct 22, 2012

Yes, this one is a lot more aggressive than the previous two, perhaps too much.
It could be just an option, disabled by default.

Doesn't python3 have to do just as much decoding upfront? (Although without traversal.)

@wesm (Member) commented Oct 22, 2012

Probably, but most people running pandas in production where performance (and memory use) matters are on 2.x. Foisting unicode on them seems like probably an undue burden. Making it globally configurable may not be a bad idea though. I would need to do some testing to see, anyway

I'll have to think about what to do in the c-parser branch, because Python3 and unicode handling there is a bit of an unknown. I think the fast-parser is going to have to be bytes-only for a while until a mysterious superhero wants to do a lot of C hacking with unicode. The performance parsing bytes is going to be the maximum, and that represents the majority of real-world use cases. Perhaps only parsing unicode via the pure-Python interface for now makes sense

@ghost (Author) commented Oct 23, 2012

Fair enough.

I use pandas with modest datasets where unicode is a problem and perf.
is not. So having such an option would be a win.

IMO, for pandas to be "Unicode-safe" something like this is needed.
The responsibility for being consistent can be put on the user (want unicode?
pass in unicode) or enforced by something like this. But the code should stop
trying to support mixing unicode with encoded byte-strings.

The code actually does more work than strictly needed, because
it tries to preserve pure ascii as bytes. I have to think more about whether that's
a good or bad idea.

If #2097 or something to that end is merged, I'll make sure everything can be turned
on/off by the user, and the default can be chosen after doing some benchmarks
(but it looks like off is the way to go).

Let me know what changes you think are needed.

@ghost (Author) commented Oct 23, 2012

Regarding c-parser, is it bytes-only or ascii-only? I'm assuming you're still calling
into the python constructors to actually build the pandas objects, so if the parser can handle utf-8
bytes, it may actually not be a challenge to wrap (again - aside from performance).

@wesm (Member) commented Oct 23, 2012

The tokenizer is written assuming you are receiving chunks of data as char*.

@ghost (Author) commented Oct 24, 2012

@wesm, I think this might be easier than you think. The way you've built this is very flexible.

parser.pyx accepts file-like objects. codecs can provide online decoding-encoding from
the source encoding into utf-8 for consumption by c-parser (and pandas already does this
in some places).

utf-8 leaves ascii (specifically digits and punctuation) untouched, so the inferencers
are unaffected, and utf-8 strings are passed through just like ascii strings.
At that point, the decode framework from this PR would process the data
when you call into DataFrame or what have you, and make whatever is needed into unicode.
Hey presto.

I did a quick test with the unicode_series.csv file from pandas/tests/data
(which uses latin-1 encoding), and got utf-8 data back like a champ. I think this
is pretty much working out of the box right now.

Here's the snippet for testing:

from pandas._parser import TextReader
import codecs

f = open("pandas/tests/data/unicode_series.csv", "rb")
f2 = codecs.EncodedFile(f, "utf-8", "latin-1")  # read in latin-1, convert to utf-8
r = TextReader(f2)
res = r.read()
s = res[1][-3].decode('utf-8')
print s
assert type(s) == unicode

Going even further: if I understand the code correctly, the fallback inferencer in parser.pyx,
_string_box_factorize, is responsible for boxing strings into python objects. It's easy enough
at that point to decode things from utf-8 byte strings into unicode objects, so that there is
no need for the clumsy (but general) approach used by this PR, with traversal over the entire
data structure, checking types and looking for strings to decode. Most fields are usually numbers
anyway, so that traversal is very inefficient. I think doing that would actually keep the massive
speedup of the new code AND have unicode() representations internally.
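
As a rough illustration of that idea (this is not the actual parser.pyx code, just a
sketch of decoding at the boxing step, with hypothetical names):

def box_and_factorize(byte_fields, encoding='utf-8'):
    # decode each distinct utf-8 field to unicode exactly once while boxing,
    # so no traversal of the finished data structure is needed afterwards
    seen = {}
    out = []
    for raw in byte_fields:
        obj = seen.get(raw)
        if obj is None:
            obj = seen[raw] = raw.decode(encoding)
        out.append(obj)
    return out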

If that's true, then the performance issues of incorporating this PR as on-by-default fall away,
because large datasets, where performance really matters, will practically always be read
from files, and if that is kept fast, everything else can afford to do things less optimally.

All very encouraging I think.

As a sidenote - I noticed that parser.c has convert_infer, which does not seem to be used anywhere
(I commented it out and everything works, grep showed no usage), and it looks like that's the
only path to using _inference_order and the c-native inferencers. It looks like all the inferencing
is actually done by the cython in parser.pyx. Please correct me if I'm wrong.

@ghost (Author) commented Oct 24, 2012

And you get to unify all the csv readers into one. Bonus.

@ghost (Author) commented Oct 24, 2012

I ran some benchmarks against the datasets from the blog: zeros.csv and matrix.csv.
It looks like the brute-force approach here gives roughly a 10x performance hit. No good.

I still need to check if converting _string_box_factorize to emit unicode yields
something more reasonable.

@ghost (Author) commented Oct 27, 2012

Filling in the memory use aspect of the issue:

In py2.x and 3.1/3.2, python might use ucs-2 or ucs-4 internally (check with
sys.maxunicode), and so use up either 2x or 4x the memory,
respectively, over pure ascii or a legacy codepage that uses one byte
per character (on my debian system, the default python uses ucs-4).

On 3.3, which implemented pep-393, the in-memory representation is
adaptive in order to save memory, and will use one byte per character
for ascii, and 2 bytes for most other codepoints you would care about
(those in the BMP).

pep-393 suggests that, like py3.3, py2.7 (earlier?) uses just 1
byte per character to represent ascii strings, though the docs do not
mention this.

Update: this is not the case. Ascii stored as unicode in 2.7 suffers from the same bloat;
it's just that on 2.7 you have the option of using str() rather than unicode, an option
removed by the 3.x branch. 3.3 does however provide the goods described in pep-393 and is
much more efficient.

If you are using utf-8 to store your non-ascii data and keeping it in
memory as an encoded byte string, the equivalent unicode
representation would most likely use either the same amount of memory
(python with ucs-2) or 2x the memory (ucs-4), since utf-8 itself uses 2 bytes
for most non-ascii characters.
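
For reference, here is a quick way to check what a given 2.x interpreter does
(a rough sketch; the exact getsizeof numbers vary with the build and interpreter version):

import sys

print sys.maxunicode               # 65535 -> ucs-2 build, 1114111 -> ucs-4 build
print sys.getsizeof('abcdefgh')    # an 8-char byte string
print sys.getsizeof(u'abcdefgh')   # the same text as unicode - noticeably larger on 2.x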

@wesm (Member) commented Oct 27, 2012

I don't think making everything unicode by default is going to work because of the memory usage (and resulting performance, in some places) issue. The question is how to indicate to the parser that it should decode utf-8 (or whichever encoding) to unicode in the string boxing function. I'm planning a better strategy for handling strings in pandas in general, so what's "under the hood" may change at some point in the future

@ghost (Author) commented Oct 28, 2012

I was surprised by how performant #2130 actually is, and I don't see the case for memory
being a problem in real life, as long as it's true that ascii stored as unicode objects has no
significant overhead (update: which turns out not to be true, so...). Having said that, I'd be
perfectly happy with opt-in unicode, as long as it's easy to switch on and the rest of pandas
then behaves as expected.

Now that c-parser has raised the bar, unicode users would be truly penalized without something
like #2130 being available for use.

I'm curious to see the next step as far as string handling is concerned; I'll leave unicode alone
till then. This PR was an interesting experiment but, ultimately, just too hideous to really merge.
Oh well.

@ghost closed this Oct 28, 2012
@ghost (Author) commented Oct 29, 2012

@wesm, 61f3405 for your cherry-picking consideration.

@wesm (Member) commented Oct 31, 2012

@y-p going to cherry-pick that one. I worked on the c-parser branch a bunch today and yesterday. It works on Python 3 and has an encoding option (so that all strings will be decoded to unicode). Unicode input (e.g. in a StringIO) will get encoded to UTF-8 when passed to the parser; I guess this could be configurable.

@wesm (Member) commented Nov 2, 2012

I have to undo 61f3405 immediately. All of a sudden Index.__getattribute__ is dogging the performance of tons of things:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.081    0.081   25.520   25.520 <string>:1(<module>)
        1    0.369    0.369   25.404   25.404 groupby.py:242(apply)
        1    0.369    0.369   25.035   25.035 groupby.py:418(_python_apply_general)
        1    0.000    0.000    9.713    9.713 groupby.py:1671(_wrap_applied_output)
        1    0.001    0.001    9.713    9.713 groupby.py:445(_concat_objects)
        1    0.000    0.000    9.693    9.693 merge.py:828(concat)
    48001    0.095    0.000    9.402    0.000 groupby.py:531(get_iterator)
    48000    0.468    0.000    8.944    0.000 frame.py:2584(take)
        1    0.014    0.014    6.346    6.346 merge.py:950(get_result)
        1    0.135    0.135    6.321    6.321 merge.py:976(_get_concatenated_data)
    96002    0.398    0.000    5.587    0.000 frame.py:331(__init__)
        5    0.489    0.098    5.585    1.117 merge.py:1052(_concat_single_item)
    48000    0.121    0.000    5.134    0.000 <string>:1(<lambda>)
    48000    0.675    0.000    5.013    0.000 frame.py:3660(shift)
3120099/1776071    2.901    0.000    4.978    0.000 index.py:308(__getattribute__)
   240000    0.245    0.000    4.158    0.000 internals.py:826(get)
        1    0.055    0.055    3.347    3.347 merge.py:888(__init__)
    96002    0.301    0.000    3.200    0.000 internals.py:484(__init__)
    48000    0.272    0.000    3.131    0.000 frame.py:477(_init_ndarray)
        1    0.000    0.000    2.929    2.929 merge.py:1093(_get_new_axes)
        1    0.029    0.029    2.883    2.883 merge.py:1129(_get_concat

@ghost (Author) commented Nov 2, 2012

I suppose that outweighs the "correctness" win. Too bad.

@wesm (Member) commented Nov 2, 2012

The Index object needs some work; I will definitely keep this in mind

This pull request was closed.