
Unicode III : revenge of the character planes #2104

Closed
wants to merge 9 commits into from

Conversation

@ghost commented Oct 22, 2012

This is the 3rd installment in the unicode saga, and once again there are
a lot of issues to point out. Fair warning given.
#1994 cleaned things up to use basestring and unicode() over str() where needed.
#2005 consolidated and implemented unicode-friendly pretty-printing.

The attempt here is to ensure that the internal representation of strings is
always a unicode object, as is the case by default in python3.

It's not finished, but I'd like to get feedback.

A. Implementation

  1. core.encoding implements the functionality for traversing sequences and decoding
    strings into unicode objects in place. This includes keys/values of dicts (one
    of the many supported input structures for constructing pandas data objects).
    A standalone sketch of the idea follows this list.
  2. Pure ascii strings are left untouched.
  3. For strings containing unicode characters, we do actually change the input,
    but since getters are handled symmetrically, the result should be seamless (if you
    set a key using an encoded bytestring, invoking a getter with the same key should
    get you back the data, even though internally the data is stored under the
    equivalent unicode key).
  4. This decoding step is placed at "choke-points" such as constructors, getters and
    setters. I've probably missed a few at this point.
  5. The encoding to be used for decoding is specified in a configurable (which will
    be settable by the user, possibly via the mechanism in #2097, adding a core.config
    module to hold package-wide configurables).
  6. If decoding fails, the user gets a message explaining what happened. If the chardet
    package is available, the code tries to suggest an encoding to the user based
    on the data. chardet has enough false positives to make automatic detection
    a bad idea.
  7. As part of the unicode-or-bust theme, the csv reader now uses utf-8 by default,
    which forces the use of UnicodeReader, so you either get unicode or a decode error,
    forcing the user to specify an encoding.
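
To make the traversal concrete, here is a rough standalone sketch of the idea; the
name and details are illustrative, not the actual core.encoding code (in particular,
it returns converted objects rather than modifying them in place):

def decode_nested(obj, encoding='utf-8'):
    # sketch only: recursively convert byte strings to unicode,
    # leaving pure ascii byte strings untouched
    if isinstance(obj, str):                  # a byte string on python 2
        try:
            obj.decode('ascii')               # pure ascii stays a byte string
            return obj
        except UnicodeDecodeError:
            return obj.decode(encoding)       # non-ascii: decode to unicode
    if isinstance(obj, dict):
        # keys and values are both decoded, so a getter called with an encoded
        # bytestring key still finds the (now unicode) key once it is decoded too
        return dict((decode_nested(k, encoding), decode_nested(v, encoding))
                    for k, v in obj.iteritems())
    if isinstance(obj, list):
        return [decode_nested(x, encoding) for x in obj]
    return obj                                # numbers, unicode, etc. pass through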

B. Benefits

  1. Making internal representations unicode should help reduce the whack-a-mole
    which has been going on.
  2. Once things are guaranteed unicode, lib code can stop checking for and handling
    corner-cases.
  3. Mixing unicode and encoded byte-strings, or byte-strings encoded with
    different encodings, can create insoluble situations.
    A related example is fmt._put_lines(), which was altered to fix DataFrame.to_html
    encoding (#891) so it would handle the case of mixing pure ascii with unicode.
    However, it now fails when unicode is mixed with encoded byte-strings (which makes
    repr_html() fail, yielding sometimes-html-sometimes-text repr confusion). This PR
    fixes that problem amongst other things. A minimal failing example follows this list.
  4. Since python encodes input from the console using the console encoding, we can
    convert things to unicode transparently for the user, so the behaviour becomes
    closer to that of python3 - enter a string at the console, and you get unicode by
    default (no need for a unicode literal).
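
As a minimal illustration of the problem described in item 3 (a Python 2 session;
the implicit ascii decode fails as soon as a non-ascii byte string meets unicode):

>>> lines = [u'unicode repr: \u00e9', 'latin-1 bytes: \xe9']
>>> '\n'.join(lines)   # python 2 falls back to an implicit ascii decode of the byte string
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 15: ordinal not in range(128)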

C. Disadvantages

  1. It's hackish.
  2. There's a performance hit (@wesm, what benchmarks would you like to see?)
  3. Immutable sequences are currently not handled, not just because of copying, but
    also because an experiment converting tuples raised all sorts of mysterious
    problems, in cython code amongst other things.
  4. Probably difficult to cover every possible entry-point for byte-strings.
  5. Not sure about c-parser compatibility issues.

D. Points to consider (feedback welcome)

  1. There will be a configurable to turn all of this on or off. Not sure what the
    default should be though.
  2. Should the encoding used for decoding be specifiable per-object, or just
    a global configurable available to the user?
  3. What's the best way to deal with immutable sequences?

The configurables will be sorted out later (Pending #2097?).

y-p added 9 commits October 21, 2012 15:39
`Index` inherits from np.ndarray, yet implements an immutable datatype.
Pythonic duck-typing would suggest that the presence of __setitem__
signals a mutable datatype, while the overridden implementation
just raises an exception - a bait and switch.
Python does not offer a really clean way to eliminate an attribute
inherited from a superclass, but overriding __getattribute__ gives
us the same end result.
squash with 218fe0a
…re mixed

This case can cause repr_html to fail.
Now that internal representation is unicode, we can
make things more streamlined, the code here which
assumes all bytestrings are utf-8 encoded can be trimmed.
Incorporate into getters, setters and constructors
@wesm (Member) commented Oct 22, 2012

I haven't looked carefully yet, but it looks like this is decoding all strings (e.g. those contained in a Series or DataFrame index) to Unicode? The cost of this in Python 2.x would be pretty extraordinary if so. Have to look some more when I have some more time.

@ghost (Author) commented Oct 22, 2012

Yes, this one is a lot more aggressive than the previous two, perhaps too much.
It could be just an option, disabled by default.

Doesn't python3 have to do just as much decoding upfront? (Although without traversal.)

@wesm (Member) commented Oct 22, 2012

Probably, but most people running pandas in production where performance (and memory use) matters are on 2.x. Foisting unicode on them seems like probably an undue burden. Making it globally configurable may not be a bad idea though. I would need to do some testing to see, anyway

I'll have to think about what to do in the c-parser branch, because Python3 and unicode handling there is a bit of an unknown. I think the fast-parser is going to have to be bytes-only for a while until a mysterious superhero wants to do a lot of C hacking with unicode. The performance parsing bytes is going to be the maximum, and that represents the majority of real-world use cases. Perhaps only parsing unicode via the pure-Python interface for now makes sense

@ghost (Author) commented Oct 23, 2012

Fair enough.

I use pandas with modest datasets where unicode is a problem and perf.
is not. So having such an option would be a win.

IMO, for pandas to be "Unicode-safe" something like this is needed.
The responsibility for being consistent can be put on the user (want unicode?
pass in unicode) or enforced by something like this. But the code should stop
trying to support mixing unicode with encoded byte-strings.

The code actually does more work than strictly needed, because
it tries to preserve pure ascii as bytes. I have to think more about whether that's
a good or bad idea.

If #2097 or something to that end is merged, I'll make sure everything can be turned
on/off by the user, and the default can be chosen after doing some benchmarks
(but it looks like off is the way to go).

Let me know what changes you think are needed.

@ghost (Author) commented Oct 23, 2012

Regarding c-parser, is it bytes-only or ascii-only? I'm assuming you're still calling
into the python constructors to actually build the pandas objects, so if the parser can handle utf-8
bytes, it may actually not be a challenge to wrap (again - aside from performance).

@wesm (Member) commented Oct 23, 2012

The tokenizer is written assuming you are receiving chunks of data as char*.

@ghost (Author) commented Oct 24, 2012

@wesm, I think this might be easier than you think. The way you've built this is very flexible.

parser.pyx accepts file-like objects. codecs can provide online decoding-encoding from
the source encoding into utf-8 for consumption by c-parser (and pandas already does this
in some places).

utf-8 leaves ascii (specifically digits and punctuation) untouched, so the inferencers
are unaffected, and utf-8 strings are passed through just like ascii strings.
At that point, the decode framework from this PR would process the data
when you call into DataFrame or what have you, and make whatever is needed into unicode.
Hey presto.

I did a quick test with the unicode_series.csv file from pandas/tests/data
(which uses latin-1 encoding), and got utf-8 data back like a champ. I think this
is pretty much working out of the box right now.

Here's the snippet for testing:

from pandas._parser import TextReader
import codecs

f = open("pandas/tests/data/unicode_series.csv", "rb")
f2 = codecs.EncodedFile(f, "utf-8", "latin-1")  # read in latin-1, convert to utf-8
r = TextReader(f2)
res = r.read()
s = res[1][-3].decode('utf-8')
print s
assert type(s) == unicode

Going even further: if I understand the code correctly, the fallback inferencer in parser.pyx,
_string_box_factorize, is responsible for boxing strings into python objects. It's easy enough
at that point to decode things from utf-8 byte strings into unicode objects, so that there is
no need for the clumsy (but general) approach used by this PR, with traversal over the entire
data structure, checking types and looking for strings to decode. Most fields are usually numbers
anyway, so that traversal is very inefficient. I think doing that would actually keep the massive
speedup of the new code AND have unicode() representations internally.
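
As a rough illustration of that idea (this is not the actual parser.pyx code, just a
sketch of decoding at the boxing step, with hypothetical names):

def box_and_factorize(byte_fields, encoding='utf-8'):
    # decode each distinct utf-8 field to unicode exactly once while boxing,
    # so no traversal of the finished data structure is needed afterwards
    seen = {}
    out = []
    for raw in byte_fields:
        obj = seen.get(raw)
        if obj is None:
            obj = seen[raw] = raw.decode(encoding)
        out.append(obj)
    return out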

If that's true, then the performance issues of incorporating this PR as on-by-default fall away,
because large datasets, where performance really matters, will practically always be read
from files, and if that is kept fast, everything else can afford to do things less optimally.

All very encouraging I think.

As a sidenote - I noticed that parser.c has convert_infer, which does not seem to be used anywhere
(I commented it out and everything works, grep showed no usage), and it looks like that's the
only path to using _inference_order and the c-native inferencers. It looks like all the inferencing
is actually done by the cython in parser.pyx. Please correct me if I'm wrong.

@ghost (Author) commented Oct 24, 2012

And you get to unify all the csv readers into one. Bonus.

@ghost (Author) commented Oct 24, 2012

I ran some benchmarks against the datasets from the blog: zeros.csv and matrix.csv.
It looks like the brute-force approach here gives roughly a 10x performance hit. No good.

I still need to check if converting _string_box_factorize to emit unicode yields
something more reasonable.

@ghost (Author) commented Oct 27, 2012

Filling in the memory use aspect of the issue:

In py2.x and 3.1/3.2, python might use ucs-2 or ucs-4 internally (check with
sys.maxunicode), and so use up either 2x or 4x the memory,
respectively, over pure ascii or a legacy codepage that uses one byte
per character (on my debian system, the default python uses ucs-4).

On 3.3, which implemented pep-393, the in-memory representation is
adaptive in order to save memory, and will use one byte per character
for ascii, and 2 bytes for most other codepoints you would care about
(those in the BMP).

pep-393 suggests that, like py3.3, py2.7 (earlier?) uses just 1
byte per character to represent ascii strings, though the docs do not
mention this.

Update: this is not the case. Ascii stored as unicode in 2.7 suffers from the same bloat;
it's just that on 2.7 you have the option of using str() rather than unicode, an option
removed by the 3.x branch. 3.3 does however provide the goods described in pep-393 and is
much more efficient.

If you are using utf-8 to store your non-ascii data and keeping it in
memory as an encoded byte string, the equivalent unicode
representation would most likely use either the same amount of memory
(python with ucs-2) or 2x the memory (ucs-4), since utf-8 itself uses 2 bytes
for most non-ascii characters.
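
For reference, here is a quick way to check what a given 2.x interpreter does
(a rough sketch; the exact getsizeof numbers vary with the build and interpreter version):

import sys

print sys.maxunicode               # 65535 -> ucs-2 build, 1114111 -> ucs-4 build
print sys.getsizeof('abcdefgh')    # an 8-char byte string
print sys.getsizeof(u'abcdefgh')   # the same text as unicode - noticeably larger on 2.x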

@wesm (Member) commented Oct 27, 2012

I don't think making everything unicode by default is going to work because of the memory usage (and resulting performance, in some places) issue. The question is how to indicate to the parser that it should decode utf-8 (or whichever encoding) to unicode in the string boxing function. I'm planning a better strategy for handling strings in pandas in general, so what's "under the hood" may change at some point in the future

@ghost (Author) commented Oct 28, 2012

I was surprised by how performant #2130 actually is, and I don't see the case for memory
being a problem in real life, as long as it's true that ascii stored as unicode objects has no
significant overhead (update: which turns out not to be true, so...). Having said that, I'd be
perfectly happy with opt-in unicode, as long as it's easy to switch on and the rest of pandas
then behaves as expected.

Now that c-parser has raised the bar, unicode users would be truly penalized without something
like #2130 being available for use.

I'm curious to see the next step as far as string handling is concerned; I'll leave unicode alone
till then. This PR was an interesting experiment but, ultimately, just too hideous to really merge.
Oh well.

@ghost closed this Oct 28, 2012
@ghost (Author) commented Oct 29, 2012

@wesm, 61f3405 for your cherry-picking consideration.

@wesm (Member) commented Oct 31, 2012

@y-p going to cherry-pick that one. I worked on the c-parser branch a bunch today and yesterday. It works on Python 3 and has an encoding option (so that all strings will be decoded to unicode). Unicode input (e.g. in a StringIO) will get encoded to UTF-8 when passed to the parser; I guess this could be configurable.

@wesm (Member) commented Nov 2, 2012

I have to undo 61f3405 immediately. All of a sudden Index.__getattribute__ is dogging the performance of tons of things:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.081    0.081   25.520   25.520 <string>:1(<module>)
        1    0.369    0.369   25.404   25.404 groupby.py:242(apply)
        1    0.369    0.369   25.035   25.035 groupby.py:418(_python_apply_general)
        1    0.000    0.000    9.713    9.713 groupby.py:1671(_wrap_applied_output)
        1    0.001    0.001    9.713    9.713 groupby.py:445(_concat_objects)
        1    0.000    0.000    9.693    9.693 merge.py:828(concat)
    48001    0.095    0.000    9.402    0.000 groupby.py:531(get_iterator)
    48000    0.468    0.000    8.944    0.000 frame.py:2584(take)
        1    0.014    0.014    6.346    6.346 merge.py:950(get_result)
        1    0.135    0.135    6.321    6.321 merge.py:976(_get_concatenated_data)
    96002    0.398    0.000    5.587    0.000 frame.py:331(__init__)
        5    0.489    0.098    5.585    1.117 merge.py:1052(_concat_single_item)
    48000    0.121    0.000    5.134    0.000 <string>:1(<lambda>)
    48000    0.675    0.000    5.013    0.000 frame.py:3660(shift)
3120099/1776071    2.901    0.000    4.978    0.000 index.py:308(__getattribute__)
   240000    0.245    0.000    4.158    0.000 internals.py:826(get)
        1    0.055    0.055    3.347    3.347 merge.py:888(__init__)
    96002    0.301    0.000    3.200    0.000 internals.py:484(__init__)
    48000    0.272    0.000    3.131    0.000 frame.py:477(_init_ndarray)
        1    0.000    0.000    2.929    2.929 merge.py:1093(_get_new_axes)
        1    0.029    0.029    2.883    2.883 merge.py:1129(_get_concat

@ghost (Author) commented Nov 2, 2012

I suppose that outweighs the "correctness" win. Too bad.

@wesm (Member) commented Nov 2, 2012

The Index object needs some work; I will definitely keep this in mind

This pull request was closed.