Unicode III : revenge of the character planes #2104
Conversation
`Index` inherits from np.ndarray, yet implements an immutable datatype. Pythonic duck-typing would suggest that the presence of __setitem__ signals a mutable datatype, while the overridden implementation just raises an exception - a bait and switch. Python does not offer a really clean way to eliminate an attribute inherited from a superclass, but overriding __getattribute__ gives us the same end result. squash with 218fe0a
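A minimal sketch of the technique the commit describes, using a toy ndarray subclass (the class name is hypothetical, not the actual pandas code; Python 2):

```python
import numpy as np

class ImmutableIndex(np.ndarray):
    # Hide the inherited mutator: explicit attribute lookup now raises
    # AttributeError, so hasattr()-style duck typing reports immutability
    # instead of the present-but-raising "bait and switch".
    def __getattribute__(self, name):
        if name == '__setitem__':
            raise AttributeError("ImmutableIndex has no attribute '__setitem__'")
        return np.ndarray.__getattribute__(self, name)

idx = np.arange(3).view(ImmutableIndex)
print hasattr(idx, '__setitem__')   # False - the attribute appears absent

# Caveat: implicit invocation (idx[0] = 1) bypasses __getattribute__,
# so a raising __setitem__ override is still needed to block subscript
# assignment itself.
```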
…re mixed. This case can cause repr_html to fail.
Now that the internal representation is unicode, we can make things more streamlined; the code here which assumes all bytestrings are utf-8 encoded can be trimmed.
Incorporate into getters, setters and constructors
I haven't looked carefully yet, but it looks like this is decoding all strings (e.g. those contained in a Series or DataFrame index) to Unicode? The cost of this in Python 2.x would be pretty extraordinary if so. Have to look some more when I have some more time.
yes, this one is a lot more aggressive than the previous two. perhaps too much. Doesn't python3 have to do just as much decoding upfront? (Although without traversal)
Probably, but most people running pandas in production, where performance (and memory use) matters, are on 2.x. Foisting unicode on them seems like an undue burden. Making it globally configurable may not be a bad idea, though; I would need to do some testing to see. Anyway, I'll have to think about what to do in the c-parser branch, because Python3 and unicode handling there is a bit of an unknown. I think the fast parser is going to have to be bytes-only for a while, until a mysterious superhero wants to do a lot of C hacking with unicode. Parsing bytes is going to give the maximum performance, and that represents the majority of real-world use cases. Perhaps only parsing unicode via the pure-Python interface makes sense for now.
Fair enough. I use pandas with modest datasets where unicode is a problem and perf. is not. IMO, for pandas to be "Unicode-safe", something like this is needed. The code actually does more work than strictly needed, because … If #2097 or something to that end is merged, I'll make sure everything can be turned off. Let me know what changes you think are needed.
Regarding …
The tokenizer is written assuming you are receiving chunks of data as bytes.
@wesm, I think this might be easier than you think. The way you've built this is very flexible.
utf-8 leaves ascii (specifically digits and punctuation) untouched, so the inferencers should still work on utf-8 data unchanged. I did a quick test with the unicode_series.csv file from pandas/tests/data; here's the snippet for testing:

```python
from pandas._parser import TextReader
import codecs

f = open("pandas/tests/data/unicode_series.csv", "rb")
f2 = codecs.EncodedFile(f, "utf-8", "latin-1")  # read in latin-1, convert to utf-8
r = TextReader(f2)
res = r.read()
s = res[1][-3].decode('utf-8')
print s
type(s) == unicode
```

Going even further: if I understand the code correctly, the fallback inferencer in … If that's true, then the performance issues of incorporating this PR as on-by-default fall away, …

All very encouraging I think.

As a side note: I noticed that parser.c has convert_infer, which does not seem to be used anywhere.
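A quick illustration of the ascii-superset property the test relies on (plain Python 2, nothing pandas-specific):

```python
# utf-8 encodes ascii characters (digits, punctuation, separators) to the
# identical bytes, so numeric inference over utf-8 bytes behaves exactly
# as it does over ascii; only non-ascii characters change representation.
print u'123,45.6'.encode('utf-8') == '123,45.6'   # True
print repr(u'caf\xe9'.encode('utf-8'))            # 'caf\xc3\xa9'
```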
And you get to unify all the csv readers into one. Bonus.
I ran some benchmarks against the datasets from the blog; still need to check if converting …
Filling in the memory-use aspect of the issue:

In py2.x and 3.1/3.2, python might use ucs-2 or ucs-4 internally (check with sys.maxunicode). On 3.3, which implemented pep-393, the in-memory representation is chosen per string. pep-393 suggests that, like py3.3, py2.7 (earlier?) uses just 1 byte per character for ascii-only strings …

update: This is not the case; ascii stored as unicode in 2.7 suffers from the same bloat. It's just …

If you are using utf-8 to store your non-ascii data and keeping it …
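To make the bloat concrete, a small measurement sketch (Python 2.7; the exact byte counts are illustrative and vary by build and platform):

```python
import sys

s = 'x' * 1000           # bytestring: ~1 byte per character plus overhead
u = unicode(s)           # unicode: 2 or 4 bytes per character
print sys.getsizeof(s)   # e.g. 1037
print sys.getsizeof(u)   # e.g. ~2050 (ucs-2 build) or ~4050 (ucs-4 build)
print sys.maxunicode     # 65535 on ucs-2 builds, 1114111 on ucs-4 builds
```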
I don't think making everything unicode by default is going to work because of the memory usage (and resulting performance, in some places) issue. The question is how to indicate to the parser that it should decode utf-8 (or whichever encoding) to unicode in the string boxing function. I'm planning a better strategy for handling strings in pandas in general, so what's "under the hood" may change at some point in the future.
I was surprised by how performant #2130 actually is, and I don't see the case for memory …

Now that c-parser has raised the bar, unicode users would be truly penalized without something …

I'm curious to see the next step as far as string handling is concerned; I'll leave unicode alone till …
@y-p going to cherry-pick that one. I worked on the c-parser branch a bunch today and yesterday. It works on Python 3 and has an encoding option (so that all strings will be decoded to unicode). Unicode input (e.g. in a StringIO) will get encoded to UTF-8 when passed to the parser; I guess this could be configured.
I have to undo 61f3405 immediately. All of a sudden …
I suppose that outweighs the "correctness" win. Too bad.
The Index object needs some work; I will definitely keep this in mind.
This is the 3rd installment in the unicode saga, and once again,
there are a lot of issues to point out. Fair warning given.
#1994 cleaned things up to use basestring and unicode() over str() where needed
#2005 consolidated and implemented unicode-friendly pretty-printing
The attempt here is to ensure that the internal representation of strings is
always a unicode object, as is the case by default in python3.
It's not finished, but I'd like to get feedback.
A. Implementation
Constructors and setters convert incoming strings into unicode objects in place; this includes keys/values of dicts (one of the many supported input structures for constructing pandas data objects). Since getters are handled symmetrically, the result should be seamless: if you set a key using an encoded bytestring, invoking a getter with the same key should get you back the data, even though internally the data is stored under the equivalent unicode key (see the sketch below).
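A minimal sketch of that symmetric normalization, assuming utf-8 as the internal encoding (the helper and class names are hypothetical; Python 2):

```python
def _to_unicode(key, encoding='utf-8'):
    # decode py2 bytestrings; pass unicode (and non-string keys) through
    if isinstance(key, str):
        return key.decode(encoding)
    return key

class UnicodeKeyedDict(dict):
    # keys are normalized both on assignment and on lookup, so encoded
    # and unicode spellings of the same key are interchangeable
    def __setitem__(self, key, value):
        dict.__setitem__(self, _to_unicode(key), value)
    def __getitem__(self, key):
        return dict.__getitem__(self, _to_unicode(key))

d = UnicodeKeyedDict()
d['caf\xc3\xa9'] = 1    # set with a utf-8 encoded bytestring key
print d[u'caf\xe9']     # 1 - stored under the equivalent unicode key
print d['caf\xc3\xa9']  # 1 - the encoded spelling round-trips too
```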
… setters. I've probably missed a few at this point.
This should be settable by the user, possibly via the mechanism in #2097 (adding a core.config module to hold package-wide configurables).
Where decoding fails and the chardet package is available, the code tries to suggest an encoding to the user based on the data.
However, chardet has enough false-positives to make automatic detection a bad idea, so its guess is only surfaced as a suggestion (see the sketch below).
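A hedged sketch of that suggest-only behaviour (the helper function is hypothetical; chardet.detect returns a dict with 'encoding' and 'confidence' keys):

```python
def suggest_encoding(data):
    # only *suggest* - chardet's false-positive rate makes silently
    # acting on its guess a bad idea
    try:
        import chardet
    except ImportError:
        return None
    return chardet.detect(data).get('encoding')

raw = '\xe9t\xe9 2012'       # latin-1 bytes, not valid utf-8
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    hint = suggest_encoding(raw)
    raise ValueError("decode failed; maybe try encoding=%r?" % hint)
```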
… which forces the use of UnicodeReader, so you either get unicode or a decode error, forcing the user to specify an encoding.
B. Benefits
… which has been going on.
… corner-cases.
… different encodings can create insoluble situations.
A related example is fmt._put_lines(), which was altered to fix #891 (DataFrame.to_html encoding) so it would handle the case of mixing pure ascii with unicode. However, it now fails when unicode is mixed with encoded byte-strings (which makes repr_html() fail, yielding sometimes-html-sometimes-text repr confusion). This PR fixes that problem amongst other things. A minimal reproduction of the failure mode follows.
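The failure mode is plain Python 2 string semantics:

```python
# joining unicode with a non-ascii bytestring triggers an implicit ascii
# decode of the bytestring, which raises
lines = [u'header', 'plain ascii', 'caf\xc3\xa9']   # last entry: utf-8 bytes
try:
    print u'\n'.join(lines)
except UnicodeDecodeError, e:
    print 'join failed:', e
```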
Getters and setters convert things to unicode transparently for the user, so the behaviour becomes closer to that of python3: enter a string at the console, and you get unicode by default (no need for a unicode literal). An illustration follows.
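Illustrative only; this shows the behaviour intended under this PR (not pandas as released), in Python 2:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Smith']})  # plain bytestring literals go in
print type(df['name'][0]) is unicode    # True under this PR - unicode comes out
print df['name'][0] == 'Smith'          # True - ascii comparison still matches
```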
C. Disadvantages
… also because an experiment converting tuples raised all sorts of mysterious problems, in cython code amongst other things.
D. Points to consider (feedback welcome)
… what the default should be, though.
… a global configurable available to the user.
The configurables will be sorted out later (Pending #2097?).