Benchmarks: Modifying c-parser to return unicode #2130
Comments
Would this behaviour be optional? I am using PyTables for storage, and PyTables does not support unicode, only utf-8 encoded strings. The performance of encoding a unicode Series to utf-8 is pretty bad and would hit reads/writes of the format I am using in a pretty annoying way.
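For concreteness, a rough sketch of the encoding step being discussed here (the Series contents and the use of `map` are illustrative, not taken from this issue):

```python
import pandas as pd

# A Series of unicode objects, as the proposed parser change would return them.
s = pd.Series([u"caf\xe9, r\xe9sum\xe9"] * 10**6)

# PyTables-backed storage wants utf-8 encoded byte strings, so every value has
# to be re-encoded before a write; this per-element pass is the cost in question.
encoded = s.map(lambda v: v.encode("utf-8"))
```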
That's up to wesm, who has performance very high on his priority list. I wouldn't worry. Are you sure you'll be impacted? Are you using very large datasets, encoded with something other than ascii? If that's the case, it would be helpful if you could put up an example dataset of the kind of data you're working with.

IMO, even the worst case is not that bad, but I'm sure others would disagree. As to memory, I have no numbers right now, but I certainly would expect some overhead. I know py3.1/3.2 are particularly bad when it comes to unicode overhead, with 3.3 bringing it back down.
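As a side note on the memory point: Python 3.3's PEP 393 made mostly-ascii unicode strings much cheaper than on 3.1/3.2. A quick way to see the per-string footprint on whatever interpreter you're running (numbers vary by version and build):

```python
import sys

ascii_text = u"x" * 100
latin_text = u"\xe9" * 100

# Size in bytes of each string object. On 3.3+ the ascii string is stored at
# one byte per character, while earlier 3.x builds pay 2 or 4 bytes per
# character regardless of content.
print(sys.getsizeof(ascii_text))
print(sys.getsizeof(latin_text))
```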
The situation I am worried about is when I read a csv and it returns a Series of unicode objects.
The c-parser branch was updated to support returning unicode optionally.
I'm going to reopen this so I can re-examine these benchmarks. I seriously did the bare minimum necessary to get the test suite passing, no more, no less.
The users have been quiet enough in the past 6 months, so it must be working. Closing.
Continuing here from #2104, because this is related but independent.
I altered c-parser so that parser.pyx decodes all strings into unicode objects using utf-8 (b16b24b). Since any source encoding can be transcoded into utf-8, this is a general solution for getting c-parser to always return unicode. Tested on py2.
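In plain Python terms the change amounts to something like this (a sketch only; the real code lives in the Cython parser, and `box_token` is a made-up name):

```python
def box_token(raw_bytes):
    # Before this branch the parser handed back the raw byte string unchanged.
    # Now every token is decoded from utf-8, so callers always see unicode
    # objects, provided the input has been transcoded to utf-8 beforehand.
    return raw_bytes.decode("utf-8")
```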
Data Used:
- `1M-latin.csv` was generated from `unicode_series.csv` (in tests/data) by replicating it to yield 1 million lines (roughly 32MB); each row contains a number and a longish string, so the overhead of encoding is most pronounced in this case (a sketch of the replication is below).
- `zeros.csv` (10MB) and `matrix.csv` (150MB) are from the recent blog post benchmarking c-parser; they contain just integer/float numbers.
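The replication can be done with a few lines like the following (a sketch; the exact script used isn't part of the issue):

```python
# Build 1M-latin.csv by repeating the small seed file until it reaches
# roughly one million lines. Reading/writing in binary keeps the original
# latin-1 bytes untouched.
with open("tests/data/unicode_series.csv", "rb") as f:
    seed = f.read()

lines_per_copy = seed.count(b"\n")
copies = 10**6 // lines_per_copy + 1

with open("1M-latin.csv", "wb") as out:
    for _ in range(copies):
        out.write(seed)
```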
Code versions tested:
- the current c-parser, with the files read via `codecs` and encoded into utf-8 as part of the test (1M-latin using latin-1->utf-8, the other two ascii->utf-8, just for uniformity).
- the modified c-parser, which decodes everything from utf-8 to yield unicode objects.
How:
using IPython's %timeit, which gave the best of 3 runs.
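Each measurement boiled down to a call like the ones below (read_csv defaults assumed; the exact options used aren't recorded here):

```
In [1]: import pandas as pd

In [2]: %timeit pd.read_csv('1M-latin.csv')

In [3]: %timeit pd.read_csv('zeros.csv')

In [4]: %timeit pd.read_csv('matrix.csv')
```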
Results:
Conclusions:
- The performance hit is much better than the 10X hit in "Unicode III: revenge of the character planes" (#2104), whose approach is forced to traverse the entire dataset, checking the type of each element.
- Most of the cost comes from the decoding into utf-8. Since that step is unnecessary when the data file is already encoded in utf-8 (and pure ascii fits into that category; see the snippet below), it could be skipped in that case.
- That extra work is somewhat justified, and performance is still very competitive even with large files, and even when the file contains mostly strings.
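Why pure ascii "fits into that category": ascii bytes are already valid utf-8, so decoding them changes nothing about the content, it only adds work. A tiny check:

```python
raw = b"12345,hello"  # a typical all-ascii csv fragment
assert raw.decode("utf-8") == raw.decode("ascii") == u"12345,hello"
```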