
Benchmarks: Modifying c-parser to return unicode #2130


Closed
ghost opened this issue Oct 26, 2012 · 7 comments
Labels: IO, Unicode

Comments


ghost commented Oct 26, 2012

Continuing here from #2104, because this is related but independent.
I altered the c-parser branch so that parser.pyx decodes all strings into unicode objects using utf-8 (b16b24b).
Since any source encoding can be transcoded into utf-8, this is a general solution for
getting c-parser to always return unicode. Tested on py2.
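Conceptually (this is only an illustration, not the actual parser.pyx code), the change amounts to decoding each byte string the tokenizer produces with utf-8 before it is handed back:

```python
# Illustrative only: utf-8 encoded byte cells, as the C tokenizer would
# hand them back, are decoded so the resulting column holds unicode objects.
raw_cells = [b"abc", b"caf\xc3\xa9", b"na\xc3\xafve"]

decoded = [cell.decode("utf-8") for cell in raw_cells]
print(decoded)  # unicode objects on py2, str on py3
```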

Data Used:

1M-latin.csv was generated from unicode_series.csv (in tests/data) by replicating it to yield 1 million lines (roughly 32MB). Each row contains a number and a longish string, so the overhead of encoding is most pronounced in this case.
zeros.csv (10MB) and matrix.csv (150MB) are from the recent blog post benchmarking c-parser; they contain only integer/float numbers.

Code versions tested:

  • c-parser: the current c-parser branch [380f6e6].
  • c-parser-t: same, but the file is routed through codecs and transcoded into utf-8
    as part of the test (1M-latin using latin-1 -> utf-8, the other two ascii -> utf-8,
    just for uniformity); see the transcoding sketch after this list.
  • c-parser-u-from-ascii: with b16b24b, tested against utf-8/ascii files, so no transcoding
    is needed; all strings (actually none for the files tested) are decoded with utf-8 to
    yield unicode objects.
  • c-parser-u: with b16b24b, the files are transcoded, and the parser decodes all strings
    using utf-8 to yield unicode objects.
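
The transcoding step used for the c-parser-t/c-parser-u variants was along these lines (a minimal sketch; the file names are just the ones used above, and the exact test harness is not reproduced here):

```python
import codecs

# Re-encode the latin-1 source file to utf-8 before feeding it to the parser.
with codecs.open("1M-latin.csv", "r", encoding="latin-1") as src, \
        codecs.open("1M-latin-utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```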

How:
Timings were taken with IPython's %timeit, which reports the best of 3 runs.
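
The invocation looked roughly like this (a sketch, assuming the branch's reader is exposed through pd.read_csv; the exact arguments used in the benchmark are not recorded here):

```python
# In an IPython session; %timeit reports the best of 3 runs.
import pandas as pd

%timeit pd.read_csv("zeros.csv")
%timeit pd.read_csv("matrix.csv")
%timeit pd.read_csv("1M-latin.csv")
```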

Results:

              c-parser   c-parser-t   c-parser-u-from-ascii   c-parser-u
zeros.csv     717 ms     768 ms       724 ms                  777 ms
matrix.csv    2.36 sec   3.17 sec     2.37 sec                3.21 sec
1M-latin.csv  427 ms     558 ms       N/A                     570 ms

Conclusions:

  • There's a performance hit, but the result is still very respectable,
    and much better than the 10X hit in Unicode III: revenge of the character planes #2104,
    which is forced to traverse the entire dataset, checking the type of each element.
  • Most of the performance hit is due to the transcoding step, not the decoding
    into utf-8. That step is unnecessary when the data file is already encoded in
    utf-8 (and pure ascii fits into that category), so the extra work is only paid
    when it's actually needed, and performance is still very competitive even
    with large files.
  • The cost of returning unicode by default is virtually nil when transcoding isn't needed,
    even when the file contains mostly strings.

gerigk commented Oct 26, 2012

Would this behaviour be optional? I am using PyTables for storage, and PyTables does not support unicode, only utf-8 encoded strings.
PyTables/PyTables#151

The performance of encoding unicode Series to utf-8 is pretty bad and would hit reads/writes of the format I am using in a pretty annoying way.
Another question is about memory usage: are unicode objects "bigger", since they are decoded utf-8 strings (for character sets like cyrillic etc.)?


ghost commented Oct 26, 2012

That's up to wesm, who has performance very high on his priority list. I wouldn't worry.

Are you sure you'll be impacted? Are you using very large datasets, encoded with something
other than ascii/utf-8, which contain mostly strings rather than numbers, have no issues with
the current state of unicode in pandas, and are already using the c-parser branch (or else you
would get a big speed bump when cparser lands in master, even if this were integrated)?

If that's the case, it would be helpful if you could put up an example dataset of the kind of data
you care about, so that future PRs can be tested against real-world cases.

IMO, even the worst case is not that bad, but I'm sure others would disagree.

As to memory, I have no numbers right now, but I certainly would expect some overhead.
This was prompted by the disappointing performance of #2104, and so it was focused on CPU.

I know py3.1/3.2 are particularly bad when it comes to unicode overhead, with 3.3 coming back to
py2.7 levels. I'll look into it when I get a chance.
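
For a rough, per-object sense of the overhead (not a measurement from this thread), comparing a utf-8 byte string with its decoded form is straightforward:

```python
import sys

raw = u"\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")  # "привет" as utf-8 bytes
text = raw.decode("utf-8")

print(sys.getsizeof(raw))   # size of the encoded byte string
print(sys.getsizeof(text))  # size of the unicode object; varies a lot across 2.7/3.2/3.3
```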


gerigk commented Oct 26, 2012

The situation I am worried about is when I read a csv and it returns a Series of unicode objects.
I then want to pass this series to a function that requires a Series/numpy array with utf-8 encoded strings. The way I see it, I would have to call Series.str.encode('utf-8'), which is slow.
That's why I usually use plain strings in the first place (which wouldn't be possible if the csv reader returned unicode objects).
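
For concreteness, the extra step being described is just a per-element Series.str.encode call (the values here are made up):

```python
import pandas as pd

s = pd.Series([u"caf\xe9", u"na\xefve", u"abc"])  # what read_csv would return as unicode

encoded = s.str.encode("utf-8")  # the per-element encode step being avoided today
print(encoded)
```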


ghost commented Oct 27, 2012

See added comment in #2104 for notes on memory use.


ghost commented Oct 31, 2012

The c-parser branch was updated to support returning unicode optionally.

ghost closed this as completed Oct 31, 2012

wesm commented Oct 31, 2012

I'm going to reopen this so I can re-examine these benchmarks; I seriously did the bare minimum necessary to get the test suite passing, no more and no less.

wesm reopened this Oct 31, 2012

ghost commented Apr 18, 2013

The users have been quiet enough in the past 6 months; it must be working.

Closing.

ghost closed this as completed Apr 18, 2013