
Benchmarks: Modifying c-parser to return unicode #2130


Closed
ghost opened this issue Oct 26, 2012 · 7 comments
Labels: IO, Unicode

Comments


ghost commented Oct 26, 2012

Continuing here from #2104, because this is related but independent.
I altered the c-parser branch so that parser.pyx decodes all strings into unicode objects using utf-8 (b16b24b).
Since any source encoding can be transcoded into utf-8, this is a general solution for
getting c-parser to always return unicode. Tested on py2.
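Conceptually (this is only an illustration, not the actual parser.pyx code), the change amounts to decoding each byte string the tokenizer produces with utf-8 before it is handed back:

```python
# Illustrative only: utf-8 encoded byte cells, as the C tokenizer would
# hand them back, are decoded so the resulting column holds unicode objects.
raw_cells = [b"abc", b"caf\xc3\xa9", b"na\xc3\xafve"]

decoded = [cell.decode("utf-8") for cell in raw_cells]
print(decoded)  # unicode objects on py2, str on py3
```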

Data Used:

1M-latin.csv was generated from unicode_series.csv (in tests/data) by replicating it to yield 1 million lines (roughly 32MB). Each row contains a number and a longish string, so the overhead of encoding is most pronounced in this case.
zeros.csv (10MB) and matrix.csv (150MB) are from the recent blog post benchmarking c-parser; they contain only integer/float numbers.

Code versions tested:

  • c-parser: the current c-parser branch [380f6e6].
  • c-parser-t: same, but the file is routed through codecs and transcoded into utf-8
    as part of the test (1M-latin using latin-1 -> utf-8, the other two ascii -> utf-8,
    just for uniformity); see the transcoding sketch after this list.
  • c-parser-u-from-ascii: with b16b24b, tested against utf-8/ascii files, so no transcoding
    is needed; all strings (actually none for the files tested) are decoded with utf-8 to
    yield unicode objects.
  • c-parser-u: with b16b24b, the files are transcoded, and the parser decodes all strings
    using utf-8 to yield unicode objects.
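
The transcoding step used for the c-parser-t/c-parser-u variants was along these lines (a minimal sketch; the file names are just the ones used above, and the exact test harness is not reproduced here):

```python
import codecs

# Re-encode the latin-1 source file to utf-8 before feeding it to the parser.
with codecs.open("1M-latin.csv", "r", encoding="latin-1") as src, \
        codecs.open("1M-latin-utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```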

How:
Timings were taken with IPython's %timeit, which reports the best of 3 runs.
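
The invocation looked roughly like this (a sketch, assuming the branch's reader is exposed through pd.read_csv; the exact arguments used in the benchmark are not recorded here):

```python
# In an IPython session; %timeit reports the best of 3 runs.
import pandas as pd

%timeit pd.read_csv("zeros.csv")
%timeit pd.read_csv("matrix.csv")
%timeit pd.read_csv("1M-latin.csv")
```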

Results:

              c-parser   c-parser-t   c-parser-u-from-ascii   c-parser-u
zeros.csv     717 ms     768 ms       724 ms                  777 ms
matrix.csv    2.36 sec   3.17 sec     2.37 sec                3.21 sec
1M-latin.csv  427 ms     558 ms       N/A                     570 ms

Conclusions:

  • There's a performance hit, but the result is still very respectable,
    and much better than the 10X hit in Unicode III: revenge of the character planes #2104,
    which is forced to traverse the entire dataset, checking the type of each element.
  • Most of the performance hit is due to the transcoding step, not the decoding
    into utf-8. That step is unnecessary when the data file is already encoded in
    utf-8 (and pure ascii fits into that category), so the extra work is only paid
    when it's actually needed, and performance is still very competitive even
    with large files.
  • The cost of returning unicode by default is virtually nil when transcoding isn't needed,
    even when the file contains mostly strings.

gerigk commented Oct 26, 2012

Would this behaviour be optional? I am using PyTables for storage, and PyTables does not support unicode, only utf-8 encoded strings.
PyTables/PyTables#151

The performance of encoding unicode Series to utf-8 is pretty bad and would hit reads/writes of the format I am using in a pretty annoying way.
Another question is about memory usage: are unicode objects "bigger", since they are decoded utf-8 strings (for character sets like cyrillic etc.)?


ghost commented Oct 26, 2012

That's up to wesm, who has performance very high on his priority list. I wouldn't worry.

Are you sure you'll be impacted? Are you using very large datasets, encoded with something
other than ascii/utf-8, which contain mostly strings rather than numbers, have no issues with
the current state of unicode in pandas, and are already using the c-parser branch (or else you
would get a big speed bump when cparser lands in master, even if this were integrated)?

If that's the case, it would be helpful if you could put up an example dataset of the kind of data
you care about, so that future PRs can be tested against real-world cases.

IMO, even the worst case is not that bad, but I'm sure others would disagree.

As to memory, I have no numbers right now, but I certainly would expect some overhead.
This was prompted by the disappointing performance of #2104, and so it was focused on CPU.

I know py3.1/3.2 are particularly bad when it comes to unicode overhead, with 3.3 coming back to
py2.7 levels. I'll look into it when I get a chance.
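
For a rough, per-object sense of the overhead (not a measurement from this thread), comparing a utf-8 byte string with its decoded form is straightforward:

```python
import sys

raw = u"\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")  # "привет" as utf-8 bytes
text = raw.decode("utf-8")

print(sys.getsizeof(raw))   # size of the encoded byte string
print(sys.getsizeof(text))  # size of the unicode object; varies a lot across 2.7/3.2/3.3
```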


gerigk commented Oct 26, 2012

The situation I am worried about is when I read a csv and it returns a Series of unicode objects.
I then want to pass this series to a function that requires a Series/numpy array with utf-8 encoded strings. The way I see it, I would have to call Series.str.encode('utf-8'), which is slow.
That's why I usually use plain strings in the first place (which wouldn't be possible if the csv reader returned unicode objects).
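
For concreteness, the extra step being described is just a per-element Series.str.encode call (the values here are made up):

```python
import pandas as pd

s = pd.Series([u"caf\xe9", u"na\xefve", u"abc"])  # what read_csv would return as unicode

encoded = s.str.encode("utf-8")  # the per-element encode step being avoided today
print(encoded)
```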


ghost commented Oct 27, 2012

See added comment in #2104 for notes on memory use.


ghost commented Oct 31, 2012

The c-parser branch was updated to support returning unicode optionally.

ghost closed this as completed Oct 31, 2012

wesm commented Oct 31, 2012

I'm going to reopen this so I can re-examine these benchmarks; I seriously did the bare minimum necessary to get the test suite passing, no more and no less.

wesm reopened this Oct 31, 2012

ghost commented Apr 18, 2013

The users have been quiet enough in the past 6 months; it must be working.

Closing.

ghost closed this as completed Apr 18, 2013