
PERF: Improved performance for .str.encode/decode #13008


Closed · wants to merge 1 commit

Conversation

@Winand (Contributor) commented Apr 27, 2016

I need a patch like this to read huge SAS tables encoded in cp1251. I'm not experienced enough to judge whether such a patch is really needed here, but it does give a noticeable speedup in certain situations.

Optimize string encoding/decoding; leave the default implementation in place for CPython-optimized encodings
(see https://docs.python.org/3.4/library/codecs.html#standard-encodings).
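
A minimal sketch of the idea (not the merged code): for encodings that CPython implements with optimized C codecs, plain str.encode/bytes.decode is already fast, so the per-element lambda stays; for everything else, looking the codec up once via codecs.getencoder/codecs.getdecoder avoids a codec-registry lookup per element. The tuple contents and helper name below are illustrative assumptions.

import codecs
import pandas as pd

# Assumed list of CPython-optimized encodings (see the link above);
# the exact tuple in the patch may differ.
_CPYTHON_OPTIMIZED = ("utf-8", "utf8", "latin-1", "latin1",
                      "iso-8859-1", "mbcs", "ascii")

def encode_series(s, encoding, errors="strict"):
    # Hypothetical helper mirroring the fast-path idea.
    if encoding in _CPYTHON_OPTIMIZED:
        f = lambda x: x.encode(encoding, errors)  # CPython fast path
    else:
        encoder = codecs.getencoder(encoding)     # one registry lookup
        f = lambda x: encoder(x, errors)[0]       # encoder returns (bytes, length)
    return s.map(f)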

String benchmark:

import pandas as pd
s1 = pd.Series(pd.util.testing.makeStringIndex(k=100000)).astype('category')
# 'mbcs' is Windows-only; the list mixes CPython-optimized encodings with others
encs = 'utf-8', 'utf-16', 'utf-32', 'latin1', 'iso-8859-1', 'mbcs', 'ascii', 'cp1251', 'cp1252'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

Unicode benchmark:

import pandas as pd
s1 = pd.Series(pd.util.testing.makeUnicodeIndex(k=100000)).astype('category')
encs = 'utf-8', 'utf-16', 'utf-32'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

[benchmark screenshot: timings reported as "10 loops, best of 3: xxx ms per loop"]

@jreback jreback added Performance Memory or execution speed performance Strings String extension data type and string data labels Apr 27, 2016
@@ -1182,7 +1183,13 @@ def str_decode(arr, encoding, errors="strict"):
-------
decoded : Series/Index of objects
"""
f = lambda x: x.decode(encoding, errors)
if encoding in ("utf-8", "utf8", "latin-1", "latin1",
Review comment (Contributor):

define these at the top of the file:

_cpython_optimized_encoding = .....

@jreback (Contributor) commented Apr 27, 2016

@Winand in your example you miss the point of categorical encoding:

In [2]: pd.util.testing.makeUnicodeIndex(k=100000).nunique()
Out[2]: 100000

Categoricals in general make sense when k << n; there is a slight overhead when k == n.

Have you tried read_sas?
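
A quick sketch of that point with made-up data: when k distinct values repeat across n rows, a categorical stores each string once plus small integer codes, so there are real savings; when every value is unique there is nothing to share, only overhead.

import pandas as pd

s = pd.Series(["cp1251", "utf-8", "ascii"] * 100000)   # k=3, n=300,000
print(s.memory_usage(deep=True))                       # object dtype: one Python str per row
print(s.astype("category").memory_usage(deep=True))    # small integer codes + 3 shared strings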

@Winand (Contributor, Author) commented Apr 27, 2016

import pandas as pd
s1 = pd.Series(pd.util.testing.makeStringIndex(k=1000))
s1 = pd.concat([s1]*100).astype('category')  # k=1,000 distinct values, n=100,000 rows
encs = 'utf-8', 'utf-16', 'utf-32', 'latin1', 'iso-8859-1', 'mbcs', 'ascii', 'cp1251', 'cp1252'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

[benchmark results screenshot]

@jreback I tried read_sas from 0.18.0; it was more than twice as slow as Jared Hobbs' (slightly improved) sas7bdat module. A lot of the time is spent decoding cp1251.
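
For reference, the pattern being timed looks roughly like this (file and column names are hypothetical; read_sas also accepts an encoding argument, so decoding can happen inside the reader):

import pandas as pd

# Decode while reading:
df = pd.read_sas("big_table.sas7bdat", encoding="cp1251")
# Or read raw bytes and decode afterwards; this .str.decode call is
# the path this PR speeds up for non-optimized codecs like cp1251:
raw = pd.read_sas("big_table.sas7bdat")
text = raw["some_column"].str.decode("cp1251")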

@jreback (Contributor) commented Apr 27, 2016

Try with master. It's now 4x faster.

@Winand (Contributor, Author) commented Apr 28, 2016

@jreback I've tried :) It's 2x faster on my test table, which is still an amazing improvement.
[benchmark results screenshot]

@jreback (Contributor) commented Apr 28, 2016

cc @kshedden

OK, seems reasonable.

@jreback jreback added this to the 0.18.1 milestone Apr 28, 2016
@jreback jreback closed this in 15cc6e2 Apr 28, 2016
@jreback (Contributor) commented Apr 28, 2016

thanks!

@Winand Winand deleted the encode_decode branch May 1, 2016 18:17