
PERF: Improved performance for .str.encode/decode #13008


Closed · wants to merge 1 commit

Conversation

@Winand (Contributor) commented Apr 27, 2016

I need a patch like this to read huge SAS tables encoded in cp1251. I'm not experienced enough to judge whether such a patch is really needed here, but it does give a noticeable speedup in certain situations.

Optimize string encoding/decoding; leave the default implementation in place for CPython-optimized encodings
(see https://docs.python.org/3.4/library/codecs.html#standard-encodings).
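
A minimal sketch of the idea (not the merged code): for encodings that CPython implements with optimized C codecs, plain str.encode/bytes.decode is already fast, so the per-element lambda stays; for everything else, looking the codec up once via codecs.getencoder/codecs.getdecoder avoids a codec-registry lookup per element. The tuple contents and helper name below are illustrative assumptions.

import codecs
import pandas as pd

# Assumed list of CPython-optimized encodings (see the link above);
# the exact tuple in the patch may differ.
_CPYTHON_OPTIMIZED = ("utf-8", "utf8", "latin-1", "latin1",
                      "iso-8859-1", "mbcs", "ascii")

def encode_series(s, encoding, errors="strict"):
    # Hypothetical helper mirroring the fast-path idea.
    if encoding in _CPYTHON_OPTIMIZED:
        f = lambda x: x.encode(encoding, errors)  # CPython fast path
    else:
        encoder = codecs.getencoder(encoding)     # one registry lookup
        f = lambda x: encoder(x, errors)[0]       # encoder returns (bytes, length)
    return s.map(f)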

String benchmark:

import pandas as pd
s1 = pd.Series(pd.util.testing.makeStringIndex(k=100000)).astype('category')
# 'mbcs' is Windows-only; the list mixes CPython-optimized encodings with others
encs = 'utf-8', 'utf-16', 'utf-32', 'latin1', 'iso-8859-1', 'mbcs', 'ascii', 'cp1251', 'cp1252'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

Unicode benchmark:

import pandas as pd
s1 = pd.Series(pd.util.testing.makeUnicodeIndex(k=100000)).astype('category')
encs = 'utf-8', 'utf-16', 'utf-32'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

[benchmark screenshot: timings reported as "10 loops, best of 3: xxx ms per loop"]

@jreback jreback added Performance Memory or execution speed performance Strings String extension data type and string data labels Apr 27, 2016
@@ -1182,7 +1183,13 @@ def str_decode(arr, encoding, errors="strict"):
-------
decoded : Series/Index of objects
"""
f = lambda x: x.decode(encoding, errors)
if encoding in ("utf-8", "utf8", "latin-1", "latin1",
Review comment (Contributor):

define these at the top of the file:

_cpython_optimized_encoding = .....

@jreback (Contributor) commented Apr 27, 2016

@Winand in your example you miss the point of categorical encoding:

In [2]: pd.util.testing.makeUnicodeIndex(k=100000).nunique()
Out[2]: 100000

Categoricals in general make sense when k << n; there is a slight overhead when k == n.

Have you tried read_sas?
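
A quick sketch of that point with made-up data: when k distinct values repeat across n rows, a categorical stores each string once plus small integer codes, so there are real savings; when every value is unique there is nothing to share, only overhead.

import pandas as pd

s = pd.Series(["cp1251", "utf-8", "ascii"] * 100000)   # k=3, n=300,000
print(s.memory_usage(deep=True))                       # object dtype: one Python str per row
print(s.astype("category").memory_usage(deep=True))    # small integer codes + 3 shared strings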

@Winand (Contributor, Author) commented Apr 27, 2016

import pandas as pd
s1 = pd.Series(pd.util.testing.makeStringIndex(k=1000))
s1 = pd.concat([s1]*100).astype('category')  # k=1,000 distinct values, n=100,000 rows
encs = 'utf-8', 'utf-16', 'utf-32', 'latin1', 'iso-8859-1', 'mbcs', 'ascii', 'cp1251', 'cp1252'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

[benchmark results screenshot]

@jreback I tried read_sas from 0.18.0; it was more than twice as slow as Jared Hobbs' (slightly improved) sas7bdat module. A lot of the time is spent decoding cp1251.
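
For reference, the pattern being timed looks roughly like this (file and column names are hypothetical; read_sas also accepts an encoding argument, so decoding can happen inside the reader):

import pandas as pd

# Decode while reading:
df = pd.read_sas("big_table.sas7bdat", encoding="cp1251")
# Or read raw bytes and decode afterwards; this .str.decode call is
# the path this PR speeds up for non-optimized codecs like cp1251:
raw = pd.read_sas("big_table.sas7bdat")
text = raw["some_column"].str.decode("cp1251")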

@jreback (Contributor) commented Apr 27, 2016

Try with master. It's now 4x faster.

@Winand (Contributor, Author) commented Apr 28, 2016

@jreback I've tried :) It's 2x faster on my test table, which is still an amazing improvement.
[benchmark results screenshot]

@jreback (Contributor) commented Apr 28, 2016

cc @kshedden

OK, seems reasonable.

@jreback jreback added this to the 0.18.1 milestone Apr 28, 2016
@jreback jreback closed this in 15cc6e2 Apr 28, 2016
@jreback (Contributor) commented Apr 28, 2016

thanks!

@Winand Winand deleted the encode_decode branch May 1, 2016 18:17