PERF: Allow str.split callers to skip expensive post-processing #35223


Closed
wants to merge 6 commits

Conversation


@wbadart commented Jul 10, 2020

Hey team-

I've got some string processing in my pipeline and have noticed that Series.str.split is a pretty significant bottleneck:

[Screenshot: pyinstrument profile from 2020-07-09 showing Series.str.split dominating the run]

(For reference, the rest of the function calls from this pyinstrument run take 5-10 seconds.)

It seemed odd to me that it was spending more time post-processing the result (_wrap_result) than actually doing the work of splitting strings, so I looked into it. It turns out that when you use expand=True, _wrap_result makes three full passes over the data (in Python iteration land!). From what I gather, the procedure is to make sure that each row of the resulting data frame ends up with the same number of columns. However, for my use case, I happen to know ahead of time that this will be true; my Series is full of well-formed IPv4 addresses (and from what I gather online, this is a pretty common use of Series.str.split), so I know each split string list will be 4 elements long (one for each octet).
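To make that concrete, here's a minimal sketch of the observable behavior (not pandas' actual _wrap_result code): the post-processing pads every row of split results out to the length of the longest row.

import pandas as pd

# Sketch of what expand=True's post-processing effectively does:
# find the widest row, then pad every shorter row with None.
s = pd.Series(["a", "a,b", "a,b,c"])
parts = s.str.split(",")                       # Series of lists, ragged lengths

max_len = max(len(row) for row in parts)       # one pass to find the max width
padded = [row + [None] * (max_len - len(row))  # another pass to pad each row
          for row in parts]
print(pd.DataFrame(padded, index=s.index))
#    0     1     2
# 0  a  None  None
# 1  a     b  None
# 2  a     b     c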

This pull request offers an escape hatch for users like me who know that this expensive post-processing step is essentially a no-op, in the form of the pad_sequences argument. If True, the default, then the original procedure will be run. If False, it will be skipped in favor of simply casting the result from a numpy array of lists to a list of lists, to be crammed into a DataFrame a few lines down. It's up to the caller to determine ahead of time if the split will produce uniform-length sequences. (If expand=False, then pad_sequences has no effect.)
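For example, under this proposal (hypothetical, since this PR was never merged; ips is an assumed Series of well-formed IPv4 strings):

# Hypothetical usage of the proposed flag; `ips` is assumed to hold only
# well-formed IPv4 addresses, so every split yields exactly 4 elements
# and the padding pass would be a no-op anyway.
octets = ips.str.split(".", expand=True, pad_sequences=False)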

Looking forward to hearing your thoughts!

Many thanks



EDIT: forgot to mention, I have some performance testing results. Here's the data:

$ wc -l test.csv
10663901 test.csv

$ du -h test.csv
928M	test.csv

Here's the original performance:

In [1]: pd.__version__
Out[1]: '1.1.0.dev0+2067.g2c3edaaaa'

In [2]: %%timeit
   ...: df.ip.str.split(".", expand=True)
   ...:
   ...:
26.7 s ± 2.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first optimization I tried (before adding pad_sequences) was simply to inline cons_row to save the stack frame allocation (since the function wasn't being used anywhere but the comprehension body):

In [1]: pd.__version__
Out[1]: '1.1.0.dev0+2068.g15623dbe7'

In [2]: %%timeit
   ...: df.ip.str.split(".", expand=True)
   ...:
   ...:
24 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Works out to a few hundred nanoseconds of time savings per row, or as you can see here, about 2.5 seconds for my 10 million row test set.

Finally, here are the results without those three passes over the result:

In [2]: pd.__version__
Out[2]: '1.1.0.dev0+2073.g23f88eedb'

In [5]: %%timeit
   ...: df.ip.str.split(".", expand=True, pad_sequences=False)
   ...:
   ...:
15.5 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(Commit hash is a little different since I rebased onto upstream after running the test.)

So this is quite a bit faster now. If I have time, I'll keep digging to see if we can eliminate that result.tolist() call and go straight from the numpy-ish representation to a DataFrame (for now, I found that I needed the tolist() in order for expand=True to retain its behavior).
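For anyone curious why the tolist() matters, here's a minimal sketch (illustrative only, not pandas internals): a 1-D object array of lists becomes a single column of list objects, while a plain list of lists expands into one column per element.

import numpy as np
import pandas as pd

# Build a 1-D object array whose two elements are lists.
arr = np.empty(2, dtype=object)
arr[0] = ["1", "2", "3", "4"]
arr[1] = ["5", "6", "7", "8"]

pd.DataFrame(arr)           # 1 column; each cell holds an entire list
pd.DataFrame(arr.tolist())  # 4 columns, which is what expand=True needs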

wbadart added 6 commits July 10, 2020 16:28

- The abstraction isn't used anywhere else, so let's see if eliminating the function call helps performance at all.
- This can be useful when the user has already ensured that the split will result in a clean rectangle, such as a complete list of well-formed IPv4 addresses being split on ".". It allows them to skip three Python iteration passes over the resulting data.
- We can't skip the whole block because, when it starts, result is a numpy array of lists, but it must still be converted to a list of lists for expand to function as expected.
@wbadart (author) commented Jul 10, 2020

Just need a minute to write up a whatsnew entry

@jreback (contributor) left a comment

it's going to be easier to get this accepted if you simply optimize this.

# required when expand=True is explicitly specified
# not needed when inferred

def cons_row(x):

@jreback commented on this:

you could just write a short cython routine which does all of this
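For reference, here's a rough pure-Python sketch of the kind of fused routine being suggested (the name split_and_pad is made up; the actual suggestion is to write it in Cython so the loop runs at C speed):

def split_and_pad(values, sep):
    # Hypothetical sketch: split each string and track the max width in a
    # single pass over the data, then pad rows in place. A Cython version
    # would type the loop variables and avoid Python-level iteration.
    rows = []
    width = 0
    for val in values:
        parts = val.split(sep)
        if len(parts) > width:
            width = len(parts)
        rows.append(parts)
    for parts in rows:
        parts.extend([None] * (width - len(parts)))
    return rows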

@jreback added the Performance (memory or execution speed) and Strings (string extension data type and string data) labels Jul 11, 2020
@jreback (contributor) commented Jul 11, 2020

we already have some benchmarks for this, so it should be easy to tell whether this is a good change.

@simonjayhawkins (member) commented

@wbadart can you address comments?

@wbadart (author) commented Jul 27, 2020

Sure, I can take a whack at it. I'm a bit of a cython n00b, so I can't promise I'll be particularly speedy, but if you have any recommended docs or tutorials, that'd be a big help.

@WillAyd (member) commented Sep 30, 2020

@wbadart is this still active?

@wbadart (author) commented Sep 30, 2020

We can close this out; I haven't had the bandwidth to tackle it and probably won't soon :/

@wbadart closed this Sep 30, 2020
@w-m commented Aug 30, 2021

This Stack Overflow answer highlights a performance issue that is likely related. In the example data, rows contain comma-separated strings with a varying number of elements. str.split with expand=True is about 2x slower than str.split with expand=False followed by to_list() and building a new DataFrame from the result:

df = pd.DataFrame(["a", "a,b", "a,b,c"] * 100000, columns=["steps"])

%timeit res = df.steps.str.split(",", expand=True)
249 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res2 = pd.DataFrame(df.steps.str.split(",").to_list())
128 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

