PERF: Allow str.split callers to skip expensive post-processing #35223


Closed
wants to merge 6 commits

Conversation


@wbadart commented Jul 10, 2020

Hey team-

I've got some string processing in my pipeline and have noticed that Series.str.split is a pretty significant bottleneck:

[Screenshot: pyinstrument profile from 2020-07-09 showing Series.str.split dominating the run]

(For reference, the rest of the function calls from this pyinstrument run take 5-10 seconds.)

It seemed odd to me that it was spending more time post-processing the result (_wrap_result) than actually doing the work of splitting strings, so I looked into it. It turns out that when you use expand=True, _wrap_result makes three full passes over the data (in Python iteration land!). From what I gather, the procedure is to make sure that each row of the resulting data frame ends up with the same number of columns. However, for my use case, I happen to know ahead of time that this will be true; my Series is full of well-formed IPv4 addresses (and from what I gather online, this is a pretty common use of Series.str.split), so I know each split string list will be 4 elements long (one for each octet).
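To make that concrete, here's a minimal sketch of the observable behavior (not pandas' actual _wrap_result code): the post-processing pads every row of split results out to the length of the longest row.

import pandas as pd

# Sketch of what expand=True's post-processing effectively does:
# find the widest row, then pad every shorter row with None.
s = pd.Series(["a", "a,b", "a,b,c"])
parts = s.str.split(",")                       # Series of lists, ragged lengths

max_len = max(len(row) for row in parts)       # one pass to find the max width
padded = [row + [None] * (max_len - len(row))  # another pass to pad each row
          for row in parts]
print(pd.DataFrame(padded, index=s.index))
#    0     1     2
# 0  a  None  None
# 1  a     b  None
# 2  a     b     c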

This pull request offers an escape hatch for users like me who know that this expensive post-processing step is essentially a no-op, in the form of the pad_sequences argument. If True, the default, then the original procedure will be run. If False, it will be skipped in favor of simply casting the result from a numpy array of lists to a list of lists, to be crammed into a DataFrame a few lines down. It's up to the caller to determine ahead of time if the split will produce uniform-length sequences. (If expand=False, then pad_sequences has no effect.)
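For example, under this proposal (hypothetical, since this PR was never merged; ips is an assumed Series of well-formed IPv4 strings):

# Hypothetical usage of the proposed flag; `ips` is assumed to hold only
# well-formed IPv4 addresses, so every split yields exactly 4 elements
# and the padding pass would be a no-op anyway.
octets = ips.str.split(".", expand=True, pad_sequences=False)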

Looking forward to hearing your thoughts!

Many thanks



EDIT: forgot to mention, I have some performance testing results. Here's the data:

$ wc -l test.csv
10663901 test.csv

$ du -h test.csv
928M	test.csv

Here's the original performance:

In [1]: pd.__version__
Out[1]: '1.1.0.dev0+2067.g2c3edaaaa'

In [2]: %%timeit
   ...: df.ip.str.split(".", expand=True)
   ...:
   ...:
26.7 s ± 2.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first optimization I tried (before adding pad_sequences) was simply to inline cons_row to save the stack frame allocation (since the function wasn't being used anywhere but the comprehension body):

In [1]: pd.__version__
Out[1]: '1.1.0.dev0+2068.g15623dbe7'

In [2]: %%timeit
   ...: df.ip.str.split(".", expand=True)
   ...:
   ...:
24 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Works out to a few hundred nanoseconds of time savings per row, or as you can see here, about 2.5 seconds for my 10 million row test set.

Finally, here are the results without those three passes over the result:

In [2]: pd.__version__
Out[2]: '1.1.0.dev0+2073.g23f88eedb'

In [5]: %%timeit
   ...: df.ip.str.split(".", expand=True, pad_sequences=False)
   ...:
   ...:
15.5 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(Commit hash is a little different since I rebased onto upstream after running the test.)

So this is quite a bit faster now. If I have time, I'll keep digging to see if we can eliminate that result.tolist() call and go straight from the numpy-ish representation to a DataFrame (for now, I found that I needed the tolist() in order for expand=True to retain its behavior).
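For anyone curious why the tolist() matters, here's a minimal sketch (illustrative only, not pandas internals): a 1-D object array of lists becomes a single column of list objects, while a plain list of lists expands into one column per element.

import numpy as np
import pandas as pd

# Build a 1-D object array whose two elements are lists.
arr = np.empty(2, dtype=object)
arr[0] = ["1", "2", "3", "4"]
arr[1] = ["5", "6", "7", "8"]

pd.DataFrame(arr)           # 1 column; each cell holds an entire list
pd.DataFrame(arr.tolist())  # 4 columns, which is what expand=True needs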

wbadart added 6 commits July 10, 2020 16:28

- The abstraction isn't used anywhere else, so let's see if eliminating the function call helps performance at all.
- This can be useful when the user has already ensured that the split will result in a clean rectangle, such as a complete list of well-formed IPv4 addresses being split on ".". It allows them to skip three Python iteration passes over the resulting data.
- We can't skip the whole block because, when it starts, result is a numpy array of lists, but it must still be converted to a list of lists for expand to function as expected.
@wbadart (author) commented Jul 10, 2020

Just need a minute to write up a whatsnew entry

@jreback (contributor) left a comment

it's going to be easier to get this accepted if you simply optimize this.

# required when expand=True is explicitly specified
# not needed when inferred

def cons_row(x):

@jreback commented on this:

you could just write a short cython routine which does all of this
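For reference, here's a rough pure-Python sketch of the kind of fused routine being suggested (the name split_and_pad is made up; the actual suggestion is to write it in Cython so the loop runs at C speed):

def split_and_pad(values, sep):
    # Hypothetical sketch: split each string and track the max width in a
    # single pass over the data, then pad rows in place. A Cython version
    # would type the loop variables and avoid Python-level iteration.
    rows = []
    width = 0
    for val in values:
        parts = val.split(sep)
        if len(parts) > width:
            width = len(parts)
        rows.append(parts)
    for parts in rows:
        parts.extend([None] * (width - len(parts)))
    return rows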

@jreback added the Performance (memory or execution speed) and Strings (string extension data type and string data) labels Jul 11, 2020
@jreback (contributor) commented Jul 11, 2020

we already have some benchmarks for this, so it should be easy to tell whether this is a good change.

@simonjayhawkins (member) commented

@wbadart can you address comments?

@wbadart (author) commented Jul 27, 2020

Sure, I can take a whack at it. I'm a bit of a cython n00b, so I can't promise I'll be particularly speedy, but if you have any recommended docs or tutorials, that'd be a big help.

@WillAyd (member) commented Sep 30, 2020

@wbadart is this still active?

@wbadart (author) commented Sep 30, 2020

We can close this out; I haven't had the bandwidth to tackle it and probably won't soon :/

@wbadart closed this Sep 30, 2020
@w-m commented Aug 30, 2021

This Stack Overflow answer highlights a performance issue that is likely related. In the example data, rows contain comma-separated strings with a varying number of elements. str.split with expand=True is about 2x slower than str.split with expand=False followed by to_list() and building a new DataFrame from the result:

df = pd.DataFrame(["a", "a,b", "a,b,c"] * 100000, columns=["steps"])

%timeit res = df.steps.str.split(",", expand=True)
249 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res2 = pd.DataFrame(df.steps.str.split(",").to_list())
128 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

