PERF: Allow str.split callers to skip expensive post-processing #35223
Conversation
The abstraction isn't used anywhere else, so let's see if eliminating the function call helps performance at all.
This can be useful when the user has already ensured that the split will result in a clean rectangle, such as a complete list of well-formed IPv4 addresses being split on ".". It allows them to skip three Python iteration passes over the resulting data.
We can't skip the whole block because, when it starts, result is a numpy array of lists, but it must still be converted to a list of lists for expand to function as expected.
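The clean-rectangle case can be illustrated with a quick sketch (the IPv4 data here is made-up example data):

```python
import pandas as pd

# Well-formed IPv4 addresses: splitting on "." always yields exactly
# four elements, so the result is already a clean rectangle and the
# padding passes have nothing to fix.
s = pd.Series(["192.168.0.1", "10.0.0.255", "172.16.31.4"])
octets = s.str.split(".", expand=True)
print(octets.shape)  # (3, 4): one column per octet, no padding needed
```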
Just need a minute to write up a whatsnew entry.
It's going to be easier to get this accepted by simply optimizing this.
# required when expand=True is explicitly specified
# not needed when inferred

def cons_row(x):
You could just write a short Cython routine which does all of this.
We already have some benchmarks for this, so it should be easy to tell whether a change is good.
@wbadart can you address the comments?
Sure, I can take a whack at it. I'm a bit of a Cython n00b, so I can't promise I'll be particularly speedy, but if you have any recommended docs or tutorials, that'd be a big help.
@wbadart is this still active?
We can close this out; I haven't had the bandwidth to tackle it and probably won't soon :/
This Stack Overflow answer highlights a performance issue that is likely related. In the example data, there are rows of comma-separated strings with a varying number of elements. `str.split` with
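The ragged case from that answer can be reproduced in a few lines (a sketch with made-up data):

```python
import pandas as pd

# Comma-separated rows with varying element counts: expand=True must
# pad the short rows so every row gets the same number of columns.
s = pd.Series(["a,b,c", "d,e", "f"])
wide = s.str.split(",", expand=True)
print(wide.shape)  # (3, 3); the short rows are padded out with missing values
```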
Hey team-

I've got some string processing in my pipeline and have noticed that `Series.str.split` is a pretty significant bottleneck. (For reference, the rest of the function calls from this `pyinstrument` run take 5-10 seconds.)

It seemed odd to me that it was spending more time post-processing the result (`_wrap_result`) than actually doing the work of splitting strings, so I looked into it. It turns out that when you use `expand=True`, `_wrap_result` makes three full passes over the data (in Python iteration land!). From what I gather, the procedure is there to make sure that each row of the resulting data frame ends up with the same number of columns. However, for my use case, I happen to know ahead of time that this will be true: my Series is full of well-formed IPv4 addresses (and from what I gather online, this is a pretty common use of `Series.str.split`), so I know each split string list will be 4 elements long (one for each octet).

This pull request offers an escape hatch for users like me who know that this expensive post-processing step is essentially a no-op, in the form of the `pad_sequences` argument. If `True`, the default, the original procedure is run. If `False`, it is skipped in favor of simply casting the result from a numpy array of lists to a list of lists, to be crammed into a DataFrame a few lines down. It's up to the caller to determine ahead of time whether the split will produce uniform-length sequences. (If `expand=False`, then `pad_sequences` has no effect.)

Looking forward to hearing your thoughts!
Many thanks
Related to PERF: increase performance of str_split when returning a frame #10090
- [ ] closes #xxxx
- [ ] tests added / passed
- [ ] passes `black pandas`
- [ ] passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- [ ] whatsnew entry
EDIT: forgot to mention, I have some performance testing results. Here's the data:
Here's the original performance:
The first optimization I tried (before adding `pad_sequences`) was simply to inline `cons_row` to save the stack frame allocation (since the function wasn't being used anywhere but the comprehension body):

Works out to a few hundred nanoseconds of time savings per row, or as you can see here, about 2.5 seconds for my 10 million row test set.
Finally, here are the results without those three passes over the result:
(Commit hash is a little different since I rebased onto upstream after running the test.)
So this is quite a bit faster now. If I have time, I'll keep digging to see if we can eliminate that `result.tolist()` call and go straight from the numpy-ish representation to a DataFrame (for now, I found that I needed the `tolist()` in order for `expand=True` to retain its behavior).
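Why the `tolist()` is still needed can be seen in a minimal repro (assuming the intermediate is an object ndarray of lists, as above):

```python
import numpy as np
import pandas as pd

# Build an object ndarray of lists, like the one str_split produces.
arr = np.empty(2, dtype=object)
arr[:] = [["192", "168", "0", "1"], ["10", "0", "0", "255"]]

as_array = pd.DataFrame(arr)           # one object column holding lists
as_lists = pd.DataFrame(arr.tolist())  # four columns, as expand=True needs
print(as_array.shape, as_lists.shape)  # (2, 1) (2, 4)
```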