-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: increase performance of str_split when returning a frame #10090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hmm... this actually causes significant problems when dealing with NaNs. |
@cgevans this code path has changed in master a bit (though issue still remains). |
@cgevans can you rebase and see where this is? |
The change in the code path means this particular fix would need to be rewritten, but it seems this would need further changes to avoid changing behavior. I've written a bit about the problem in #10081. One solution is to change the DataFrame constructor such that it handles lists of lists differently, making it consistent with lists of Series. The other, somewhat hackish method is to take the array that comes from str_split, change all NaNs to [None], construct a DataFrame, and then change all Nones to NaNs. |
This new version now tries to handle NaNs in a bit of a roundabout way, but still in a manner that will be fast. |
a040940
to
4a2ac33
Compare
cons = self.series._constructor_expanddim | ||
data = [cons_row(x) for x in result] | ||
return cons(data, index=index) | ||
data = [x if (x is not np.nan) else [None] for x in result] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just redefine cons_row
as a function. e.g. something like
def cons_row(x):
if np.isnan(x):
return [ x ]
return x
you don't need to fill
@@ -1090,7 +1090,11 @@ def _wrap_result_expand(self, result, expand=False): | |||
else: | |||
index = self.series.index | |||
if expand: | |||
cons_row = self.series._constructor | |||
def cons_row(x): | |||
if x is np.nan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use np.isnan(x)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially tried that, but np.isnan
can't handle strings. x
here is either NaN or a list of strings. I also only want that conditional to be true if x
is a single null value. I agree that is np.nan
isn't ideal: what would be the better way of doing this?
pls add a release note. pls show the results of the benchmark for this (there is a join_split one, but you may have to add one) |
you can do (you might want to see which is better perf): I think this can ONLY? return lists or scalars?
|
pls also add a note to 0.16.2 performance imprv section. |
729311e
to
70038ac
Compare
Here is the benchmark result:
|
@@ -47,6 +47,7 @@ Performance Improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`) | |||
- Increase performance of string split when expand=True (:issue:`10081`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this str.split
when expand=True
minor whatsnew note change. pls ping when revised and will merge. |
Done. |
PERF: increase performance of str_split when returning a frame
excellent ty! |
Closes #10081.
This simply removes the unnecessary
Series()
in the creation of the DataFrame for the split values instr_split
. It vastly improves performance (20ms vs 3s for a series of 20,000 strings as a test) while not changing current behavior.