Skip to content

PERF: increase performance of str_split when returning a frame #10090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 5, 2015

Conversation

cgevans
Copy link
Contributor

@cgevans cgevans commented May 9, 2015

Closes #10081.

This simply removes the unnecessary Series() in the creation of the DataFrame for the split values in str_split. It vastly improves performance (20ms vs 3s for a series of 20,000 strings as a test) while not changing current behavior.

@cgevans
Copy link
Contributor Author

cgevans commented May 9, 2015

Hmm... this actually causes significant problems when dealing with NaNs.

@jreback jreback added Performance Memory or execution speed performance Strings String extension data type and string data labels May 9, 2015
@jreback jreback added this to the Next Major Release milestone May 9, 2015
@jreback
Copy link
Contributor

jreback commented May 9, 2015

@cgevans this code path has changed in master a bit (though issue still remains).

@jreback
Copy link
Contributor

jreback commented May 21, 2015

@cgevans can you rebase and see where this is?

@cgevans
Copy link
Contributor Author

cgevans commented May 24, 2015

The change in the code path means this particular fix would need to be rewritten, but it seems this would need further changes to avoid changing behavior. I've written a bit about the problem in #10081.

One solution is to change the DataFrame constructor such that it handles lists of lists differently, making it consistent with lists of Series.

The other, somewhat hackish method is to take the array that comes from str_split, change all NaNs to [None], construct a DataFrame, and then change all Nones to NaNs.

@cgevans
Copy link
Contributor Author

cgevans commented May 25, 2015

This new version now tries to handle NaNs in a bit of a roundabout way, but still in a manner that will be fast.

@cgevans cgevans force-pushed the fastsplit branch 2 times, most recently from a040940 to 4a2ac33 Compare May 25, 2015 12:51
cons = self.series._constructor_expanddim
data = [cons_row(x) for x in result]
return cons(data, index=index)
data = [x if (x is not np.nan) else [None] for x in result]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just redefine cons_row as a function. e.g. something like

def cons_row(x):
    if np.isnan(x):
         return [ x ]
    return x

you don't need to fill

@@ -1090,7 +1090,11 @@ def _wrap_result_expand(self, result, expand=False):
else:
index = self.series.index
if expand:
cons_row = self.series._constructor
def cons_row(x):
if x is np.nan:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use np.isnan(x)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially tried that, but np.isnan can't handle strings. x here is either NaN or a list of strings. I also only want that conditional to be true if x is a single null value. I agree that is np.nan isn't ideal: what would be the better way of doing this?

@jreback
Copy link
Contributor

jreback commented Jun 2, 2015

pls add a release note. pls show the results of the benchmark for this (there is a join_split one, but you may have to add one)

@jreback jreback modified the milestones: 0.17.0, Next Major Release, 0.16.2 Jun 2, 2015
@jreback
Copy link
Contributor

jreback commented Jun 4, 2015

you can do (you might want to see which is better perf):

I think this can ONLY? return lists or scalars?

if isinstance(list, x): # might need com.is_list_like if this can return list/tuple/ndarray/Series
    # ok
else:
    [ x ]
...
In [8]: x = [ 'foo' ]

In [9]: pd.lib.isscalar(x) and pd.isnull(x)
Out[9]: False

In [10]: x = 'foo'

In [11]: pd.lib.isscalar(x) and pd.isnull(x)
Out[11]: False

In [12]: x = np.nan

In [13]: pd.lib.isscalar(x) and pd.isnull(x)
Out[13]: True

@jreback
Copy link
Contributor

jreback commented Jun 4, 2015

pls also add a note to 0.16.2 performance imprv section.

@cgevans cgevans force-pushed the fastsplit branch 2 times, most recently from 729311e to 70038ac Compare June 5, 2015 16:45
@cgevans
Copy link
Contributor Author

cgevans commented Jun 5, 2015

Here is the benchmark result:

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
strings_join_split_expand                    |  48.8970 | 1554.2669 |   0.0315 |
strings_join_split                           |  36.1670 |  36.3259 |   0.9956 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [5e65f02] : PERF: Increase performance of string split when expand=True
Base   [bc7d48f] : disable some deps on 3.2 build

@@ -47,6 +47,7 @@ Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`)
- Increase performance of string split when expand=True (:issue:`10081`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this str.split when expand=True

@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

minor whatsnew note change. pls ping when revised and will merge.

@cgevans
Copy link
Contributor Author

cgevans commented Jun 5, 2015

Done.

jreback added a commit that referenced this pull request Jun 5, 2015
PERF: increase performance of str_split when returning a frame
@jreback jreback merged commit bfe0b99 into pandas-dev:master Jun 5, 2015
@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

excellent ty!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: speed up certain string operations
2 participants