PERF: increase performance of str_split when returning a frame #10090

cgevans · 2015-05-09T05:21:22Z

This simply removes the unnecessary Series() in the creation of the DataFrame for the split values in str_split. It vastly improves performance (20ms vs 3s for a series of 20,000 strings as a test) while not changing current behavior.

cgevans · 2015-05-09T09:06:51Z

Hmm... this actually causes significant problems when dealing with NaNs.

jreback · 2015-05-09T15:33:49Z

@cgevans this code path has changed in master a bit (though issue still remains).

jreback · 2015-05-21T13:30:16Z

@cgevans can you rebase and see where this is?

cgevans · 2015-05-24T21:31:44Z

The change in the code path means this particular fix would need to be rewritten, but it seems this would need further changes to avoid changing behavior. I've written a bit about the problem in #10081.

One solution is to change the DataFrame constructor such that it handles lists of lists differently, making it consistent with lists of Series.

The other, somewhat hackish method is to take the array that comes from str_split, change all NaNs to [None], construct a DataFrame, and then change all Nones to NaNs.
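
For readers following along, the second ("hackish") option could be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: the helper name `expand_split` is hypothetical, and it uses the current public API (`pd.isna`, `DataFrame.where`) rather than the 2015 internals.

```python
import numpy as np
import pandas as pd


def expand_split(split_values):
    """Hypothetical helper: expand a Series of lists/NaN into a DataFrame."""
    # Step 1: swap each lone missing value for [None] so every row is a list.
    rows = [
        [None] if (not isinstance(x, list)) and pd.isna(x) else x
        for x in split_values
    ]
    # Step 2: build the frame directly from the list of lists
    # (no per-row Series construction).
    frame = pd.DataFrame(rows, index=split_values.index)
    # Step 3: normalise every missing entry (None or NaN) back to NaN.
    return frame.where(frame.notna(), np.nan)


split_values = pd.Series(["a_b_c", np.nan, "d_e"]).str.split("_")
frame = expand_split(split_values)
print(frame)
```

The point of the sketch is that the only per-row Python work left is a cheap type check, which is where the speedup over building one Series per row would come from.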

cgevans · 2015-05-25T09:06:50Z

This new version now tries to handle NaNs in a bit of a roundabout way, but still in a manner that will be fast.

jreback · 2015-05-26T10:41:55Z

pandas/core/strings.py

                cons = self.series._constructor_expanddim
-                data = [cons_row(x) for x in result]
-                return cons(data, index=index)
+                data = [x if (x is not np.nan) else [None] for x in result]


just redefine cons_row as a function. e.g. something like

    def cons_row(x):
        if np.isnan(x):
            return [ x ]
        return x

you don't need to fill

jreback · 2015-06-02T10:51:53Z

pandas/core/strings.py

@@ -1090,7 +1090,11 @@ def _wrap_result_expand(self, result, expand=False):
        else:
            index = self.series.index
            if expand:
-                cons_row = self.series._constructor
+                def cons_row(x):
+                    if x is np.nan:


use np.isnan(x)

cgevans (reply, Jun 4, 2015):

I initially tried that, but np.isnan can't handle strings. Here x is either NaN or a list of strings, and I only want the conditional to be true if x is a single null value. I agree that "x is np.nan" isn't ideal; what would be a better way of doing this?
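
A small sketch of the two pitfalls discussed in this exchange, for illustration only. The helper name `isnan_raises` is hypothetical; the behavior shown is that of NumPy's `np.isnan` on non-numeric input and of Python's identity comparison.

```python
import numpy as np


def isnan_raises(x):
    """Hypothetical helper: report whether np.isnan rejects this input."""
    try:
        np.isnan(x)
    except TypeError:
        # np.isnan is only defined for numeric input, so strings
        # (or lists of strings) raise TypeError.
        return True
    return False


print(isnan_raises(["a", "b"]))  # a split result: list of strings
print(isnan_raises(np.nan))      # a missing entry: plain float NaN

# The identity check works for the np.nan object itself, but a NaN
# created elsewhere is a different float object and is not caught.
other_nan = float("nan")
print(other_nan is np.nan)
```

This is why neither a bare `np.isnan(x)` nor `x is np.nan` is a fully satisfying test for "x is a single null value" here.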

jreback · 2015-06-02T10:53:26Z

pls add a release note. pls show the results of the benchmark for this (there is a join_split one, but you may have to add one)

jreback · 2015-06-04T10:29:22Z

you can do (you might want to see which is better perf):

I think this can only return lists or scalars, right?

if isinstance(x, list):  # might need com.is_list_like if this can return list/tuple/ndarray/Series
    pass  # ok
else:
    x = [x]
...

In [8]: x = [ 'foo' ]

In [9]: pd.lib.isscalar(x) and pd.isnull(x)
Out[9]: False

In [10]: x = 'foo'

In [11]: pd.lib.isscalar(x) and pd.isnull(x)
Out[11]: False

In [12]: x = np.nan

In [13]: pd.lib.isscalar(x) and pd.isnull(x)
Out[13]: True
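
Putting the check from the session above into `cons_row` might look like the sketch below. It is an assumption-laden illustration, not the merged code: `pd.lib.isscalar` and `pd.isnull` are written here with their current public counterparts, `pandas.api.types.is_scalar` and `pd.isna`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def cons_row(x):
    # Wrap a lone missing value in a list so that every row handed to the
    # DataFrame constructor is list-like; pass real split results through.
    if is_scalar(x) and pd.isna(x):
        return [x]
    return x


result = [["foo", "bar"], np.nan, ["baz"]]
rows = [cons_row(x) for x in result]
print(rows)
```

Checking `is_scalar` first means `pd.isna` is never called on a list, so the condition is true only for a single null value.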

jreback · 2015-06-04T10:29:59Z

pls also add a note to 0.16.2 performance imprv section.

cgevans · 2015-06-05T16:46:59Z

Here is the benchmark result:

Invoked with :
--ncalls: 3
--repeats: 3


--------------------------------------------------------------------------------
Test name                                    | head[ms] |  base[ms] |  ratio   |
--------------------------------------------------------------------------------
strings_join_split_expand                    |  48.8970 | 1554.2669 |   0.0315 |
strings_join_split                           |  36.1670 |   36.3259 |   0.9956 |
--------------------------------------------------------------------------------
Test name                                    | head[ms] |  base[ms] |  ratio   |
--------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [5e65f02] : PERF: Increase performance of string split when expand=True
Base   [bc7d48f] : disable some deps on 3.2 build

jreback · 2015-06-05T16:49:06Z

doc/source/whatsnew/v0.16.2.txt

@@ -47,6 +47,7 @@ Performance Improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

 - Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`)
+- Increase performance of string split when expand=True (:issue:`10081`)


make this str.split when expand=True

jreback · 2015-06-05T16:49:45Z

minor whatsnew note change. pls ping when revised and will merge.

cgevans · 2015-06-05T16:52:08Z

Done.

jreback · 2015-06-05T16:54:14Z

excellent ty!

jreback added the Performance and Strings labels May 9, 2015

jreback added this to the Next Major Release milestone May 9, 2015

cgevans force-pushed the fastsplit branch 2 times, most recently from a040940 to 4a2ac33 Compare May 25, 2015 12:51

jreback reviewed May 26, 2015

cgevans force-pushed the fastsplit branch from 4a2ac33 to c155059 Compare May 26, 2015 11:36

jreback reviewed Jun 2, 2015

jreback modified the milestones: 0.17.0, Next Major Release, 0.16.2 Jun 2, 2015

cgevans force-pushed the fastsplit branch 2 times, most recently from 729311e to 70038ac Compare June 5, 2015 16:45

jreback reviewed Jun 5, 2015

PERF: increase performance of string split when expand=True

86cccb0

cgevans force-pushed the fastsplit branch from 70038ac to 86cccb0 Compare June 5, 2015 16:51

jreback added a commit that referenced this pull request Jun 5, 2015

Merge pull request #10090 from cgevans/fastsplit

bfe0b99

PERF: increase performance of str_split when returning a frame

jreback merged commit bfe0b99 into pandas-dev:master Jun 5, 2015

cgevans deleted the fastsplit branch June 5, 2015 16:54

wbadart mentioned this pull request Jul 10, 2020

PERF: Allow str.split callers to skip expensive post-processing #35223

Closed

4 tasks
