
Defer Series.str.get_dummies to pandas.get_dummies #26686


Closed

Conversation


@joshlk (Author) commented Jun 6, 2019

@jreback the tests are failing due to an import error, but I didn't change this file:

pandas/core/reshape/reshape.py:17: in <module>
        from pandas.core.arrays import SparseArray
E   ImportError: cannot import name 'SparseArray'
The command "ci/run_tests.sh" exited with 4.

Do you know why this is?

@TomAugspurger (Contributor) commented Jun 6, 2019 via email

@joshlk (Author) commented Jun 6, 2019

Thanks, I think you're right: it's a circular import.

> Can you import pandas locally?

Do you mean trying `import pandas.core.reshape.reshape as reshape` and then using `reshape.get_dummies`? That does not work.

What's the best way to resolve this? Should I move the function to another Python file?

@gfyoung added the Categorical (Categorical Data Type) and Enhancement labels on Jun 6, 2019
@codecov (bot) commented Jun 7, 2019

Codecov Report

Merging #26686 into master will decrease coverage by 49.94%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26686       +/-   ##
===========================================
- Coverage   91.87%   41.93%   -49.95%     
===========================================
  Files         174      174               
  Lines       50661    50697       +36     
===========================================
- Hits        46547    21260    -25287     
- Misses       4114    29437    +25323
| Flag | Coverage Δ |
|---|---|
| #multiple | ? |
| #single | 41.93% <0%> (+0.06%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| `pandas/core/strings.py` | 38.07% <0%> (-60.85%) ⬇️ |
| `pandas/io/formats/latex.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/io/sas/sas_constants.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/core/groupby/categorical.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/tseries/plotting.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/tseries/converter.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/io/formats/html.py` | 0% <0%> (-99.37%) ⬇️ |
| `pandas/io/sas/sas7bdat.py` | 0% <0%> (-91.16%) ⬇️ |
| `pandas/io/sas/sas_xport.py` | 0% <0%> (-90.1%) ⬇️ |
| `pandas/core/sparse/scipy_sparse.py` | 10.14% <0%> (-89.86%) ⬇️ |
| ... and 129 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 891a419...70e077a.

@codecov (bot) commented Jun 7, 2019

Codecov Report

Merging #26686 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26686      +/-   ##
==========================================
- Coverage   91.87%   91.87%   -0.01%     
==========================================
  Files         174      179       +5     
  Lines       50661    50689      +28     
==========================================
+ Hits        46547    46569      +22     
- Misses       4114     4120       +6
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.46% <100%> (+0.05%) ⬆️ |
| #single | 41.1% <100%> (-0.77%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| `pandas/core/strings.py` | 98.92% <100%> (ø) ⬆️ |
| `pandas/tseries/converter.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/io/feather_format.py` | 25% <0%> (-64.48%) ⬇️ |
| `pandas/io/parquet.py` | 65.59% <0%> (-18.88%) ⬇️ |
| `pandas/core/computation/check.py` | 85.71% <0%> (-6.6%) ⬇️ |
| `pandas/core/config_init.py` | 92.91% <0%> (-3.94%) ⬇️ |
| `pandas/compat/__init__.py` | 92% <0%> (-1.55%) ⬇️ |
| `pandas/io/gbq.py` | 88.88% <0%> (-0.59%) ⬇️ |
| `pandas/core/frame.py` | 96.88% <0%> (-0.24%) ⬇️ |
| `pandas/core/arrays/integer.py` | 96.3% <0%> (-0.05%) ⬇️ |
| ... and 62 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 891a419...a98aa26.

@joshlk (Author) commented Jun 7, 2019

@TomAugspurger I resolved the issue with a local import. I hope that's OK.
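For readers following along, here is a minimal sketch of what resolving the cycle with a local (function-level) import could look like, assuming the pandas source layout of the time (get_dummies lived in pandas/core/reshape/reshape.py). Only the two import lines are taken from the review context further down; the rest of the body is an illustrative simplification, not the PR's actual diff.

```python
# Illustrative sketch of a local import in pandas/core/strings.py.
# The real helper returns a values/columns pair; a DataFrame is returned
# here only to keep the sketch short.

def str_get_dummies(arr, sep='|'):
    # Function-level imports are resolved at call time, after both modules
    # have finished loading, which side-steps the circular import between
    # pandas.core.strings and pandas.core.reshape.reshape seen in the CI log above.
    from pandas.core.reshape.reshape import get_dummies
    from pandas import Series

    # Normalise to strings, split on the separator, and defer the
    # one-hot encoding to pandas' top-level get_dummies.
    s = Series(arr).fillna('').astype('str')
    dummies = get_dummies(s.str.split(sep, expand=True).stack())
    return dummies.groupby(level=0).max()
```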

@TomAugspurger (Contributor) left a comment

How does the performance of this compare to the original?

Code context:

    empty = np.empty(0, dtype='object')
    return empty, empty

    arr = arr.fillna('').astype('str')

A contributor commented:

Why is this astype needed?

The author replied:

The astype is needed because Series.str.split otherwise fails when the data type is not a string (e.g. NaN).

One of the examples given in the docstring is:

>>> pd.Series(['a|b', np.nan, 'a|c']).str.get_dummies()
   a  b  c
0  1  1  0
1  0  0  0
2  1  0  1
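As a rough illustration of the point above, here is a self-contained sketch (not the PR's code) that reproduces the docstring output while deferring to pd.get_dummies. Filling NaN with '' first keeps the split working on non-string values; the groupby collapse and the dropping of the empty-token column are assumptions about how such a deferral could be wired up.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a|b', np.nan, 'a|c'])

# Fill NaN with '' so every element is a real string before splitting;
# the NaN row then contributes only an empty token instead of breaking the split.
tokens = s.fillna('').astype('str').str.split('|', expand=True).stack()

# One-hot encode the tokens with pd.get_dummies, collapse back to one row per
# original element, and drop the column produced by the empty token.
dummies = pd.get_dummies(tokens).groupby(level=0).max()
dummies = dummies.drop(columns='', errors='ignore')

print(dummies.astype(int))
#    a  b  c
# 0  1  1  0
# 1  0  0  0
# 2  1  0  1
```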

Code context:

    from pandas.core.reshape.reshape import get_dummies
    from pandas import Series

    if len(arr) == 0:

A contributor commented:

is this case not handled by get_dummies already?

The author replied:

np.concatenate fails with:

    stacked = Series(np.concatenate(arr_split))
    ValueError: need at least one array to concatenate

The author added:

Is there a nicer way to detect an empty input and return empty results? Should this be handled by pandas.core.strings.StringMethods.get_dummies instead?

The author added:

@jreback I have replaced `if len(arr) == 0: return empty` with `Series(np.concatenate(arr_split)) if len(arr) > 0 else Series()`, which sidesteps np.concatenate when the input is empty. Is this satisfactory?
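A small sketch of the failure mode and the conditional workaround, with arr_split as a hypothetical stand-in for the list of per-row split arrays (here empty):

```python
import numpy as np
from pandas import Series

arr_split = []  # hypothetical: the input Series was empty, so nothing was split

# np.concatenate([]) raises "ValueError: need at least one array to concatenate",
# so the empty case is handled inline rather than with a separate early return.
# (The PR used a bare Series(); an explicit dtype avoids the empty-Series dtype
# warning that some pandas versions emit.)
stacked = Series(np.concatenate(arr_split)) if len(arr_split) > 0 else Series(dtype=object)

print(len(stacked))  # 0
```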

@joshlk (Author) commented Jun 11, 2019

> How does the performance of this compare to the original?

s = pd.Series([*['a|b|asdf|s|d|d', np.nan, 'a|c']*10000])

%%timeit
s.str.get_dummies()

Original version: 155 ms ± 13.9 ms per loop
New version: 55.9 ms ± 3.94 ms per loop

@TomAugspurger (Contributor) left a comment

How is the performance of this approach compared to master? Do we have a benchmark for it in asv_bench?

@joshlk (Author) commented Jun 27, 2019

> How is the performance of this approach compared to master? Do we have a benchmark for it in asv_bench?

There is an existing benchmark strings.Dummies:

$ cd asv_bench
$ asv continuous upstream/master HEAD -b strings.Dummies
$ asv compare upstream/master HEAD

All benchmarks:

       before           after         ratio
     [891a419a]       [70e077ab]
     <master>         <defer_str_dummies_to_get_dummies>
       3.81±0.04s        7.47±0.7s    ~1.96  strings.Dummies.time_get_dummies

It shows that the new function takes about twice as long, which I suppose isn't great. However, the small test I did before (see the previous comment) showed a speed increase, so maybe it depends on the test.

@TomAugspurger (Contributor) commented Jul 8, 2019 via email

@WillAyd (Member) commented Aug 26, 2019

@joshlk can you address the merge conflict and the question above?

@jreback (Contributor) commented Sep 8, 2019

Closing as stale; if you'd like to continue, please ping to reopen.

@jreback closed this on Sep 8, 2019
@rtbs-dev commented Dec 26, 2019

Not sure if I'll have time to look into this, but I'm chiming in to mention that I came across it while facing the same issue. I'm implementing a pure-Python TextRank for short text that exploits a CSR matrix for the document-term matrix (based on this), and I originally hoped to use str.get_dummies() since it accepts a separator argument, but I ran into the same memory error with int64.

For now I'm having to fall back on scikit-learn for the document-term matrix computation, but get_dummies is so straightforward/streamlined for short text (e.g. Twitter) that interfacing with sparse arrays and networkx is really nice... no need for a full-on scikit-learn dependency in small utilities, etc.

If I do get some time to mess with this I'll see what I can do, but in the meantime consider this a vote for how convenient merging the two get_dummies() behaviors would be.
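For illustration, here is a minimal sketch of the scikit-learn fallback described above: building a sparse (CSR) document-term matrix from separator-delimited short text. The documents, the '|' separator, and the CountVectorizer settings are assumptions for a reasonably recent scikit-learn, not code from this thread.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a|b|asdf", "a|c", "b|c|c"]  # hypothetical separator-delimited documents

# Tokenize on the '|' separator instead of the default word regex and record
# presence/absence only (binary=True), mimicking str.get_dummies output but as
# a scipy.sparse CSR matrix rather than a dense int64 frame.
vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split("|"),
                             token_pattern=None, binary=True, lowercase=False)
dtm = vectorizer.fit_transform(docs)  # CSR matrix, shape (n_docs, n_terms)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```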

Labels: Categorical (Categorical Data Type), Enhancement

Successfully merging this pull request may close these issues:

ENH: Series.str.get_dummies should defer to pd.get_dummies and pass thru args
6 participants