
Defer Series.str.get_dummies to pandas.get_dummies #26686


Closed

Conversation


@joshlk (Author) commented Jun 6, 2019

@jreback the tests are failing due to an import error, but I didn't change this file:

pandas/core/reshape/reshape.py:17: in <module>
        from pandas.core.arrays import SparseArray
E   ImportError: cannot import name 'SparseArray'
The command "ci/run_tests.sh" exited with 4.

Do you know why this is?

@TomAugspurger (Contributor) commented Jun 6, 2019 via email

@joshlk (Author) commented Jun 6, 2019

Thanks, I think you're right: it's a circular import.

> Can you import pandas locally?

Do you mean trying `import pandas.core.reshape.reshape as reshape` and then using `reshape.get_dummies`? That does not work.

What's the best way to resolve this? Should I move the function to another Python file?

@gfyoung added the Categorical (Categorical Data Type) and Enhancement labels on Jun 6, 2019
@codecov (bot) commented Jun 7, 2019

Codecov Report

Merging #26686 into master will decrease coverage by 49.94%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26686       +/-   ##
===========================================
- Coverage   91.87%   41.93%   -49.95%     
===========================================
  Files         174      174               
  Lines       50661    50697       +36     
===========================================
- Hits        46547    21260    -25287     
- Misses       4114    29437    +25323
| Flag | Coverage Δ |
|---|---|
| #multiple | ? |
| #single | 41.93% <0%> (+0.06%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| `pandas/core/strings.py` | 38.07% <0%> (-60.85%) ⬇️ |
| `pandas/io/formats/latex.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/io/sas/sas_constants.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/core/groupby/categorical.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/tseries/plotting.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/tseries/converter.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/io/formats/html.py` | 0% <0%> (-99.37%) ⬇️ |
| `pandas/io/sas/sas7bdat.py` | 0% <0%> (-91.16%) ⬇️ |
| `pandas/io/sas/sas_xport.py` | 0% <0%> (-90.1%) ⬇️ |
| `pandas/core/sparse/scipy_sparse.py` | 10.14% <0%> (-89.86%) ⬇️ |
| ... and 129 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 891a419...70e077a.

@codecov (bot) commented Jun 7, 2019

Codecov Report

Merging #26686 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26686      +/-   ##
==========================================
- Coverage   91.87%   91.87%   -0.01%     
==========================================
  Files         174      179       +5     
  Lines       50661    50689      +28     
==========================================
+ Hits        46547    46569      +22     
- Misses       4114     4120       +6
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.46% <100%> (+0.05%) ⬆️ |
| #single | 41.1% <100%> (-0.77%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| `pandas/core/strings.py` | 98.92% <100%> (ø) ⬆️ |
| `pandas/tseries/converter.py` | 0% <0%> (-100%) ⬇️ |
| `pandas/io/feather_format.py` | 25% <0%> (-64.48%) ⬇️ |
| `pandas/io/parquet.py` | 65.59% <0%> (-18.88%) ⬇️ |
| `pandas/core/computation/check.py` | 85.71% <0%> (-6.6%) ⬇️ |
| `pandas/core/config_init.py` | 92.91% <0%> (-3.94%) ⬇️ |
| `pandas/compat/__init__.py` | 92% <0%> (-1.55%) ⬇️ |
| `pandas/io/gbq.py` | 88.88% <0%> (-0.59%) ⬇️ |
| `pandas/core/frame.py` | 96.88% <0%> (-0.24%) ⬇️ |
| `pandas/core/arrays/integer.py` | 96.3% <0%> (-0.05%) ⬇️ |
| ... and 62 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 891a419...a98aa26.

@joshlk (Author) commented Jun 7, 2019

@TomAugspurger I resolved the issue with a local import. I hope that's OK.
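For readers following along, here is a minimal sketch of what resolving the cycle with a local (function-level) import could look like, assuming the pandas source layout of the time (get_dummies lived in pandas/core/reshape/reshape.py). Only the two import lines are taken from the review context further down; the rest of the body is an illustrative simplification, not the PR's actual diff.

```python
# Illustrative sketch of a local import in pandas/core/strings.py.
# The real helper returns a values/columns pair; a DataFrame is returned
# here only to keep the sketch short.

def str_get_dummies(arr, sep='|'):
    # Function-level imports are resolved at call time, after both modules
    # have finished loading, which side-steps the circular import between
    # pandas.core.strings and pandas.core.reshape.reshape seen in the CI log above.
    from pandas.core.reshape.reshape import get_dummies
    from pandas import Series

    # Normalise to strings, split on the separator, and defer the
    # one-hot encoding to pandas' top-level get_dummies.
    s = Series(arr).fillna('').astype('str')
    dummies = get_dummies(s.str.split(sep, expand=True).stack())
    return dummies.groupby(level=0).max()
```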

@TomAugspurger (Contributor) left a comment

How does the performance of this compare to the original?

Code context:

    empty = np.empty(0, dtype='object')
    return empty, empty

    arr = arr.fillna('').astype('str')

A contributor commented:

Why is this astype needed?

The author replied:

The astype is needed because Series.str.split otherwise fails when the data type is not a string (e.g. NaN).

One of the examples given in the docstring is:

>>> pd.Series(['a|b', np.nan, 'a|c']).str.get_dummies()
   a  b  c
0  1  1  0
1  0  0  0
2  1  0  1
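As a rough illustration of the point above, here is a self-contained sketch (not the PR's code) that reproduces the docstring output while deferring to pd.get_dummies. Filling NaN with '' first keeps the split working on non-string values; the groupby collapse and the dropping of the empty-token column are assumptions about how such a deferral could be wired up.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a|b', np.nan, 'a|c'])

# Fill NaN with '' so every element is a real string before splitting;
# the NaN row then contributes only an empty token instead of breaking the split.
tokens = s.fillna('').astype('str').str.split('|', expand=True).stack()

# One-hot encode the tokens with pd.get_dummies, collapse back to one row per
# original element, and drop the column produced by the empty token.
dummies = pd.get_dummies(tokens).groupby(level=0).max()
dummies = dummies.drop(columns='', errors='ignore')

print(dummies.astype(int))
#    a  b  c
# 0  1  1  0
# 1  0  0  0
# 2  1  0  1
```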

Code context:

    from pandas.core.reshape.reshape import get_dummies
    from pandas import Series

    if len(arr) == 0:

A contributor commented:

is this case not handled by get_dummies already?

The author replied:

np.concatenate fails with:

    stacked = Series(np.concatenate(arr_split))
    ValueError: need at least one array to concatenate

The author added:

Is there a nicer way to detect an empty input and return empty results? Should this be handled by pandas.core.strings.StringMethods.get_dummies instead?

The author added:

@jreback I have replaced `if len(arr) == 0: return empty` with `Series(np.concatenate(arr_split)) if len(arr) > 0 else Series()`, which sidesteps np.concatenate when the input is empty. Is this satisfactory?
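A small sketch of the failure mode and the conditional workaround, with arr_split as a hypothetical stand-in for the list of per-row split arrays (here empty):

```python
import numpy as np
from pandas import Series

arr_split = []  # hypothetical: the input Series was empty, so nothing was split

# np.concatenate([]) raises "ValueError: need at least one array to concatenate",
# so the empty case is handled inline rather than with a separate early return.
# (The PR used a bare Series(); an explicit dtype avoids the empty-Series dtype
# warning that some pandas versions emit.)
stacked = Series(np.concatenate(arr_split)) if len(arr_split) > 0 else Series(dtype=object)

print(len(stacked))  # 0
```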

@joshlk (Author) commented Jun 11, 2019

> How does the performance of this compare to the original?

s = pd.Series([*['a|b|asdf|s|d|d', np.nan, 'a|c']*10000])

%%timeit
s.str.get_dummies()

Original version: 155 ms ± 13.9 ms per loop
New version: 55.9 ms ± 3.94 ms per loop

@TomAugspurger (Contributor) left a comment

How is the performance of this approach compared to master? Do we have a benchmark for it in asv_bench?

@joshlk (Author) commented Jun 27, 2019

> How is the performance of this approach compared to master? Do we have a benchmark for it in asv_bench?

There is an existing benchmark strings.Dummies:

$ cd asv_bench
$ asv continuous upstream/master HEAD -b strings.Dummies
$ asv compare upstream/master HEAD

All benchmarks:

       before           after         ratio
     [891a419a]       [70e077ab]
     <master>         <defer_str_dummies_to_get_dummies>
       3.81±0.04s        7.47±0.7s    ~1.96  strings.Dummies.time_get_dummies

It shows that the new function takes about twice as long, which I suppose isn't great. However, the small test I did before (see the previous comment) showed a speed increase, so maybe it depends on the test.

@TomAugspurger (Contributor) commented Jul 8, 2019 via email

@WillAyd (Member) commented Aug 26, 2019

@joshlk can you address the merge conflict and the question above?

@jreback (Contributor) commented Sep 8, 2019

Closing as stale; if you'd like to continue, please ping to reopen.

@jreback closed this on Sep 8, 2019
@rtbs-dev commented Dec 26, 2019

Not sure if I'll have time to look into this, but I'm chiming in to mention that I came across it while facing the same issue. I'm implementing a pure-Python TextRank for short text that exploits a CSR matrix for the document-term matrix (based on this), and I originally hoped to use str.get_dummies() since it accepts a separator argument, but I ran into the same memory error with int64.

For now I'm having to fall back on scikit-learn for the document-term matrix computation, but get_dummies is so straightforward/streamlined for short text (e.g. Twitter) that interfacing with sparse arrays and networkx is really nice... no need for a full-on scikit-learn dependency in small utilities, etc.

If I do get some time to mess with this I'll see what I can do, but in the meantime consider this a vote for how convenient merging the two get_dummies() behaviors would be.
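For illustration, here is a minimal sketch of the scikit-learn fallback described above: building a sparse (CSR) document-term matrix from separator-delimited short text. The documents, the '|' separator, and the CountVectorizer settings are assumptions for a reasonably recent scikit-learn, not code from this thread.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a|b|asdf", "a|c", "b|c|c"]  # hypothetical separator-delimited documents

# Tokenize on the '|' separator instead of the default word regex and record
# presence/absence only (binary=True), mimicking str.get_dummies output but as
# a scipy.sparse CSR matrix rather than a dense int64 frame.
vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split("|"),
                             token_pattern=None, binary=True, lowercase=False)
dtm = vectorizer.fit_transform(docs)  # CSR matrix, shape (n_docs, n_terms)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```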

Labels: Categorical (Categorical Data Type), Enhancement

Successfully merging this pull request may close these issues:

ENH: Series.str.get_dummies should defer to pd.get_dummies and pass thru args
6 participants