-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Defer Series.str.get_dummies to pandas.get_dummies #26686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jreback the test are failing due to an import but I didn't change this file:
Do you know why this is? |
Possibly a circular import. Can you `import pandas` locally?
…On Thu, Jun 6, 2019 at 4:37 AM Josh Levy-Kramer ***@***.***> wrote:
@jreback <https://github.com/jreback> the test are failing due to an
import but I didn't change this file:
pandas/core/reshape/reshape.py:17: in <module>
from pandas.core.arrays import SparseArray
E ImportError: cannot import name 'SparseArray'
The command "ci/run_tests.sh" exited with 4.
Do you know why this is?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#26686?email_source=notifications&email_token=AAKAOIVQMBROVH2WU2HIIBDPZDLFLA5CNFSM4HU64FC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXCJTNA#issuecomment-499423668>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIW6GQISYMAT5CXSX3LPZDLFLANCNFSM4HU64FCQ>
.
|
Thanks I think your right - it's a circular import.
Do you mean try Whats the best say to resolve this. Should I move the function to another python file? |
Codecov Report
@@ Coverage Diff @@
## master #26686 +/- ##
===========================================
- Coverage 91.87% 41.93% -49.95%
===========================================
Files 174 174
Lines 50661 50697 +36
===========================================
- Hits 46547 21260 -25287
- Misses 4114 29437 +25323
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26686 +/- ##
==========================================
- Coverage 91.87% 91.87% -0.01%
==========================================
Files 174 179 +5
Lines 50661 50689 +28
==========================================
+ Hits 46547 46569 +22
- Misses 4114 4120 +6
Continue to review full report at Codecov.
|
@TomAugspurger I resolved the issue by locally importing. I hope thats ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the performance of this compare to the original?
empty = np.empty(0, dtype='object') | ||
return empty, empty | ||
|
||
arr = arr.fillna('').astype('str') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this astype needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astype
is needed as otherwise Series.str.split
fails when the datatype is not a string (e.g. NaN).
One of the example given in the docstring is:
>>> pd.Series(['a|b', np.nan, 'a|c']).str.get_dummies()
a b c
0 1 1 0
1 0 0 0
2 1 0 1
pandas/core/strings.py
Outdated
from pandas.core.reshape.reshape import get_dummies | ||
from pandas import Series | ||
|
||
if len(arr) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this case not handled by get_dummies already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.concatenate
fails with:
stacked = Series(np.concatenate(arr_split))
ValueError: need at least one array to concatenate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a nicer way to detect an empty input and output empty results? Should this be handled by pandas.core.strings.StringMethods.get_dummies
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I have replaced if len(arr) == 0: return empty
with Series(np.concatenate(arr_split)) if len(arr) > 0 else Series()
which side steps np.concatenate
when the input is empty. Is this satisfactory?
Original version: New: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the performance of this approach compared to master? Do we have a benchmark for it in asv_bench
?
There is an existing benchmark $ cd asv_bench
$ asv continuous upstream/master HEAD -b strings.Dummies
$ asv compare upstream/master HEAD
All benchmarks:
before after ratio
[891a419a] [70e077ab]
<master> <defer_str_dummies_to_get_dummies>
3.81±0.04s 7.47±0.7s ~1.96 strings.Dummies.time_get_dummies It shows that the new function takes about twice as long - which I suppose isn't great. However, the small test I did before (see previous comment) showed a speed increase. So maybe it depends on the test |
Can you do a bit of research / profiling on when and why the new
implementation is slower?
…On Thu, Jun 27, 2019 at 8:42 AM Josh Levy-Kramer ***@***.***> wrote:
How is the performance of this approach compared to master? Do we have a
benchmark for it in asv_bench?
There is an existing benchmark strings.Dummies:
$ asv continuous upstream/master HEAD -s -b strings.Dummies
$ asv compare upstream/master HEAD
All benchmarks:
before after ratio
[891a419] [70e077a]
<master> <defer_str_dummies_to_get_dummies>
3.81±0.04s 7.47±0.7s ~1.96 strings.Dummies.time_get_dummies
It shows that the new function takes about twice as long - which I suppose
isn't great. However, the small test I did before (see previous comment)
showed a speed increase. So maybe it depends on the test
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#26686?email_source=notifications&email_token=AAKAOIXMFFBAEM6TVYK4ZSTP4S7VVA5CNFSM4HU64FC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYXE62Q#issuecomment-506351466>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIRW4GI7UA2R7PZZYXLP4S7VVANCNFSM4HU64FCQ>
.
|
@joshlk can you address merge conflict and question above? |
closing as stale, if you'd like to continue pls ping to reopen. |
Not sure if I'll have time to look into this, but chiming in to mention I came across this while facing the same issue. Implementing a pure-python text-rank for short text that exploits CSR matrix for document-term (based on this, and originally hoped to use the For now having to fall back on scikit-learn for document-term matrix computation, but the get_dummies function is so straight-forward/streamlined for short text (e.g. twitter) that interfacing with sparse arrays and networkx is really nice...no need for full-on sklearn dependencies in small utilities, etc. If I do get some time to mess with this I'll see what I can do, but in the mean time just a vote for how convenient merging the two |
git diff upstream/master -u -- "*.py" | flake8 --diff