BUG: get_dummies not returning SparseDataFrame #10535
@jreback Perhaps this would be more elegantly fixed by having
xref #10536
I think #10536 should be fixed first. But this introduces the problem of when/how to convert a concat of series (that happen to be all SparseSeries) into a SparseDataFrame (as opposed to a DataFrame).
@@ -957,13 +957,15 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
         If `columns` is None then all the columns with
         `object` or `category` dtype will be converted.
     sparse : bool, default False
-        Whether the returned DataFrame should be sparse or not.
+        Whether the dummy columns should be sparse or not. Returns
+        SparseDataFrame if `data` is a Series or if all columns are included.
this is confusing, I think we should just always return a SparseDataFrame.
What about the case when some of the blocks are dense? For example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a':['A','B','C'],'b':['D','E','D']})
In [3]: pd.get_dummies(df, sparse=True, columns='b')
Out[3]:
   a  b_D  b_E
0  A    1    0
1  B    0    1
2  C    1    0
Here `a` is still dense. Is a df where some blocks are dense and some sparse more accurately called a `DataFrame` or a `SparseDataFrame`?
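For readers on current pandas, the mixed dense/sparse case raised here can be inspected column by column. The following is a minimal sketch, assuming pandas 1.0 or later, where `SparseDataFrame` no longer exists and `get_dummies(..., sparse=True)` returns a plain `DataFrame` whose dummy columns carry a sparse dtype. It is not the behaviour of the pandas version this PR targets, and the variable names `dummies` and `is_sparse` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": ["A", "B", "C"], "b": ["D", "E", "D"]})

# Encode only column "b"; column "a" is passed through unchanged.
# `columns` is given as a list, which newer pandas versions require.
dummies = pd.get_dummies(df, sparse=True, columns=["b"])

# Map each column name to whether its dtype is a sparse dtype.
is_sparse = {
    col: isinstance(dtype, pd.SparseDtype)
    for col, dtype in dummies.dtypes.items()
}
print(is_sparse)
```

Under that assumption, `a` reports as dense while `b_D` and `b_E` report as sparse, so the result is an ordinary `DataFrame` with a mix of dense and sparse columns.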
@jreback ?
Force-pushed from 7296068 to ec58429.
self.assertEqual(type(r[['a_0']]._data.blocks[0]), exp_blk_type)
self.assertEqual(type(r[['a_1']]._data.blocks[0]), exp_blk_type)
self.assertEqual(type(r[['a_2']]._data.blocks[0]), exp_blk_type)
@jreback How about now?
OK, looks reasonable. Ping when green.
FYI: don't take blank lines out of the whatsnew. I put them there on purpose, since they help avoid merge conflicts (in fact, you have one now because of that).
@jreback OK, green now.
Sorry about the whatsnew newlines (though I don't think there's a merge conflict?). So, where is a good place to insert whatsnew notes? Somewhere between two newlines?
merged via d065374 thanks!
Fixes #10531.