-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: int dtype for get_dummies #13796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# import pdb; pdb.set_trace() | ||
sarr = SparseArray(np.ones(len(ixs), dtype=np.int), | ||
sparse_index=IntIndex(N, ixs), fill_value=0, | ||
dtype=np.int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be the smallest ints that will fit; something like what to_numeric does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By definition the dummy-encoded columns will always be 0/1, so np.uint8
?
The other option is bool
, but all the get_dummies
I've seen in other languages are 0/1, not False/True.
pls give a test on windows, as int dtype comparisons are wonky. |
Right I should have put a WIP in the title, sorry. That seems like a bit of a rabbit hole, but I'll give it a bit more time this week if I can.
|
I'm just working on #667, hopefully can send something this week... :) |
Awesome, thanks :) |
this has to be after #13849 |
@TomAugspurger I merged the sparse PR of @sinhrks, so if you have the time, you can update this now |
bdfd72e
to
b48935c
Compare
Current coverage is 85.27% (diff: 100%)@@ master #13796 diff @@
==========================================
Files 139 139
Lines 50553 50554 +1
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43110 43111 +1
Misses 7443 7443
Partials 0 0
|
@@ -1213,7 +1216,7 @@ def make_axis_dummies(frame, axis='minor', transform=None): | |||
labels = cat.codes | |||
items = cat.categories | |||
|
|||
values = np.eye(len(items), dtype=float) | |||
values = np.eye(len(items), dtype=np.uint8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback this is a new change from what you saw earlier ( this is ``make_axis_dummies` in core/reshape.py.
One of the panel tests made use of it. I changed it for consistency with the new get_dummies
, but happy to change it back to floats, and adjust the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically you need something like _coerce_indexer_dtype from https://github.com/pydata/pandas/blob/master/pandas/types/cast.py
as u don't know if it will fit in a uint8 (u might need. 16 or 32)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u might be able to use just uint and numpy will then tell u what dtype u need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just doing uint
went to a uint64 on my machine. I'm just going to revert this change. As far as I can tell this only affects the user publicly via pd.ols
. Given that we're deprecating that, it doesn't make sense to change it now.
Green now. Lots of (small) changes to the tests, so a second pair of eyes would be grateful. |
@@ -1333,6 +1333,7 @@ Bug Fixes | |||
- Bug in ``pd.merge()`` may raise ``TypeError`` if input datetime-like has other unit than ``ns`` (:issue:`13389`) | |||
|
|||
- Bug in ``HDFStore``/``read_hdf()`` discarded ``DatetimeIndex.name`` if ``tz`` was set (:issue:`13884`) | |||
- Bug in ``pd.get_dummies`` returning dummy-encoded columns as floats, rather than integers (:issue:`8725`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a bug fix more of an API change (not implemented before) ; I would make a sub section to highlite
b48935c
to
7a9171a
Compare
Looks good to me |
Closes pandas-dev#8725 Ensures that get_dummies on a DataFrame whose output is a mix of floats / ints & dummy-encoded columns doesn't coerce the dummy-encoded cols from uint8 to ints / floats.
7a9171a
to
cace0f7
Compare
thanks! |
this makes this very memory efficient now! (and of course sparse is even better) |
closes #8725
Changes
get_dummies
return columns withuint8
dtypes instead of coercing to floats if they were alongside other float columns.