Skip to content

BUG: int dtype for get_dummies #13796

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jul 26, 2016

closes #8725

Changes get_dummies return columns with uint8 dtypes instead of coercing to floats if they were alongside other float columns.

@TomAugspurger TomAugspurger added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Sparse Sparse Data Type labels Jul 26, 2016
@TomAugspurger TomAugspurger added this to the 0.19.0 milestone Jul 26, 2016
# import pdb; pdb.set_trace()
sarr = SparseArray(np.ones(len(ixs), dtype=np.int),
sparse_index=IntIndex(N, ixs), fill_value=0,
dtype=np.int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be the smallest ints that will fit; something like what to_numeric does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By definition the dummy-encoded columns will always be 0/1, so np.uint8?

The other option is bool, but all the get_dummies I've seen in other languages are 0/1, not False/True.

@jreback
Copy link
Contributor

jreback commented Jul 27, 2016

pls give a test on windows, as int dtype comparisons are wonky.

@jreback
Copy link
Contributor

jreback commented Jul 27, 2016

actually this may not work w/o #667

@sinhrks would know more

@TomAugspurger
Copy link
Contributor Author

Right I should have put a WIP in the title, sorry. That seems like a bit of a rabbit hole, but I'll give it a bit more time this week if I can.

On Jul 27, 2016, at 05:42, Jeff Reback [email protected] wrote:

actually this may not work w/o #667

@sinhrks would know more


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@sinhrks
Copy link
Member

sinhrks commented Jul 27, 2016

I'm just working on #667, hopefully can send something this week... :)

@TomAugspurger
Copy link
Contributor Author

Awesome, thanks :)

@jreback
Copy link
Contributor

jreback commented Aug 18, 2016

this has to be after #13849

@jreback jreback modified the milestones: 0.20.0, 0.19.0 Aug 18, 2016
@jorisvandenbossche
Copy link
Member

@TomAugspurger I merged the sparse PR of @sinhrks, so if you have the time, you can update this now

@TomAugspurger TomAugspurger force-pushed the get_dummies_dtype branch 3 times, most recently from bdfd72e to b48935c Compare August 31, 2016 17:46
@codecov-io
Copy link

codecov-io commented Aug 31, 2016

Current coverage is 85.27% (diff: 100%)

Merging #13796 into master will increase coverage by <.01%

@@             master     #13796   diff @@
==========================================
  Files           139        139          
  Lines         50553      50554     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43110      43111     +1   
  Misses         7443       7443          
  Partials          0          0          

Powered by Codecov. Last update 58199c5...cace0f7

@@ -1213,7 +1216,7 @@ def make_axis_dummies(frame, axis='minor', transform=None):
labels = cat.codes
items = cat.categories

values = np.eye(len(items), dtype=float)
values = np.eye(len(items), dtype=np.uint8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback this is a new change from what you saw earlier ( this is ``make_axis_dummies` in core/reshape.py.

One of the panel tests made use of it. I changed it for consistency with the new get_dummies, but happy to change it back to floats, and adjust the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically you need something like _coerce_indexer_dtype from https://github.com/pydata/pandas/blob/master/pandas/types/cast.py

as u don't know if it will fit in a uint8 (u might need. 16 or 32)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u might be able to use just uint and numpy will then tell u what dtype u need

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just doing uint went to a uint64 on my machine. I'm just going to revert this change. As far as I can tell this only affects the user publicly via pd.ols. Given that we're deprecating that, it doesn't make sense to change it now.

@TomAugspurger
Copy link
Contributor Author

Green now. Lots of (small) changes to the tests, so a second pair of eyes would be grateful.

@@ -1333,6 +1333,7 @@ Bug Fixes
- Bug in ``pd.merge()`` may raise ``TypeError`` if input datetime-like has other unit than ``ns`` (:issue:`13389`)

- Bug in ``HDFStore``/``read_hdf()`` discarded ``DatetimeIndex.name`` if ``tz`` was set (:issue:`13884`)
- Bug in ``pd.get_dummies`` returning dummy-encoded columns as floats, rather than integers (:issue:`8725`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a bug fix more of an API change (not implemented before) ; I would make a sub section to highlite

@jorisvandenbossche
Copy link
Member

Looks good to me

Closes pandas-dev#8725

Ensures that get_dummies on a DataFrame whose output is a mix of
floats / ints & dummy-encoded columns doesn't coerce the dummy-encoded
cols from uint8 to ints / floats.
@jreback
Copy link
Contributor

jreback commented Sep 2, 2016

thanks!

@jreback
Copy link
Contributor

jreback commented Sep 2, 2016

this makes this very memory efficient now! (and of course sparse is even better)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: dtype compat with get_dummies
5 participants