BUG: int dtype for get_dummies #13796

TomAugspurger · 2016-07-26T01:39:52Z

Changes get_dummies to return dummy-encoded columns with uint8 dtype instead of coercing them to floats when they sit alongside other float columns.
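A minimal sketch of the behaviour described above, assuming a pandas version whose `get_dummies` accepts the `dtype` keyword. The frame and column names are made up for illustration, and `dtype=np.uint8` is passed explicitly because the default dummy dtype has changed between pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one float column next to one string column to encode.
df = pd.DataFrame({"x": [0.5, 1.5, 2.5], "c": ["a", "b", "a"]})

# Request uint8 explicitly rather than relying on the version-dependent default.
dummies = pd.get_dummies(df, dtype=np.uint8)

# The float column keeps its dtype; the dummy columns are not upcast to float.
print(dummies.dtypes)
```

The point of the example is that the float column `x` and the dummy columns `c_a` / `c_b` keep separate dtypes in the result.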

jreback · 2016-07-26T01:57:56Z

pandas/core/reshape.py

+            # import pdb; pdb.set_trace()
+            sarr = SparseArray(np.ones(len(ixs), dtype=np.int),
+                               sparse_index=IntIndex(N, ixs), fill_value=0,
+                               dtype=np.int)


these should be the smallest ints that will fit; something like what to_numeric does

By definition the dummy-encoded columns will always be 0/1, so np.uint8?

The other option is bool, but all the get_dummies I've seen in other languages are 0/1, not False/True.
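For reference, a small sketch of the "smallest ints that will fit" idea using `pd.to_numeric` with `downcast="unsigned"`. This only illustrates the downcasting behaviour mentioned in the review; it is not the code used in this PR, and the series is invented for the example.

```python
import pandas as pd

# A 0/1 indicator stored in a wide integer dtype.
indicator = pd.Series([0, 1, 1, 0], dtype="int64")

# downcast="unsigned" picks the smallest unsigned dtype that can hold the values.
smallest = pd.to_numeric(indicator, downcast="unsigned")

print(smallest.dtype)
```

Since dummy columns only ever hold 0 and 1, the smallest unsigned dtype is always the 8-bit one, which is why hard-coding np.uint8 is equivalent here.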

jreback · 2016-07-27T10:41:47Z

pls give a test on windows, as int dtype comparisons are wonky.

jreback · 2016-07-27T10:42:32Z

actually this may not work w/o #667

@sinhrks would know more

TomAugspurger · 2016-07-27T11:19:02Z

Right I should have put a WIP in the title, sorry. That seems like a bit of a rabbit hole, but I'll give it a bit more time this week if I can.

sinhrks · 2016-07-27T12:07:51Z

I'm just working on #667, hopefully can send something this week... :)

TomAugspurger · 2016-07-27T12:28:47Z

Awesome, thanks :)

jreback · 2016-08-18T22:57:16Z

this has to be after #13849

jorisvandenbossche · 2016-08-31T08:01:22Z

@TomAugspurger I merged the sparse PR of @sinhrks, so if you have the time, you can update this now

codecov-io · 2016-08-31T21:17:03Z

Current coverage is 85.27% (diff: 100%)

Merging #13796 into master will increase coverage by <.01%

@@             master     #13796   diff @@
==========================================
  Files           139        139          
  Lines         50553      50554     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43110      43111     +1   
  Misses         7443       7443          
  Partials          0          0

Powered by Codecov. Last update 58199c5...cace0f7

TomAugspurger · 2016-08-31T21:21:28Z

pandas/core/reshape.py

@@ -1213,7 +1216,7 @@ def make_axis_dummies(frame, axis='minor', transform=None):
        labels = cat.codes
        items = cat.categories

-    values = np.eye(len(items), dtype=float)
+    values = np.eye(len(items), dtype=np.uint8)


@jreback this is a new change from what you saw earlier (this is ``make_axis_dummies`` in core/reshape.py).

One of the panel tests made use of it. I changed it for consistency with the new get_dummies, but happy to change it back to floats, and adjust the test.

technically you need something like _coerce_indexer_dtype from https://github.com/pydata/pandas/blob/master/pandas/types/cast.py

as u don't know if it will fit in a uint8 (u might need 16 or 32)

u might be able to use just uint and numpy will then tell u what dtype u need

Just doing uint went to a uint64 on my machine. I'm just going to revert this change. As far as I can tell this only affects the user publicly via pd.ols. Given that we're deprecating that, it doesn't make sense to change it now.
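A rough sketch of choosing an unsigned dtype by value range, using NumPy's `np.min_scalar_type`. This is a stand-in for illustration, not pandas' `_coerce_indexer_dtype`; the helper name `smallest_uint_dtype` is hypothetical. Note that `np.uint` itself is a platform-dependent alias, which is consistent with the uint64 result reported above.

```python
import numpy as np


def smallest_uint_dtype(max_value):
    """Hypothetical helper: smallest NumPy dtype able to hold max_value."""
    # For a non-negative Python int, min_scalar_type returns an unsigned dtype.
    return np.min_scalar_type(max_value)


# An identity-style indicator matrix only ever contains 0 and 1.
indicator = np.eye(3, dtype=np.uint8)

print(smallest_uint_dtype(int(indicator.max())))
print(smallest_uint_dtype(300))
print(smallest_uint_dtype(70000))
```

Because the values in the indicator matrix never exceed 1, the 8-bit dtype is sufficient regardless of how many items there are; wider dtypes only matter when the stored values themselves grow.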

TomAugspurger · 2016-08-31T21:22:00Z

Green now. Lots of (small) changes to the tests, so a second pair of eyes would be appreciated.

jreback · 2016-08-31T21:23:42Z

doc/source/whatsnew/v0.19.0.txt

@@ -1333,6 +1333,7 @@ Bug Fixes
 - Bug in ``pd.merge()`` may raise ``TypeError`` if input datetime-like has other unit than ``ns`` (:issue:`13389`)

 - Bug in ``HDFStore``/``read_hdf()`` discarded ``DatetimeIndex.name`` if ``tz`` was set (:issue:`13884`)
+- Bug in ``pd.get_dummies`` returning dummy-encoded columns as floats, rather than integers (:issue:`8725`)


not a bug fix, more of an API change (not implemented before); I would make a sub section to highlight

jorisvandenbossche · 2016-09-01T08:56:35Z

Looks good to me

Closes pandas-dev#8725 Ensures that get_dummies on a DataFrame whose output is a mix of floats / ints & dummy-encoded columns doesn't coerce the dummy-encoded cols from uint8 to ints / floats.

jreback · 2016-09-02T11:31:00Z

thanks!

jreback · 2016-09-02T11:32:23Z

this makes this very memory efficient now! (and of course sparse is even better)
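To put a number on the memory claim, here is a small comparison sketch. It assumes a pandas version where `get_dummies` accepts `dtype`, and the label series is made up; the sparse path is left out to keep the arithmetic easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 1000 labels drawn from 4 categories.
labels = pd.Series(list("abcd") * 250)

as_uint8 = pd.get_dummies(labels, dtype=np.uint8)
as_float = pd.get_dummies(labels, dtype=np.float64)

# Bytes used by the dummy columns only (index excluded).
uint8_bytes = int(as_uint8.memory_usage(index=False).sum())
float_bytes = int(as_float.memory_usage(index=False).sum())

print(uint8_bytes, float_bytes)
```

Each dummy cell takes 1 byte as uint8 versus 8 bytes as float64, so the dense result is 8 times smaller for the same data.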

TomAugspurger added the Reshaping, Dtype Conversions, and Sparse labels Jul 26, 2016

TomAugspurger added this to the 0.19.0 milestone Jul 26, 2016

jreback reviewed Jul 26, 2016

jreback modified the milestones: 0.20.0, 0.19.0 Aug 18, 2016

TomAugspurger force-pushed the get_dummies_dtype branch 3 times, most recently from bdfd72e to b48935c Compare August 31, 2016 17:46

TomAugspurger reviewed Aug 31, 2016

jreback reviewed Aug 31, 2016

TomAugspurger force-pushed the get_dummies_dtype branch from b48935c to 7a9171a Compare August 31, 2016 21:35

jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 Sep 1, 2016

BUG: int dtype for get_dummies

cace0f7

Closes pandas-dev#8725 Ensures that get_dummies on a DataFrame whose output is a mix of floats / ints & dummy-encoded columns doesn't coerce the dummy-encoded cols from uint8 to ints / floats.

TomAugspurger force-pushed the get_dummies_dtype branch from 7a9171a to cace0f7 Compare September 1, 2016 12:36

jreback closed this in ccec504 Sep 2, 2016

TomAugspurger deleted the get_dummies_dtype branch April 5, 2017 02:08

Lunran mentioned this pull request Sep 10, 2017

Output from get_dummies() should default to np.int8 #10708

Closed
