BUG: process Int64 as ints for preservable ops, not as float64 #32652

qwhelan · 2020-03-12T08:35:13Z

Calling .min() or .max() on pd.Int64 data returns integers but uses float64 as an intermediate representation, which corrupts any value that float64 cannot represent exactly (magnitudes above 2**53).

This PR whitelists these operations so they are processed entirely as Int64. It also adds a test that verifies this behavior, as the existing tests failed to detect the bug.
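The precision loss described above can be reproduced without pandas. The following is a minimal pure-Python sketch, not the pandas code path; the helper names `max_via_float64` and `max_as_int` are made up for illustration, and it relies only on Python floats being IEEE 754 doubles.

```python
# Illustration only: these helpers are hypothetical and are not pandas internals.

def max_via_float64(values):
    """Reduce through a float intermediate, then cast the result back to int."""
    return int(max(float(v) for v in values))


def max_as_int(values):
    """Reduce while staying in integer arithmetic the whole time."""
    return max(values)


# 2**53 + 1 is the smallest positive integer a float64 cannot represent exactly.
values = [2**53, 2**53 + 1]

lossy = max_via_float64(values)
exact = max_as_int(values)

print(lossy)  # 9007199254740992
print(exact)  # 9007199254740993
```

The float round trip still returns an integer, so the result looks plausible even though it is off by one.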

closes #32651 (DataFrame with Int64 columns casts to float64 with .max()/.min())
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pandas/core/common.py

jorisvandenbossche · 2020-03-13T14:49:55Z

@qwhelan thanks for looking into this!

I am personally not fully convinced this is the best approach long term (I would rather see us look into #30982, but there is still discussion on the way forward). So in the short term this can certainly fix the issue at hand.

qwhelan · 2020-03-14T20:12:34Z

@jorisvandenbossche Yeah, I agree this isn't the long-term approach. The issue is that df[df[col] == df[col].max()] is a fairly popular idiom, and it fails when the left side is an Int64 and the right side is a float64 (if not because of the type mismatch, then because the float64 has lost precision).
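A rough pure-Python analogue of that idiom is sketched below. It is not pandas code, and the variable names are illustrative. Python compares an int with a float exactly, so here the symptom is that the wrong row is selected; the comparison semantics in NumPy or pandas may differ.

```python
# Illustration only: a plain list stands in for an Int64 column.
rows = [10, 2**53, 2**53 + 1]

# Maximum computed through a float intermediate (the lossy path).
float_max = max(float(v) for v in rows)

# Maximum computed entirely with integers.
int_max = max(rows)

# Analogue of df[df[col] == df[col].max()] for each kind of maximum.
selected_with_float = [v for v in rows if v == float_max]
selected_with_int = [v for v in rows if v == int_max]

print(selected_with_float)  # [9007199254740992], which is not the true maximum
print(selected_with_int)    # [9007199254740993]
```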

pandas/core/groupby/ops.py

jreback · 2020-03-17T02:48:58Z

So I would be OK with this for only min/max right now.

jreback

This invades way too much on the group/frame ops. This should all be pushed back down to the EA ops, so it might require multiple PRs to handle.
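One way to read "pushed back down to the EA ops" is that the array type owns its reductions, so callers never need to convert to float64. The sketch below assumes a toy container named `MaskedInts`; it is hypothetical and is not the pandas IntegerArray implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MaskedInts:
    """Hypothetical stand-in for a nullable integer array (not pandas code)."""

    values: List[int]
    mask: List[bool]  # True marks a missing entry

    def _valid(self) -> List[int]:
        # Keep only the entries that are not masked out.
        return [v for v, m in zip(self.values, self.mask) if not m]

    def max(self) -> Optional[int]:
        # Reduce over integers directly; None stands in for "all missing".
        valid = self._valid()
        return max(valid) if valid else None

    def min(self) -> Optional[int]:
        valid = self._valid()
        return min(valid) if valid else None


arr = MaskedInts(values=[2**53 + 1, 0, 7], mask=[False, True, False])
print(arr.max())  # 9007199254740993
print(arr.min())  # 7 (the 0 is masked out)
```

Because the mask is handled inside the container, frame-level or groupby-level code in this sketch would only call `arr.max()` and would not need a whitelist of its own.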

qwhelan · 2020-03-21T06:40:53Z

@jreback Updated to the bare minimum to fix IntegerArray.min/max

jreback · 2020-03-26T01:50:23Z

We are likely to merge #30982 shortly. Happy to take this as a follow-on using that approach.

jbrockmendel · 2020-04-16T19:26:04Z

@qwhelan does #30982 affect the implementation here?

qwhelan · 2020-04-18T02:46:09Z

@jbrockmendel Looks like this is unnecessary now, closing.

jbrockmendel reviewed Mar 12, 2020

pandas/core/common.py

qwhelan force-pushed the int64_preserved branch 4 times, most recently from 7538352 to 9804f2c on March 14, 2020 20:06

jbrockmendel reviewed Mar 14, 2020

pandas/core/groupby/ops.py

qwhelan force-pushed the int64_preserved branch from 9804f2c to a2aaa3d on March 14, 2020 20:13

jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Mar 17, 2020

jreback requested changes Mar 17, 2020

BUG: process Int64 as ints for preservable ops, not as float64

db1344a

qwhelan force-pushed the int64_preserved branch from a2aaa3d to db1344a on March 21, 2020 06:21

qwhelan closed this Apr 18, 2020
