Skip to content

BUG: process Int64 as ints for preservable ops, not as float64 #32652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

qwhelan
Copy link
Contributor

@qwhelan qwhelan commented Mar 12, 2020

Calling .min() or .max() on pd.Int64 data returns integers, but uses float64 as an intermediate representation, leading to data corruption.

This PR attempts to whitelist them to be processed entirely as Int64s. Also add a test that verifies this behavior, as existing tests failed to detect the bug.

@jorisvandenbossche
Copy link
Member

@qwhelan thanks for looking into this!

I am personally not fully convinced this is the best approach long term (I rather see us looking into #30982, but there is still discussion on the way forward). So short term this can certainly fix the issue at hand.

@qwhelan qwhelan force-pushed the int64_preserved branch 4 times, most recently from 7538352 to 9804f2c Compare March 14, 2020 20:06
@qwhelan
Copy link
Contributor Author

qwhelan commented Mar 14, 2020

@jorisvandenbossche Yeah, I agree this isn't the long-term approach. The issue is that df[df[col] == df[col].max()] is a fairly popular idiom and fails when the left side is an Int64 and the right side is a float64 (if not for type reasons, for the latter losing precision).

@jreback
Copy link
Contributor

jreback commented Mar 17, 2020

so would be ok with this for only min/max right now.

@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Mar 17, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this invades way toomuch on group/frame ops. this should all be pushed back down to the EA ops. So this might require multiple PRs to handle this.

@qwhelan
Copy link
Contributor Author

qwhelan commented Mar 21, 2020

@jreback Updated to the bare minimum to fix IntegerArray.min/max

@jreback
Copy link
Contributor

jreback commented Mar 26, 2020

we are likey to merge: #30982 shortly. Happy to take this as a followon using that approach.

@jbrockmendel
Copy link
Member

@qwhelen does #30982 affect the implementation here?

@qwhelan
Copy link
Contributor Author

qwhelan commented Apr 18, 2020

@jbrockmendel Looks like this is unnecessary now, closing

@qwhelan qwhelan closed this Apr 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame with Int64 columns casts to float64 with .max()/.min()
4 participants