Diff low precision ints #45609

lukemanley · 2022-01-25T02:29:30Z

closes BUG: inconsistent upcasting dtypes between diff and shift for int8/int16 #45562
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Submitting this PR as a potential option. It changes the diff method to upcast int8/int16 to float64, similar to shift and other methods. It avoids the int8/int16 -> float64 cython issue referenced in the code by first casting int8/int16 to int32. Note that the tests were updated to expect float64 in a few cases.

If the existing behavior is preferred, I'm happy to close this. I thought the consistency and fewer special cases elsewhere might be worth the change here.
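
For readers who want to see the inconsistency from #45562 directly, a minimal sketch follows. It only inspects result dtypes, and the variable names are illustrative. The exact float dtype that diff returns may depend on the pandas version, so the comments do not assert it.

```python
import pandas as pd

# Compare the result dtypes of diff and shift for several integer dtypes.
results = {}
for name in ["int8", "int16", "int32", "int64"]:
    s = pd.Series([1, 2, 3], dtype=name)
    results[name] = (s.diff().dtype, s.shift().dtype)

for name, (diff_dtype, shift_dtype) in results.items():
    print(f"{name:>6}: diff -> {diff_dtype}, shift -> {shift_dtype}")
```

Both methods have to introduce a missing value in the first row, which is why neither can return the original integer dtype for a plain NumPy-backed Series.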

jreback · 2022-01-28T00:31:25Z

@lukemanley can you add a whatsnew note in 1.5

cc @jbrockmendel @rhshadrach any objections?

jbrockmendel · 2022-01-28T16:35:57Z

im ambivalent. being consistent with shift would be nice, but a) i dont like the extra copy this makes and b) users using small-itemsize dtypes probably want small-itemsize results
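
To put point (b) in numbers, here is a rough sketch of the memory cost, assuming a one-million-element array (an arbitrary size chosen for illustration). It uses only NumPy casts, not pandas internals.

```python
import numpy as np

n = 1_000_000  # arbitrary illustrative length
values = np.zeros(n, dtype="int8")

# Bytes needed to hold the same number of elements in each candidate dtype.
sizes = {name: values.astype(name).nbytes for name in ["int8", "float32", "float64"]}

for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes / 1e6:.0f} MB")
```

A float64 result takes eight times the memory of the int8 input, and twice that of a float32 result.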

rhshadrach · 2022-01-28T20:23:59Z

Everything being equal, I'd rather see shift support lower precision dtypes, but perhaps that isn't currently feasible? I haven't looked.

lukemanley · 2022-01-28T22:02:11Z

> im ambivalent. being consistent with shift would be nice, but a) i dont like the extra copy this makes and b) users using small-itemsize dtypes probably want small-itemsize results

Agreed, the extra copy is unfortunate.

> Everything being equal, I'd rather see shift support lower precision dtypes, but perhaps that isn't currently feasible? I haven't looked.

I think there are a number of examples beyond shift. When upcasting, most methods seem to upcast low precision ints to float64 even if a lower precision like float32 could suffice. It seems diff stands alone in this aspect. Here are two other examples where low precision ints are upcast to float64:

>>> pd.Series([1, 2, 3], dtype='int8').reindex(range(4))
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

>>> pd.Series([1, 2, 3], dtype='int8').rolling(1).max()
0    1.0
1    2.0
2    3.0
dtype: float64

jreback · 2022-01-31T00:37:04Z

can you merge master here

lukemanley · 2022-01-31T13:00:57Z

Merged main, errors look unrelated

rhshadrach · 2022-01-31T13:47:49Z

> I think there are a number of examples beyond shift. When upcasting, most methods seem to upcast low precision ints to float64 even if a lower precision like float32 could suffice. It seems diff stands alone in this aspect.

I think the consensus is that we'd like to see better support for lower precision ops. We aren't going to get there all at once, so there will be inconsistencies in support for quite some time. And we certainly aren't going to get there by taking ops that do support lower precision and removing that support for the purpose of consistency. For that reason, I'm -1 here.

jreback · 2022-01-31T13:58:44Z

> > I think there are a number of examples beyond shift. When upcasting, most methods seem to upcast low precision ints to float64 even if a lower precision like float32 could suffice. It seems diff stands alone in this aspect.

> I think the consensus is that we'd like to see better support for lower precision ops. We aren't going to get there all at once, so there will be inconsistencies in support for quite some time. And we certainly aren't going to get there by taking ops that do support lower precision and removing that support for the purpose of consistency. For that reason, I'm -1 here.

so generally I agree with you. but this little inconsistency is troubling. wouldn't it be better to make a systematic change to support this rather than doing it piecemeal?

rhshadrach · 2022-01-31T21:07:05Z

> wouldn't it be better to make a systematic change to support this rather than doing it piecemeal?

If this were possible to do all in one go, I'm all for it. But I don't think that is likely - there are a ton of ops, including indexing, that need to be changed.

lukemanley · 2022-02-03T00:42:29Z

I'm happy to close this one if the preference is to keep the current behavior.

If the current behavior is kept, diff could treat uint8 and uint16 the same way as int8 and int16 and upcast them to float32 rather than float64. Currently:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    t: np.arange(3, dtype=t) 
    for t in ['int8', 'int16', 'uint8', 'uint16']
})

df.diff(1).dtypes

int8      float32
int16     float32
uint8     float64
uint16    float64
dtype: object
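
The suggestion to treat uint8 and uint16 like int8 and int16 could be expressed roughly as follows. `result_float_dtype` is a hypothetical helper, not pandas code; it assumes the rule "integers of at most two bytes go to float32, everything else goes to float64". float32 carries 24 bits of precision, so it represents every possible difference of two 16-bit integers exactly.

```python
import numpy as np


def result_float_dtype(dtype) -> np.dtype:
    """Hypothetical rule: pick the float dtype for an integer diff result."""
    dtype = np.dtype(dtype)
    if dtype.kind in "iu" and dtype.itemsize <= 2:
        # Covers int8, int16, uint8 and uint16 alike.
        return np.dtype("float32")
    return np.dtype("float64")


for name in ["int8", "int16", "uint8", "uint16", "int32"]:
    print(name, "->", result_float_dtype(name))

# The largest-magnitude differences of two uint16 values survive a float32 round trip.
assert int(np.float32(65535)) == 65535
assert int(np.float32(-65535)) == -65535
```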

jreback · 2022-02-26T19:50:24Z

ok i guess let's close this and work on making more functions preserve dtypes.

lukemanley added 2 commits January 24, 2022 20:58

diff to upcast int8/int16 to float64 to be consistent with other methods

0311f19

Merge branch 'main' into diff-low-precision-ints

3848bdf

lukemanley mentioned this pull request Jan 25, 2022

PERF: faster groupby diff #45575

Merged

4 tasks

jreback added the Dtype Conversions (Unexpected or buggy dtype conversions) and Groupby labels Jan 28, 2022

lukemanley added 2 commits January 27, 2022 21:17

Merge branch 'main' into diff-low-precision-ints

39771df

whatsnew

6e0ccce

jreback added this to the 1.5 milestone Jan 31, 2022

Merge branch 'main' into diff-low-precision-ints

8885700

Merge branch 'main' into diff-low-precision-ints

90a435a

jreback closed this Feb 26, 2022

lukemanley deleted the diff-low-precision-ints branch March 20, 2022 23:18
