Skip to content

REF: NumericIndex.union determine casting dtype using concat logic #37258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jbrockmendel opened this issue Oct 19, 2020 · 8 comments
Open
Labels
API - Consistency Internal Consistency of API/Behavior Dtype Conversions Unexpected or buggy dtype conversions Index Related to the Index class or subclasses Refactor Internal refactoring of code setops union, intersection, difference, symmetric_difference

Comments

@jbrockmendel
Copy link
Member

ATM NumericIndex._union has special casting rules for what dtype it casts to when unioning with another NumericIndex of a different dtype.

        # Right now, we treat union(int, float) a bit special.
        # See https://github.com/pandas-dev/pandas/issues/26778 for discussion
        # We may change union(int, float) to go to object.
        # float | [u]int -> float  (the special case)
        # <T>   | <T>    -> T
        # <T>   | <U>    -> object

Instead, we should re-use code (not sure where this lives, cc @jorisvandenbossche ?) from concat (maybe even Block.putmask?) to determing the appropriate upcast dtype.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member API - Consistency Internal Consistency of API/Behavior Refactor Internal refactoring of code Index Related to the Index class or subclasses and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 19, 2020
@jorisvandenbossche
Copy link
Member

This logic lives in pandas.core.dtypes.cast.find_common_type and pandas.core.dtypes.concat._cast_to_common_type. It would indeed be good to re-use this more generally.

For the float/int combo, that already does the same as the "special case" mentioned above. Not sure if there would actually be differences in behaviour right now when switching to use this logic.

@jbrockmendel
Copy link
Member Author

Not sure if there would actually be differences in behaviour right now when switching to use this logic.

Looks like we have 7 tests in tests.indexes.test_setops that fail after this change. The tests expect object dtype but instead get float64. I'll need to look more carefully to check that these make sense.

If perf impact weren't an issue, the id want to do min/max checks to ensure that the casting being done is non-lossy. i guess the advantage of the current logic is that it is always non-lossy.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 28, 2020

Can you give an example of such a failing test where it was before object dtype and now float? For which dtypes does this happen? (Doing a union of int index with float index already gives float index on master)

@jbrockmendel
Copy link
Member Author

The affected tests are almost all in tests.indexes.test_setops. A simple example:

a = Int64Index([])
b = UInt64Index([])

res = a.union(b)  # <-- float64, test as written expects object

@jorisvandenbossche
Copy link
Member

Ah, so the int/uint combo ((u)int/float combo already gives float).

We discussed this before, specifically for indices, see eg #26778 (comment) and comments below. So the current behaviour of int + uint -> object for index is intentionally different from the behaviour for similar dtypes with Series.

So if we decide to keep this difference, then we will need to special case this in the Index set ops code, and it won't be able to fully use the find_common_type logic.

cc @TomAugspurger

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 3, 2020 via email

@jbrockmendel
Copy link
Member Author

my preference is for setops between our label containers to not lose precision

Thats reasonable. Any reason why this this preference would only apply to setops or only for Index?

@jbrockmendel jbrockmendel added the setops union, intersection, difference, symmetric_difference label Jun 17, 2021
@mroeschke mroeschke added the Dtype Conversions Unexpected or buggy dtype conversions label Aug 13, 2021
@bashtage
Copy link
Contributor

bashtage commented Sep 3, 2021

These casting values do not seem correct. i4 + i4 = i8?

import pandas as pd
import numpy as np

a = pd.DataFrame([[1],[2]],index=pd.NumericIndex(np.arange(2),dtype="i4"))
b = pd.DataFrame([[3],[4]],index=pd.NumericIndex(np.arange(2,4),dtype="i4"))

pd.concat([a,b],axis=0).index
Out[10]: Int64Index([0, 1, 2, 3], dtype='int64')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Dtype Conversions Unexpected or buggy dtype conversions Index Related to the Index class or subclasses Refactor Internal refactoring of code setops union, intersection, difference, symmetric_difference
Projects
None yet
Development

No branches or pull requests

5 participants