
API: common dtype for bool + numeric: upcast to object or coerce to numeric? #39817


Closed
jorisvandenbossche opened this issue Feb 15, 2021 · 8 comments · Fixed by #45101
Labels: API - Consistency, Deprecate, Dtype Conversions

Comments

@jorisvandenbossche
Member

We currently have an inconsistency in how we determine the common dtype for bool + numeric.

Numpy coerces booleans to numeric values when combining with numeric dtype:

>>> np.concatenate([np.array([True]), np.array([1])])
array([1, 1])
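NumPy's promotion rule can be checked directly with np.result_type (shown purely as an illustration of the behaviour above):

```python
import numpy as np

# NumPy treats bool as a numeric kind, so bool + numeric promotes
# to the numeric dtype rather than to object.
print(np.result_type(np.dtype(bool), np.dtype("int64")))    # int64
print(np.result_type(np.dtype(bool), np.dtype("float64")))  # float64
```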

In pandas, Series does the same:

>>> pd.concat([pd.Series([True], dtype=bool), pd.Series([1.0], dtype=float)])
0    1.0
0    1.0
dtype: float64

except if they are empty, then we ensure the result is object dtype:

>>> pd.concat([pd.Series([], dtype=bool), pd.Series([], dtype=float)])
Series([], dtype: object)

And for DataFrame we return object dtype in all cases:

>>> pd.concat([pd.DataFrame({'a': np.array([], dtype=bool)}), pd.DataFrame({'a': np.array([], dtype=float)})]).dtypes
a    object
dtype: object
>>> pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}), pd.DataFrame({'a': np.array([1.0], dtype=float)})]).dtypes
a    object
dtype: object

For the nullable dtypes, the current behaviour is also a bit inconsistent:

>>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Int64")])
0    1
0    1
dtype: Int64

>>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Float64")])
0    True
0     1.0
dtype: object

So here we preserve the numeric dtype for Integer but convert to object for Float. The reason is that IntegerDtype._get_common_dtype handles the case of boolean dtype and then uses the numpy rules to determine the result dtype, while FloatingDtype doesn't yet handle non-float dtypes and thus results in object dtype for anything else (also for numpy float dtype, which is obviously a bug / missing feature).


Basically we need to decide what the desired behaviour is for the bool + numeric dtype combination: coerce to numeric or upcast to object? (and then fix the inconsistencies according to the decided rule)

@jorisvandenbossche added the Dtype Conversions and API - Consistency labels on Feb 15, 2021
@jorisvandenbossche
Member Author

A non-concat example where this also gives a slightly strange result: when doing a reduction, we consider boolean columns to be numeric (numeric_only=True includes them). But for some aggregations, you can then end up with an object-dtype result, even though you explicitly asked for numeric_only:

In [28]: pd.DataFrame({'a': [True, False], 'b': [1, 2]}).min(numeric_only=True)
Out[28]: 
a    False
b        1
dtype: object

See eg #39607

@rhshadrach
Member

+1 on coerce to numeric, upcasting to object seems very odd to me here.

@jbrockmendel
Member

> +1 on coerce to numeric, upcasting to object seems very odd to me here.

The case in which coercing to numeric is weird is:

ser = pd.Series([True, False, True], dtype=bool)
ser[0] = np.nan

>>> ser
0    NaN
1    0.0
2    1.0
dtype: float64

I'd much rather retain True and False as not-floats, which (until Boolean becomes the default) means casting to object.
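A small sketch of what that looks like in practice: with object dtype the remaining values stay genuine booleans after a NaN is inserted (illustrative only):

```python
import numpy as np
import pandas as pd

# Casting to object (rather than float) keeps True/False as real bools.
ser = pd.Series([True, False, True], dtype=object)
ser.iloc[0] = np.nan

print(ser.dtype)          # object
print(type(ser.iloc[1]))  # <class 'bool'>
```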

@jreback
Contributor

jreback commented Feb 16, 2021

I think we should not cast bool to float; you are losing info here.

NumPy does this because they don't have a nullable type.

We should just cast to object if needed for now (and we recently fixed this in master).

@rhshadrach
Member

> I'd much rather retain True and False as not-floats, which (until Boolean becomes the default) means casting to object.

Agreed with the first clause, but I think we should value API consistency (fewer surprises) in the meantime. If you change this to

ser = pd.Series([1, 0, 1], dtype=int)
ser[0] = np.nan

should the dtype here be object or float? I've always expected float, and if that is correct, is there a reason bool should be different?

@jbrockmendel
Member

That would change to float. Changing 0 and 1 to 0.0 and 1.0 doesn't semantically change their meaning the way it does for bools.

@jorisvandenbossche
Member Author

I think the main reason that we are currently casting to object (in some cases) is indeed the "missing values" issue, as brought up. We generally consider object dtype with bools + NaN as bool-like.

And so eg reindexing / alignment will give object dtype (and not float, although we are inserting np.nan):

In [72]: pd.Series([True, False]).reindex([1, 2])
Out[72]: 
1    False
2      NaN
dtype: object

and with the above you can then fill the missing values and eg further use the result for boolean indexing. If we strictly cast to float because of the NaN, that would no longer be possible (you can't do boolean indexing with bool-like floats).
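For example, a sketch of the fill-then-index workflow described above (illustrative, using current behaviour):

```python
import pandas as pd

data = pd.Series([10, 20, 30])

# Reindexing a bool Series introduces NaN and upcasts to object, not float.
mask = pd.Series([True, False]).reindex([0, 1, 2])
print(mask.dtype)  # object

# The object-dtype mask can be filled and then used for boolean indexing;
# a float mask of 1.0 / 0.0 / NaN could not be used this way.
print(data[mask.fillna(False)])
```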

So a potential option could be to special-case NaN (until we have proper missing value support with the "boolean" dtype): continue to cast to object when inserting NaN missing values or concatenating with all-NaNs, but cast to float when concatenating with actual float values.

That would then eg solve this discrepancy (bool + int gives int, but bool + float gives object):

In [78]: pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}),
    ...:            pd.DataFrame({'a': np.array([1], dtype=int)})])
Out[78]: 
   a
0  1
0  1

In [79]: pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}),
    ...:            pd.DataFrame({'a': np.array([1.0], dtype=float)})])
Out[79]: 
      a
0  True
0   1.0

Of course, special-casing NaN is certainly not great either, but it is something we already do in various places (eg when concatenating we already check the all-NaN case in various situations for other dtypes).
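The proposed rule could be sketched as follows (a hypothetical helper for illustration, not pandas code):

```python
import numpy as np

def proposed_common_dtype(float_values):
    """Hypothetical sketch of the proposal above: bool combined with an
    all-NaN float block keeps object dtype (bool-like with missing values),
    while bool combined with actual float values coerces to float."""
    arr = np.asarray(float_values, dtype="float64")
    if np.isnan(arr).all():
        return np.dtype(object)
    return np.result_type(np.dtype(bool), arr.dtype)

print(proposed_common_dtype([np.nan, np.nan]))  # object
print(proposed_common_dtype([1.0, 2.0]))        # float64
```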

@jbrockmendel
Member

I'm running into this in #45061.

I'd expect the result concat([obj_with_dtype1, obj_with_dtype2]) to have dtype matching pd.core.dtypes.cast.find_common_type([dtype1, dtype2]), which for bool+int gives object dtype.
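For reference, that internal helper (a private API, so subject to change) can be checked directly; it indeed returns object for bool + int, unlike NumPy's own promotion:

```python
import numpy as np
from pandas.core.dtypes.cast import find_common_type

# pandas' promotion rule gives object for bool + int64,
# whereas np.result_type would give int64.
print(find_common_type([np.dtype(bool), np.dtype("int64")]))  # object
```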
