
API: common dtype for bool + numeric: upcast to object or coerce to numeric? #39817


Closed
jorisvandenbossche opened this issue Feb 15, 2021 · 8 comments · Fixed by #45101
Labels: API - Consistency, Deprecate, Dtype Conversions

Comments

@jorisvandenbossche
Member

We currently have an inconsistency in how we determine the common dtype for bool + numeric.

Numpy coerces booleans to numeric values when combining with numeric dtype:

>>> np.concatenate([np.array([True]), np.array([1])])
array([1, 1])
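NumPy's promotion rule can be checked directly with np.result_type (shown purely as an illustration of the behaviour above):

```python
import numpy as np

# NumPy treats bool as a numeric kind, so bool + numeric promotes
# to the numeric dtype rather than to object.
print(np.result_type(np.dtype(bool), np.dtype("int64")))    # int64
print(np.result_type(np.dtype(bool), np.dtype("float64")))  # float64
```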

In pandas, Series does the same:

>>> pd.concat([pd.Series([True], dtype=bool), pd.Series([1.0], dtype=float)])
0    1.0
0    1.0
dtype: float64

except if they are empty, then we ensure the result is object dtype:

>>> pd.concat([pd.Series([], dtype=bool), pd.Series([], dtype=float)])
Series([], dtype: object)

And for DataFrame we return object dtype in all cases:

>>> pd.concat([pd.DataFrame({'a': np.array([], dtype=bool)}), pd.DataFrame({'a': np.array([], dtype=float)})]).dtypes
a    object
dtype: object
>>> pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}), pd.DataFrame({'a': np.array([1.0], dtype=float)})]).dtypes
a    object
dtype: object

For the nullable dtypes, the current behaviour is also a bit inconsistent:

>>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Int64")])
0    1
0    1
dtype: Int64

>>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Float64")])
0    True
0     1.0
dtype: object

So here we preserve the numeric dtype for Integer but convert to object for Float. The reason is that IntegerDtype._get_common_dtype handles the case of boolean dtype and then uses the numpy rules to determine the result dtype, while FloatingDtype doesn't yet handle non-float dtypes and thus results in object dtype for anything else (also for numpy float dtype, which is obviously a bug / missing feature).


Basically we need to decide what the desired behaviour is for the bool + numeric dtype combination: coerce to numeric or upcast to object? (and then fix the inconsistencies according to the decided rule)

@jorisvandenbossche added the Dtype Conversions and API - Consistency labels on Feb 15, 2021
@jorisvandenbossche
Member Author

A non-concat example where this also gives a slightly strange result: when doing a reduction, we consider boolean columns to be numeric (numeric_only=True includes them). But for some aggregations, you can then end up with an object-dtype result, even though you explicitly asked for numeric_only:

In [28]: pd.DataFrame({'a': [True, False], 'b': [1, 2]}).min(numeric_only=True)
Out[28]: 
a    False
b        1
dtype: object

See eg #39607

@rhshadrach
Member

+1 on coerce to numeric, upcasting to object seems very odd to me here.

@jbrockmendel
Member

> +1 on coerce to numeric, upcasting to object seems very odd to me here.

The case in which coercing to numeric is weird is:

ser = pd.Series([True, False, True], dtype=bool)
ser[0] = np.nan

>>> ser
0    NaN
1    0.0
2    1.0
dtype: float64

I'd much rather retain True and False as not-floats, which (until Boolean becomes the default) means casting to object.
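A small sketch of what that looks like in practice: with object dtype the remaining values stay genuine booleans after a NaN is inserted (illustrative only):

```python
import numpy as np
import pandas as pd

# Casting to object (rather than float) keeps True/False as real bools.
ser = pd.Series([True, False, True], dtype=object)
ser.iloc[0] = np.nan

print(ser.dtype)          # object
print(type(ser.iloc[1]))  # <class 'bool'>
```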

@jreback
Contributor

jreback commented Feb 16, 2021

I think we should not cast bool to float; you are losing info here.

NumPy does this because they don't have a nullable type.

We should just cast to object if needed for now (and we recently fixed this in master).

@rhshadrach
Member

> I'd much rather retain True and False as not-floats, which (until Boolean becomes the default) means casting to object.

Agreed with the first clause, but I think we should value API consistency (fewer surprises) in the meantime. If you change this to

ser = pd.Series([1, 0, 1], dtype=int)
ser[0] = np.nan

should the dtype here be object or float? I've always expected float, and if that is correct, is there a reason bool should be different?

@jbrockmendel
Member

That would change to float. Changing 0 and 1 to 0.0 and 1.0 doesn't semantically change their meaning the way it does for bools.

@jorisvandenbossche
Member Author

I think the main reason that we are currently casting to object (in some cases) is indeed the "missing values" issue, as brought up. We generally consider object dtype with bools + NaN as bool-like.

And so eg reindexing / alignment will give object dtype (and not float, although we are inserting np.nan):

In [72]: pd.Series([True, False]).reindex([1, 2])
Out[72]: 
1    False
2      NaN
dtype: object

and with the above you can then fill the missing values and eg further use the result for boolean indexing. If we strictly cast to float because of the NaN, that would no longer be possible (you can't do boolean indexing with bool-like floats).
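For example, a sketch of the fill-then-index workflow described above (illustrative, using current behaviour):

```python
import pandas as pd

data = pd.Series([10, 20, 30])

# Reindexing a bool Series introduces NaN and upcasts to object, not float.
mask = pd.Series([True, False]).reindex([0, 1, 2])
print(mask.dtype)  # object

# The object-dtype mask can be filled and then used for boolean indexing;
# a float mask of 1.0 / 0.0 / NaN could not be used this way.
print(data[mask.fillna(False)])
```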

So a potential option could be to special-case NaN (until we have proper missing value support with the "boolean" dtype): continue to cast to object when inserting NaN missing values or concatenating with all-NaNs, but cast to float when concatenating with actual float values.

That would then eg solve this discrepancy (bool + int gives int, but bool + float gives object):

In [78]: pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}),
    ...:            pd.DataFrame({'a': np.array([1], dtype=int)})])
Out[78]: 
   a
0  1
0  1

In [79]: pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}),
    ...:            pd.DataFrame({'a': np.array([1.0], dtype=float)})])
Out[79]: 
      a
0  True
0   1.0

Of course, special-casing NaN is certainly not great either, but it is something we already do in various places (eg when concatenating we already check the all-NaN case in various situations for other dtypes).
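The proposed rule could be sketched as follows (a hypothetical helper for illustration, not pandas code):

```python
import numpy as np

def proposed_common_dtype(float_values):
    """Hypothetical sketch of the proposal above: bool combined with an
    all-NaN float block keeps object dtype (bool-like with missing values),
    while bool combined with actual float values coerces to float."""
    arr = np.asarray(float_values, dtype="float64")
    if np.isnan(arr).all():
        return np.dtype(object)
    return np.result_type(np.dtype(bool), arr.dtype)

print(proposed_common_dtype([np.nan, np.nan]))  # object
print(proposed_common_dtype([1.0, 2.0]))        # float64
```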

@jbrockmendel
Member

I'm running into this in #45061.

I'd expect the result concat([obj_with_dtype1, obj_with_dtype2]) to have dtype matching pd.core.dtypes.cast.find_common_type([dtype1, dtype2]), which for bool+int gives object dtype.
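For reference, that internal helper (a private API, so subject to change) can be checked directly; it indeed returns object for bool + int, unlike NumPy's own promotion:

```python
import numpy as np
from pandas.core.dtypes.cast import find_common_type

# pandas' promotion rule gives object for bool + int64,
# whereas np.result_type would give int64.
print(find_common_type([np.dtype(bool), np.dtype("int64")]))  # object
```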
