API: common dtype for bool + numeric: upcast to object or coerce to numeric? #39817
Comments
A non-concat example where this also gives a slightly strange result: when doing a reduction, we consider boolean columns to be numeric.
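The original snippet is not preserved here; a hedged sketch of the kind of reduction being described:

```python
# Illustrative sketch (not the original snippet): pandas treats boolean
# columns as numeric for reductions, so they are included in aggregations.
import pandas as pd

df = pd.DataFrame({"flag": [True, False, True], "x": [1, 2, 3]})
totals = df.sum(numeric_only=True)
# the bool column is summed as 1/0 values, so totals["flag"] == 2
```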
See e.g. #39607

+1 on coerce to numeric; upcasting to object seems very odd to me here.
The case in which coercing to numeric is weird is:
I'd much rather retain True and False as not-floats, which (until Boolean becomes the default) means casting to object.
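A hypothetical illustration of that weird case (the original snippet isn't preserved, and the resulting dtype is version-dependent, so no result is hard-coded):

```python
# Hypothetical illustration: if bool + NaN were strictly coerced to float,
# True/False would become 1.0/0.0 and lose their boolean meaning.
import numpy as np
import pandas as pd

bools = pd.Series([True, False, True])
with_nan = pd.concat([bools, pd.Series([np.nan])], ignore_index=True)
# historically pandas upcasts to object here, keeping True/False intact
```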
I think we should not cast bool to float; you are losing info here. NumPy does this because they don't have a nullable type. We should just cast to object if needed now (and we recently fixed this in master).
Agreed with the first clause, but think we should value API consistency (fewer surprises) in the meantime. If you change to
should the dtype here be object or float? I've always expected float, and if that is correct, is there a reason bool should be different?
That would change to float. Changing 0 and 1 to 0.0 and 1.0 doesn't semantically change their meanings the way it does for bools.
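The int + float analogue from this exchange can be sketched as (an assumed reconstruction, not the original snippet):

```python
# int64 + float64 promotes to float64; 0 and 1 become 0.0 and 1.0
# without changing their numeric meaning.
import numpy as np
import pandas as pd

ints = pd.Series([0, 1], dtype="int64")
floats = pd.Series([0.5], dtype="float64")
out = pd.concat([ints, floats], ignore_index=True)
assert out.dtype == np.dtype("float64")
```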
I think the main reason that we are currently casting to object (in some cases) is indeed the "missing values" issue, as brought up. We generally consider object dtype with bools + NaN as bool-like. And so eg reindexing / alignment will give object dtype (and not float, although we are inserting np.nan):
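A sketch of the reindexing/alignment behaviour being described (reconstructed, not the original snippet):

```python
# Reindexing a bool Series so that a missing label is introduced:
# the result is object dtype holding True/False/NaN, not float64.
import pandas as pd

s = pd.Series([True, False], index=[0, 1])
realigned = s.reindex([0, 1, 2])  # label 2 has no value -> NaN inserted
# because the values stay bool-like, the missing entry can be filled
# and the result reused as a boolean mask:
mask = realigned.fillna(False).astype(bool)
```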
and with the above you can then fill missing values and e.g. further use it for boolean indexing. If we would strictly cast to float because of the NaN, that would no longer be possible (you can't do boolean indexing with bool-like floats). A potential option could therefore be to special-case NaN (until we have proper missing-value support with the "boolean" dtype): continue to cast to object when inserting NaN missing values or concatting with all-NaN values, but change to cast to float when you are concatting with actual float values. That would then e.g. solve this discrepancy (bool + int gives int, but bool + float gives object):
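The discrepancy can be sketched as follows (dtype outcomes are version-dependent, hence no hard-coded results):

```python
# bool + int vs bool + float concat: at the time of this issue, the former
# coerced to numeric while the latter upcast to object.
import numpy as np
import pandas as pd

bools = pd.Series([True, False])
with_int = pd.concat([bools, pd.Series([1, 2])], ignore_index=True)
with_float = pd.concat([bools, pd.Series([1.5])], ignore_index=True)
```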
Of course, special-casing NaN is also certainly not great, but that is something we already do in various places (e.g. when concatting we already check the all-NaN case in various situations for other dtypes).
I'm running into this in #45061. I'd expect the result
We currently have an inconsistency in how we determine the common dtype for bool + numeric.
Numpy coerces booleans to numeric values when combining with numeric dtype:
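For example (a minimal sketch of NumPy's promotion rules):

```python
# NumPy promotes bool to whichever numeric dtype it is combined with.
import numpy as np

assert np.result_type(np.bool_, np.int64) == np.int64
assert np.result_type(np.bool_, np.float64) == np.float64

combined = np.concatenate(
    [np.array([True, False]), np.array([1, 2], dtype=np.int64)]
)
# True/False become 1/0: combined is int64 [1, 0, 1, 2]
```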
In pandas, Series does the same:
except if they are empty, then we ensure the result is object dtype:
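Both cases can be sketched as follows (behaviour as described at the time of the report; later pandas versions may differ, so results aren't hard-coded):

```python
import numpy as np
import pandas as pd

# non-empty bool + int Series: coerced to numeric at the time of the report
non_empty = pd.concat([pd.Series([True]), pd.Series([1, 2])],
                      ignore_index=True)

# empty bool + int Series: result forced to object dtype
empty = pd.concat([pd.Series([], dtype=bool),
                   pd.Series([], dtype="int64")], ignore_index=True)
```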
And for DataFrame we return object dtype in all cases:
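A sketch of the DataFrame case (an assumed reconstruction of the missing snippet):

```python
import numpy as np
import pandas as pd

out = pd.concat(
    [pd.DataFrame({"a": [True, False]}), pd.DataFrame({"a": [1, 2]})],
    ignore_index=True,
)
# as described above, the DataFrame path upcast to object at the time
# of the report
```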
For the nullable dtypes, we also have implemented this a bit inconsistently up to now:
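A sketch of the nullable-dtype behaviour described below (results at the time of the report; later versions may differ, so only shapes are asserted):

```python
import pandas as pd

bools = pd.Series([True, False], dtype="boolean")
with_int = pd.concat([bools, pd.Series([1, 2], dtype="Int64")],
                     ignore_index=True)
with_float = pd.concat([bools, pd.Series([1.5], dtype="Float64")],
                       ignore_index=True)
# at the time of the report: with_int came back as Int64, while
# with_float fell back to object
```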
So here we preserve numeric dtype for Integer, but convert to object for Float. The reason for this is that IntegerDtype._get_common_dtype handles the case of boolean dtype and then uses the numpy rules to determine the result dtype, while FloatingDtype doesn't yet handle non-float dtypes and thus results in object dtype for anything else (also for the float numpy dtype, which is obviously a bug / missing feature).

Basically, we need to decide what the desired behaviour is for the bool + numeric dtype combination: coerce to numeric or upcast to object? (And then fix the inconsistencies according to the decided rule.)