-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Make ExtensionArray:_can_hold_na a class attribute #20815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
na_cmp(result[0], na_value) | ||
if data._can_hold_na: | ||
result = empty.take([-1]) | ||
na_cmp(result[0], na_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question is what this should do if _can_hold_na=False
? And in general with -1 indexers (eg reindex
, concat
with realignment, ...)
I suppose to raise an error? Or coerce to object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that a design decision that sits above this particular test code? Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe, but then I am just wondering what this design decision is (and we should also test that here).
As far as I know, this is rather new terrain (in pandas we only have ints, and they are upcasted to float), so it might be we still need to discuss/decide on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have a BoolBlock that cannot hold NaNs, so a boolean Series gets converted to object Series when NaNs are introduced:
In [49]: pd.Series([True, False])
Out[49]:
0 True
1 False
dtype: bool
In [50]: pd.Series([True, False]).reindex([1, 2])
Out[50]:
1 False
2 NaN
dtype: object
So this is some prior art
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So by the statement "we still need to discuss/decide on this", I presume that the core dev team will make that call. Right? (FWIW, I think I may use _can_hold_na=True
, and since I'm storing objects, I'll just use NaN to be the na_value that is then used in reindex
, shift
, etc.), and I'd want the type of the resulting array to remain in my subclass of ExtensionArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So by the statement "we still need to discuss/decide on this", I presume that the core dev team will make that call. Right?
No, this is an open discussion, and your voice in this discussion is important since you are actually one of the first people trying to use this (both Tom and I do not use _can_hold_na=False
in our applications).
But if you would also not use this, we should maybe consider removing the feature for now (meaning this will always be True) until somebody comes up with a good use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I've come full circle on this. In my case, I thought that I would not want NA in my arrays. My objects stored in the arrays have no NA value that makes sense. And then when you brought up reindex
as an example, I realized that I would use things like shift
and would want the NA to be filled in, but keeping the ExtensionArray
class as the holder, meaning that I need _can_hold_na
as True. I'd want ExtensionArray
as holding it so that arithmetic would work, as coercing to an object
array would mess up the arithmetic. So I'd have no problem if we dropped _can_hold_na
from the ExtensionArray
design, and tell people that they must define a _na_value(), which is then used as the "fill in" when we do things like reindex
and shift
. And because there are pandas operations that need some NA value for fill in purposes, requiring it on ExtensionArray
shouldn't be an issue (IMHO).
If we agree that getting rid of _can_hold_na
is the thing to do, I'll modify this PR to totally get rid of it (and the associated tests).
|
||
xpr = "Cannot coerce extension array to dtype 'int64'. " | ||
with tm.assert_raises_regex(ValueError, xpr): | ||
pd.DataFrame({"A": arr}, dtype='int64') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those constructor tests are not specific to the "nona", so I would move them to a separate file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied them from test_decimal.py and changed the array type. I guess we need something in the base tests. I'll look at that.
|
||
def test_series_constructor_with_same_dtype_ok(): | ||
arr = DecimalNoNaArray([decimal.Decimal('10.0')]) | ||
result = pd.Series(arr, dtype=DecimalDtype()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe something we need to discuss, as in the take
PR Tom was adding an _na_value
to the dtype itself. We should see how this interacts
Let's put the discussion here to not hide it in a review comment: So I questioned what should happen when NAs are introduced (eg reindex) in ExtensionArray that has Currently, in pandas we have ints, and they are upcasted to float. We also have a BoolBlock that cannot hold NaNs, so a boolean Series gets converted to object Series when NaNs are introduced:
So this is some prior art. But it would also mean you loose your ExtensionArray/ExtensionDtype information. So remark: if we currenlty don't have users of @Dr-Irv :
|
i would not add API unless we have a need for it |
Closing due to new PR #20819 |
_can_hold_na==False
git diff upstream/master -u -- "*.py" | flake8 --diff
ExtensionArray
interfaceSupersedes #20801 by removing ops functionality from there.