Skip to content

BUG: Make ExtensionArray:_can_hold_na a class attribute #20815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Apr 24, 2018

  • closes ExtensionArray._can_hold_na needs to be a class attribute #20761
  • tests added / passed
    • pandas/tests/extension/decimal/test_nona.py
      • This tests an example where _can_hold_na==False
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
    • Not included as it is a fix to the new ExtensionArray interface

Supersedes #20801 by removing ops functionality from there.

na_cmp(result[0], na_value)
if data._can_hold_na:
result = empty.take([-1])
na_cmp(result[0], na_value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question is what this should do if _can_hold_na=False ? And in general with -1 indexers (eg reindex, concat with realignment, ...)
I suppose to raise an error? Or coerce to object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that a design decision that sits above this particular test code? Or am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe, but then I am just wondering what this design decision is (and we should also test that here).

As far as I know, this is rather new terrain (in pandas we only have ints, and they are upcasted to float), so it might be we still need to discuss/decide on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have a BoolBlock that cannot hold NaNs, so a boolean Series gets converted to object Series when NaNs are introduced:

In [49]: pd.Series([True, False])
Out[49]: 
0     True
1    False
dtype: bool

In [50]: pd.Series([True, False]).reindex([1, 2])
Out[50]: 
1    False
2      NaN
dtype: object

So this is some prior art

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So by the statement "we still need to discuss/decide on this", I presume that the core dev team will make that call. Right? (FWIW, I think I may use _can_hold_na=True, and since I'm storing objects, I'll just use NaN to be the na_value that is then used in reindex, shift, etc.), and I'd want the type of the resulting array to remain in my subclass of ExtensionArray

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So by the statement "we still need to discuss/decide on this", I presume that the core dev team will make that call. Right?

No, this is an open discussion, and your voice in this discussion is important since you are actually one of the first people trying to use this (both Tom and I do not use _can_hold_na=False in our applications).

But if you would also not use this, we should maybe consider removing the feature for now (meaning this will always be True) until somebody comes up with a good use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I've come full circle on this. In my case, I thought that I would not want NA in my arrays. My objects stored in the arrays have no NA value that makes sense. And then when you brought up reindex as an example, I realized that I would use things like shift and would want the NA to be filled in, but keeping the ExtensionArray class as the holder, meaning that I need _can_hold_na as True. I'd want ExtensionArray as holding it so that arithmetic would work, as coercing to an object array would mess up the arithmetic. So I'd have no problem if we dropped _can_hold_na from the ExtensionArray design, and tell people that they must define a _na_value(), which is then used as the "fill in" when we do things like reindex and shift. And because there are pandas operations that need some NA value for fill in purposes, requiring it on ExtensionArray shouldn't be an issue (IMHO).

If we agree that getting rid of _can_hold_na is the thing to do, I'll modify this PR to totally get rid of it (and the associated tests).


xpr = "Cannot coerce extension array to dtype 'int64'. "
with tm.assert_raises_regex(ValueError, xpr):
pd.DataFrame({"A": arr}, dtype='int64')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those constructor tests are not specific to the "nona", so I would move them to a separate file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied them from test_decimal.py and changed the array type. I guess we need something in the base tests. I'll look at that.


def test_series_constructor_with_same_dtype_ok():
arr = DecimalNoNaArray([decimal.Decimal('10.0')])
result = pd.Series(arr, dtype=DecimalDtype())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe something we need to discuss, as in the take PR Tom was adding an _na_value to the dtype itself. We should see how this interacts

@jorisvandenbossche
Copy link
Member

Let's put the discussion here to not hide it in a review comment:

So I questioned what should happen when NAs are introduced (eg reindex) in ExtensionArray that has _can_hold_na=False.
I suppose to raise an error? Or coerce to object?

Currently, in pandas we have ints, and they are upcasted to float. We also have a BoolBlock that cannot hold NaNs, so a boolean Series gets converted to object Series when NaNs are introduced:

In [49]: pd.Series([True, False])
Out[49]: 
0     True
1    False
dtype: bool

In [50]: pd.Series([True, False]).reindex([1, 2])
Out[50]: 
1    False
2      NaN
dtype: object

So this is some prior art. But it would also mean you loose your ExtensionArray/ExtensionDtype information.

So remark: if we currenlty don't have users of _can_hold_na=False, we should maybe consider removing the feature for now (meaning this will always be True) until somebody comes up with a good use case.

@Dr-Irv :

Well, I've come full circle on this. In my case, I thought that I would not want NA in my arrays. My objects stored in the arrays have no NA value that makes sense. And then when you brought up reindex as an example, I realized that I would use things like shift and would want the NA to be filled in, but keeping the ExtensionArray class as the holder, meaning that I need _can_hold_na as True. I'd want ExtensionArray as holding it so that arithmetic would work, as coercing to an object array would mess up the arithmetic. So I'd have no problem if we dropped _can_hold_na from the ExtensionArray design, and tell people that they must define a _na_value(), which is then used as the "fill in" when we do things like reindex and shift. And because there are pandas operations that need some NA value for fill in purposes, requiring it on ExtensionArray shouldn't be an issue (IMHO).

If we agree that getting rid of _can_hold_na is the thing to do, I'll modify this PR to totally get rid of it (and the associated tests).

cc @TomAugspurger @jreback

@jreback
Copy link
Contributor

jreback commented Apr 24, 2018

i would not add API unless we have a need for it
so yes would remove until / until is necessary

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Apr 25, 2018

Closing due to new PR #20819

@Dr-Irv Dr-Irv closed this Apr 25, 2018
@jreback jreback added this to the No action milestone Apr 25, 2018
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 25, 2018
@Dr-Irv Dr-Irv deleted the issue20761 branch June 29, 2018 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExtensionArray._can_hold_na needs to be a class attribute
3 participants