-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Make ExtensionArray:_can_hold_na a class attribute #20815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
42101d8
f00408e
74df86d
9b04368
8105751
0037881
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,135 @@ | ||
import decimal | ||
|
||
import numpy as np | ||
import pandas as pd | ||
import pandas.util.testing as tm | ||
import pytest | ||
|
||
from pandas.tests.extension import base | ||
|
||
from .array import DecimalDtype, DecimalNoNaArray, make_data | ||
from .test_decimal import ( | ||
BaseDecimal, TestDtype, TestInterface, TestConstructors, | ||
TestReshaping, TestGetitem, TestMissing, TestCasting, | ||
TestGroupby) | ||
|
||
|
||
@pytest.fixture | ||
def dtype(): | ||
return DecimalDtype() | ||
|
||
|
||
@pytest.fixture | ||
def data(): | ||
return DecimalNoNaArray(make_data()) | ||
|
||
|
||
@pytest.fixture | ||
def data_missing(): | ||
pytest.skip("No missing data tests for _can_hold_na=False") | ||
|
||
|
||
@pytest.fixture | ||
def data_for_sorting(): | ||
return DecimalNoNaArray([decimal.Decimal('1'), | ||
decimal.Decimal('2'), | ||
decimal.Decimal('0')]) | ||
|
||
|
||
@pytest.fixture | ||
def data_missing_for_sorting(): | ||
pytest.skip("No missing data tests for _can_hold_na=False") | ||
|
||
|
||
@pytest.fixture | ||
def na_cmp(): | ||
pytest.skip("No missing data tests for _can_hold_na=False") | ||
|
||
|
||
@pytest.fixture | ||
def na_value(): | ||
pytest.skip("No missing data tests for _can_hold_na=False") | ||
|
||
|
||
@pytest.fixture | ||
def data_for_grouping(): | ||
b = decimal.Decimal('1.0') | ||
a = decimal.Decimal('0.0') | ||
c = decimal.Decimal('2.0') | ||
d = decimal.Decimal('-1.0') | ||
return DecimalNoNaArray([b, b, d, d, a, a, b, c]) | ||
|
||
|
||
class TestNoNaDtype(TestDtype): | ||
pass | ||
|
||
|
||
class TestNoNaInterface(TestInterface): | ||
pass | ||
|
||
|
||
class TestNoNaConstructors(TestConstructors): | ||
pass | ||
|
||
|
||
class TestNoNaReshaping(TestReshaping): | ||
pass | ||
|
||
|
||
class TestNoNaGetitem(TestGetitem): | ||
pass | ||
|
||
|
||
class TestNoNaMissing(TestMissing): | ||
pass | ||
|
||
|
||
class TestNoNaMethods(BaseDecimal, base.BaseMethodsTests): | ||
def test_factorize(self, data_for_grouping, na_sentinel=None): | ||
labels, uniques = pd.factorize(data_for_grouping) | ||
expected_labels = np.array([0, 0, 1, | ||
1, 2, 2, 0, 3], | ||
dtype=np.intp) | ||
expected_uniques = data_for_grouping.take([0, 2, 4, 7]) | ||
|
||
tm.assert_numpy_array_equal(labels, expected_labels) | ||
self.assert_extension_array_equal(uniques, expected_uniques) | ||
|
||
|
||
class TestNoNaCasting(TestCasting): | ||
pass | ||
|
||
|
||
class TestNoNaGroupby(TestGroupby): | ||
pass | ||
|
||
|
||
def test_series_constructor_with_same_dtype_ok(): | ||
arr = DecimalNoNaArray([decimal.Decimal('10.0')]) | ||
result = pd.Series(arr, dtype=DecimalDtype()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is maybe something we need to discuss, as in the |
||
expected = pd.Series(arr) | ||
tm.assert_series_equal(result, expected) | ||
|
||
|
||
def test_series_constructor_coerce_extension_array_to_dtype_raises(): | ||
arr = DecimalNoNaArray([decimal.Decimal('10.0')]) | ||
xpr = r"Cannot specify a dtype 'int64' .* \('decimal'\)." | ||
|
||
with tm.assert_raises_regex(ValueError, xpr): | ||
pd.Series(arr, dtype='int64') | ||
|
||
|
||
def test_dataframe_constructor_with_same_dtype_ok(): | ||
arr = DecimalNoNaArray([decimal.Decimal('10.0')]) | ||
|
||
result = pd.DataFrame({"A": arr}, dtype=DecimalDtype()) | ||
expected = pd.DataFrame({"A": arr}) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_dataframe_constructor_with_different_dtype_raises(): | ||
arr = DecimalNoNaArray([decimal.Decimal('10.0')]) | ||
|
||
xpr = "Cannot coerce extension array to dtype 'int64'. " | ||
with tm.assert_raises_regex(ValueError, xpr): | ||
pd.DataFrame({"A": arr}, dtype='int64') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Those constructor tests are not specific to the "nona", so I would move them to a separate file There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just copied them from test_decimal.py and changed the array type. I guess we need something in the base tests. I'll look at that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question is what this should do if
_can_hold_na=False
? And in general with -1 indexers (egreindex
,concat
with realignment, ...)I suppose to raise an error? Or coerce to object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that a design decision that sits above this particular test code? Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe, but then I am just wondering what this design decision is (and we should also test that here).
As far as I know, this is rather new terrain (in pandas we only have ints, and they are upcasted to float), so it might be we still need to discuss/decide on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have a BoolBlock that cannot hold NaNs, so a boolean Series gets converted to object Series when NaNs are introduced:
So this is some prior art
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So by the statement "we still need to discuss/decide on this", I presume that the core dev team will make that call. Right? (FWIW, I think I may use
_can_hold_na=True
, and since I'm storing objects, I'll just use NaN to be the na_value that is then used inreindex
,shift
, etc.), and I'd want the type of the resulting array to remain in my subclass ofExtensionArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is an open discussion, and your voice in this discussion is important since you are actually one of the first people trying to use this (both Tom and I do not use
_can_hold_na=False
in our applications).But if you would also not use this, we should maybe consider removing the feature for now (meaning this will always be True) until somebody comes up with a good use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I've come full circle on this. In my case, I thought that I would not want NA in my arrays. My objects stored in the arrays have no NA value that makes sense. And then when you brought up
reindex
as an example, I realized that I would use things likeshift
and would want the NA to be filled in, but keeping theExtensionArray
class as the holder, meaning that I need_can_hold_na
as True. I'd wantExtensionArray
as holding it so that arithmetic would work, as coercing to anobject
array would mess up the arithmetic. So I'd have no problem if we dropped_can_hold_na
from theExtensionArray
design, and tell people that they must define a _na_value(), which is then used as the "fill in" when we do things likereindex
andshift
. And because there are pandas operations that need some NA value for fill in purposes, requiring it onExtensionArray
shouldn't be an issue (IMHO).If we agree that getting rid of
_can_hold_na
is the thing to do, I'll modify this PR to totally get rid of it (and the associated tests).