-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TST: add tests for take() on empty arrays #20582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST: add tests for take() on empty arrays #20582
Conversation
pandas/core/arrays/categorical.py
Outdated
else: | ||
raise IndexError( | ||
"cannot do a non-empty take from an empty array.") | ||
|
||
codes = take_1d(self._codes, indexer, allow_fill=True, fill_value=-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the internal use cases (Categorical at the moment, later maybe more arrays), I could also push this test to take_1d
. But, that is also used by many other things, so not sure I can just do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not be here rather in take1d (i see you comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the check a bit here (as it only needed to check the empty and not-all -1 case, as when indexer is all -1 take_1d
works as expected on empty arrays).
@jreback I am not sure if I should move it to take_1d
. It depends on the guarantees we want to give (ourselves) for take_1d
(if it should do such bounds checking). It might be that currently all codes that uses take_1d/nd
already does this bounds checking (I did not find a practical use case (apart from Categorical.take itself) where we run into this bug), and then doing another one in take_nd
would give a performance penalty.
na_cmp(result[0], na_value) | ||
|
||
with tm.assert_raises_regex(IndexError, "cannot do a non-empty take"): | ||
empty.take([0, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger I didn't find any existing tests for take
, so not sure the indexing tests is the best place (maybe rather the BaseMethodsTests
).
And, I could also add tests for the actual use cases where you get this (eg reindex
on an empty series)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the right place for testing a behavior specific to take.
indexer = np.asarray(indexer) | ||
mask = indexer == -1 | ||
|
||
# take on empty array | ||
if not len(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DecimalArray.take does not use take_1d
, so it's not possible to move it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then you need to ask yourself if this is the correct implementation. pandas implementst take_1d for exactly this reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take_1d
is not a public function. Do you want to expose it publicly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you document this behavior in ExtensionArray.take, either the docstring or as a comment?
na_cmp(result[0], na_value) | ||
|
||
with tm.assert_raises_regex(IndexError, "cannot do a non-empty take"): | ||
empty.take([0, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the right place for testing a behavior specific to take.
@@ -62,7 +62,7 @@ def __len__(self): | |||
return len(self.values) | |||
|
|||
def __repr__(self): | |||
return repr(self.values) | |||
return "DecimalArray: " + repr(self.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, meant to add this a while ago :)
Codecov Report
@@ Coverage Diff @@
## master #20582 +/- ##
=========================================
Coverage ? 91.84%
=========================================
Files ? 153
Lines ? 49279
Branches ? 0
=========================================
Hits ? 45261
Misses ? 4018
Partials ? 0
Continue to review full report at Codecov.
|
pandas/core/arrays/base.py
Outdated
return type(self)([self._na_value] * len(indexer)) | ||
else: | ||
raise IndexError( | ||
"cannot do a non-empty take from an empty array.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger Adding the full example is maybe a bit too much (it defeats a bit the purpose of the small illustrative example)? On the other hand, it is needed for a correct implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed... I think that just having a comment noting the correct thing to do for .take
on an empty EA is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger I added some explanation in the docstring, but I decided to keep the code example as well: it is now a bit shorter + it is actually really needed to get a correct implementation.
pandas/core/arrays/base.py
Outdated
# only valid if result is an all-missing array | ||
if mask.all(): | ||
return type(self)([self._na_value] * len(indexer)) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need the else, just raise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the 'else' clause is indeed not essential. I adapted the error message slightly ("from empty axes" -> "from empty array") because I did not find it the best for ExtensionArrays, but given the minor change I can certainly leave it out for code simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls make this change.
also combine the comments, this doesn't read very well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, did it already locally, didn't push yet
pandas/core/arrays/categorical.py
Outdated
indexer = np.asarray(indexer) | ||
|
||
# take on empty array only valid if result is an all-missing array | ||
if not len(self) and not (indexer == -1).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in take_1d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls make this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you see my answer/question the first time you asked this?
I gave reasons not to do this, so at least please answer to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already responsded. take_1d
can certainly be used in an implementation of this. This is the general implementation of virtually all take behavior, so naturally a check for no indexers should happen there, rather than in each higher level implementation (e.g. EA). The point is that the way you are advocating is simply more complex / buggy and adds more code. Virtually all EA methods should use API's that pandas already has. Should these be public. Sure if possible. But for something like this we don't want to make the actual implementation public as it has certain guarantees. However I don't see any problem with an internal implementation (like EA) using it. it fact I would say they have / should use it. otherwise how else would categorical/DTI/II be implemented at all?
pandas/core/arrays/base.py
Outdated
return type(self)([self._na_value] * len(indexer)) | ||
else: | ||
raise IndexError( | ||
"cannot do a non-empty take from an empty array.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this raise condition hit in a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, see the single test I added in this PR: it tests both the all missing case and this error
# only valid if result is an all-missing array | ||
if mask.all(): | ||
return type(self)([self._na_value] * len(indexer)) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no else needed. is the raise hit?
# take on empty array | ||
if not len(self): | ||
# only valid if result is an all-missing array | ||
if mask.all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writing take is a fair amount of effort on the extension writer. meaning additional complexity and have to handle edge cases. take is a pure indexing operation and so ideally should be implemetned as high up in the stack as possible (meaning in the base class for EA) if the EA writer wants to override then ok, but it should just work generally. All of the machinery is already there. The point of the EA is to make it easy, having to handle all of these edge cases is not easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the machinery is already there.
Assuming you have an ndarray. Do we want to provide a default implementation that does an astype(object)
to convert to an ndarray of scalars, take
on that, and then construct a new EA?
I'm going to make a PR documenting which methods use .astype(object)
, so that the EA author should consider overriding them if performance is a concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming you have an ndarray. Do we want to provide a default implementation that does an astype(object) to convert to an ndarray of scalars, take on that, and then construct a new EA?
It will almost never be what you want I think. It could still serve as an example implementation of course, like we now have one in the docstring.
indexer = np.asarray(indexer) | ||
mask = indexer == -1 | ||
|
||
# take on empty array | ||
if not len(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then you need to ask yourself if this is the correct implementation. pandas implementst take_1d for exactly this reason.
pandas/core/arrays/base.py
Outdated
# only valid if result is an all-missing array | ||
if mask.all(): | ||
return type(self)([self._na_value] * len(indexer)) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls make this change.
also combine the comments, this doesn't read very well
pandas/core/arrays/categorical.py
Outdated
indexer = np.asarray(indexer) | ||
|
||
# take on empty array only valid if result is an all-missing array | ||
if not len(self) and not (indexer == -1).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls make this change
pandas/core/arrays/categorical.py
Outdated
|
||
# take on empty array only valid if result is an all-missing array | ||
if not len(self) and not (indexer == -1).all(): | ||
raise IndexError("cannot do a non-empty take from an empty array.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also needs a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already tested in the extension array tests for categorical (do you want a duplicate test in the categorical tests?)
Regarding the discussion whether this "indexer of all -1 in case of empty array"-check should be in It's just a bit a bigger endeavour to fix that. Of course I can rather easily add this two-line check to |
Putting @jreback's comment here to not have it lost in a hidden inline comment:
If you can point me to an answer that is not just "this should be in take_1d", I don't find it :) I removed the changes in this PR regarding Categorical (and opened a separate issue for that: #20664), so only adding tests for the ExtensionArray and the test implementations.
You advocate for that ExtensionArrays should use pandas take implementation, but at the same time say that you don't want to make it public? That seems to contradict each other. If we want to expose our |
@jreback I repeated some of my comments above about |
lgtm. though seems to be failing the numpy master build? |
Seems to be unrelated clipboard failures |
Added one more test (for |
Appveyor failure is unrelated and was succeeding before. |
Another noticed during geopandas testing:
ExtensionArray.take
needs to be able to handle the case where it is empty.Added a example implementation for Decimal/JSONArray.
And apparently, this was also failing for Categorical, so fixed that as well.