TST: add tests for take() on empty arrays #20582

jorisvandenbossche · 2018-04-02T12:48:48Z

Another issue noticed during geopandas testing: ExtensionArray.take needs to be able to handle the case where the array is empty.

Added an example implementation for Decimal/JSONArray.
Apparently this was also failing for Categorical, so I fixed that as well.

jorisvandenbossche · 2018-04-02T12:50:30Z

pandas/core/arrays/categorical.py

+            else:
+                raise IndexError(
+                    "cannot do a non-empty take from an empty array.")
+
        codes = take_1d(self._codes, indexer, allow_fill=True, fill_value=-1)


For the internal use cases (Categorical at the moment, later maybe more arrays), I could also push this check down into take_1d. But that is also used by many other things, so I am not sure I can just do that.

this should not be here, but rather in take_1d (I see your comment)

I simplified the check a bit here (as it only needed to check the empty and not-all -1 case, as when indexer is all -1 take_1d works as expected on empty arrays).

@jreback I am not sure if I should move it to take_1d. It depends on the guarantees we want to give (ourselves) for take_1d (whether it should do such bounds checking). It might be that all code that currently uses take_1d/nd already does this bounds checking (I did not find a practical use case, apart from Categorical.take itself, where we run into this bug), and then doing another one in take_nd would incur a performance penalty.

jorisvandenbossche · 2018-04-02T12:52:21Z

pandas/tests/extension/base/getitem.py

+        na_cmp(result[0], na_value)
+
+        with tm.assert_raises_regex(IndexError, "cannot do a non-empty take"):
+            empty.take([0, 1])


@TomAugspurger I didn't find any existing tests for take, so I am not sure the indexing tests are the best place (maybe rather the BaseMethodsTests).
I could also add tests for the actual use cases where you run into this (e.g. reindex on an empty series).

This seems like the right place for testing a behavior specific to take.

jreback · 2018-04-02T12:55:02Z

pandas/tests/extension/decimal/array.py

        indexer = np.asarray(indexer)
        mask = indexer == -1

+        # take on empty array
+        if not len(self):


DecimalArray.take does not use take_1d, so it's not possible to move it

then you need to ask yourself if this is the correct implementation. pandas implements take_1d for exactly this reason.

take_1d is not a public function. Do you want to expose it publicly?

TomAugspurger

Could you document this behavior in ExtensionArray.take, either the docstring or as a comment?

TomAugspurger · 2018-04-02T13:30:22Z

pandas/tests/extension/base/getitem.py

+        na_cmp(result[0], na_value)
+
+        with tm.assert_raises_regex(IndexError, "cannot do a non-empty take"):
+            empty.take([0, 1])


This seems like the right place for testing a behavior specific to take.

TomAugspurger · 2018-04-02T13:31:50Z

pandas/tests/extension/decimal/array.py

@@ -62,7 +62,7 @@ def __len__(self):
        return len(self.values)

    def __repr__(self):
-        return repr(self.values)
+        return "DecimalArray: " + repr(self.values)


Thanks, meant to add this a while ago :)

codecov · 2018-04-02T15:02:38Z

Codecov Report

❗ No coverage uploaded for pull request base (master@da33359).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #20582   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49279           
  Branches          ?        0           
=========================================
  Hits              ?    45261           
  Misses            ?     4018           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.23% <ø> (?)`
#single	`41.9% <ø> (?)`

Impacted Files	Coverage Δ
pandas/core/arrays/base.py	`84.14% <ø> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update da33359...b5d357f.

jorisvandenbossche · 2018-04-02T15:17:01Z

pandas/core/arrays/base.py

+                       return type(self)([self._na_value] * len(indexer))
+                   else:
+                       raise IndexError(
+                           "cannot do a non-empty take from an empty array.")


@TomAugspurger Adding the full example is maybe a bit too much (it defeats a bit the purpose of the small illustrative example)? On the other hand, it is needed for a correct implementation.

Agreed... I think that just having a comment noting the correct thing to do for .take on an empty EA is fine.

@TomAugspurger I added some explanation in the docstring, but I decided to keep the code example as well: it is now a bit shorter + it is actually really needed to get a correct implementation.
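
For readers following the docstring discussion, the empty-array handling can be sketched roughly as below. This is only an illustration under stated assumptions: `ListBackedArray` is a hypothetical toy class (not the pandas `ExtensionArray` base class), the `take(indexer)` signature is simplified, `None` is an assumed `_na_value`, and `-1` is treated as the only missing-value sentinel.

```python
class ListBackedArray:
    """Toy stand-in for an extension array (hypothetical, not the pandas API)."""

    _na_value = None  # assumed missing-value marker for this sketch

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def take(self, indexer):
        indexer = list(indexer)
        mask = [i == -1 for i in indexer]  # -1 marks "fill with missing"

        # take on an empty array is only valid if the result is all-missing
        if not len(self):
            if all(mask):
                return type(self)([self._na_value] * len(indexer))
            raise IndexError("cannot do a non-empty take from an empty array.")

        return type(self)(
            [self._na_value if m else self.values[i]
             for i, m in zip(indexer, mask)]
        )


empty = ListBackedArray([])
all_missing = empty.take([-1, -1])          # allowed: result is all-missing
regular = ListBackedArray([10, 20, 30]).take([2, -1, 0])
```

The `raise` follows the `if` directly, without an `else`, in line with the review suggestion made later in this thread.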

jreback · 2018-04-02T22:25:35Z

pandas/core/arrays/base.py

+                   # only valid if result is an all-missing array
+                   if mask.all():
+                       return type(self)([self._na_value] * len(indexer))
+                   else:


you don't need the else, just raise

Yeah, the 'else' clause is indeed not essential. I adapted the error message slightly ("from empty axes" -> "from empty array") because I did not find it the best for ExtensionArrays, but given the minor change I can certainly leave it out for code simplicity.

pls make this change.

also combine the comments, this doesn't read very well

yep, did it already locally, didn't push yet

jreback · 2018-04-02T22:26:31Z

pandas/core/arrays/categorical.py

+        indexer = np.asarray(indexer)
+
+        # take on empty array only valid if result is an all-missing array
+        if not len(self) and not (indexer == -1).all():


this should be in take_1d

pls make this change

Did you see my answer/question the first time you asked this?

I gave reasons not to do this, so at least please answer to them.

I already responded. take_1d can certainly be used in an implementation of this. This is the general implementation of virtually all take behavior, so naturally a check for no indexers should happen there, rather than in each higher-level implementation (e.g. EA). The point is that the way you are advocating is simply more complex / buggy and adds more code. Virtually all EA methods should use APIs that pandas already has. Should these be public? Sure, if possible. But for something like this we don't want to make the actual implementation public, as it has certain guarantees. However, I don't see any problem with an internal implementation (like EA) using it. In fact I would say they have to / should use it; otherwise how else would categorical/DTI/II be implemented at all?

jreback · 2018-04-02T22:26:50Z

pandas/core/arrays/base.py

+                       return type(self)([self._na_value] * len(indexer))
+                   else:
+                       raise IndexError(
+                           "cannot do a non-empty take from an empty array.")


is this raise condition hit in a test?

Yes, see the single test I added in this PR: it tests both the all missing case and this error

jreback · 2018-04-02T22:27:12Z

pandas/tests/extension/decimal/array.py

+            # only valid if result is an all-missing array
+            if mask.all():
+                return type(self)([self._na_value] * len(indexer))
+            else:


no else needed. is the raise hit?

jreback · 2018-04-02T22:28:58Z

pandas/tests/extension/decimal/array.py

+        # take on empty array
+        if not len(self):
+            # only valid if result is an all-missing array
+            if mask.all():


Writing take is a fair amount of effort for the extension writer, meaning additional complexity and having to handle edge cases. take is a pure indexing operation and so ideally should be implemented as high up in the stack as possible (meaning in the base class for EA). If the EA writer wants to override it, then OK, but it should just work generally. All of the machinery is already there. The point of the EA is to make it easy; having to handle all of these edge cases is not easy.

All of the machinery is already there.

Assuming you have an ndarray. Do we want to provide a default implementation that does an astype(object) to convert to an ndarray of scalars, take on that, and then construct a new EA?

I'm going to make a PR documenting which methods use .astype(object), so that the EA author should consider overriding them if performance is a concern.

Assuming you have an ndarray. Do we want to provide a default implementation that does an astype(object) to convert to an ndarray of scalars, take on that, and then construct a new EA?

It will almost never be what you want I think. It could still serve as an example implementation of course, like we now have one in the docstring.
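
The generic fallback being debated here could look roughly like the sketch below. It is an assumption-laden illustration, not pandas code: `fallback_take` and its `from_scalars` argument are hypothetical names, `list(array)` stands in for `astype(object)`, and a tuple of `Decimal` values stands in for an extension array.

```python
from decimal import Decimal


def fallback_take(array, indexer, na_value, from_scalars):
    """Generic take: materialise scalars, index them, rebuild the array."""
    scalars = list(array)  # stands in for array.astype(object)

    # an empty source can only produce an all-missing result
    if not scalars and any(i != -1 for i in indexer):
        raise IndexError("cannot do a non-empty take from an empty array.")

    taken = [na_value if i == -1 else scalars[i] for i in indexer]
    return from_scalars(taken)  # rebuild the original container type


data = (Decimal("1.5"), Decimal("2.5"))
result = fallback_take(data, [1, -1, 0], Decimal("NaN"), tuple)
```

The round trip through individual scalars is what makes this convenient but potentially slow, which is why an author concerned with performance would likely override it.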

jreback · 2018-04-02T22:29:53Z

pandas/tests/extension/decimal/array.py

        indexer = np.asarray(indexer)
        mask = indexer == -1

+        # take on empty array
+        if not len(self):


then you need to ask yourself if this is the correct implementation. pandas implements take_1d for exactly this reason.

jreback · 2018-04-05T15:32:55Z

pandas/core/arrays/base.py

+                   # only valid if result is an all-missing array
+                   if mask.all():
+                       return type(self)([self._na_value] * len(indexer))
+                   else:


pls make this change.

also combine the comments, this doesn't read very well

jreback · 2018-04-05T15:33:20Z

pandas/core/arrays/categorical.py

+        indexer = np.asarray(indexer)
+
+        # take on empty array only valid if result is an all-missing array
+        if not len(self) and not (indexer == -1).all():


pls make this change

jreback · 2018-04-05T15:34:20Z

pandas/core/arrays/categorical.py

+
+        # take on empty array only valid if result is an all-missing array
+        if not len(self) and not (indexer == -1).all():
+            raise IndexError("cannot do a non-empty take from an empty array.")


this also needs a test

This is already tested in the extension array tests for categorical (do you want a duplicate test in the categorical tests?)

jorisvandenbossche · 2018-04-09T07:35:35Z

Regarding the discussion whether this "indexer of all -1 in case of empty array"-check should be in take_1d or in the array's take.
I agree that ideally we should do this in take_1d. I just want to note that, currently, we don't do any bounds checking in take_1d (also not for e.g. too-large indices). So it might be that all methods calling take_1d already do this and we didn't add it to take_1d for performance reasons, or there might be other reasons; I don't know.

It's just a somewhat bigger endeavour to fix that. Of course I can rather easily add this two-line check to take_1d, but then we get into a state where take_1d does only a partial, incomplete bounds check. So ideally we should decide how low- or high-level we consider take_1d to be (or e.g. add another take method that is stricter in checking input, to not have to change existing internal code), and if we decide that take_1d is a high-level function that should guarantee bounds checking, update all code that uses it.

jorisvandenbossche · 2018-04-12T09:59:26Z

Putting @jreback's comment here to not have it lost in a hidden inline comment:

I already responded. take_1d can certainly be used in an implementation of this. This is the general implementation of virtually all take behavior, so naturally a check for no indexers should happen there, rather than in each higher-level implementation (e.g. EA). The point is that the way you are advocating is simply more complex / buggy and adds more code. Virtually all EA methods should use APIs that pandas already has. Should these be public? Sure, if possible. But for something like this we don't want to make the actual implementation public, as it has certain guarantees. However, I don't see any problem with an internal implementation (like EA) using it. In fact I would say they have to / should use it; otherwise how else would categorical/DTI/II be implemented at all?

If you can point me to an answer that is not just "this should be in take_1d", I don't find it :)

I removed the changes in this PR regarding Categorical (and opened a separate issue for that: #20664), so only adding tests for the ExtensionArray and the test implementations.
So we can keep bigger take_1d changes for another PR.

take_1d is internally used in many cases where the indexer passed to take_1d is the result of eg get_indexer, so this ensures that the indexer is correct, and for those no out-of-bounds checks are needed.

You advocate that ExtensionArrays should use pandas' take implementation, but at the same time say that you don't want to make it public? Those two positions seem to contradict each other.

If we want to expose our take implementation to extension array authors (and maybe we are such authors ourselves as well), I would personally create a new take function which just calls the existing take_1d, but additionally does bounds checking first.
That way, we keep on using the faster take_1d functions internally where we know bounds checks are not needed, and we can make the new take method public for external usage as well.
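
The two-layer idea proposed above can be sketched as follows. This is a pure-Python illustration under assumptions, not the pandas implementation: `take_unchecked` and `take_checked` are hypothetical names, plain lists stand in for ndarrays, and `-1` is the only accepted missing-value sentinel.

```python
def take_unchecked(values, indexer, fill_value=None):
    """Fast path: trusts the indexer (e.g. one produced by get_indexer)."""
    # Note: an index such as -2 silently wraps around in Python lists.
    return [fill_value if i == -1 else values[i] for i in indexer]


def take_checked(values, indexer, fill_value=None):
    """Validate the indexer first, then delegate to the fast path."""
    n = len(values)
    for i in indexer:
        # -1 is the missing-value sentinel; everything else must be in [0, n)
        if i < -1 or i >= n:
            raise IndexError(f"index {i} is out of bounds for size {n}")
    return take_unchecked(values, indexer, fill_value)


ok = take_checked([10, 20, 30], [0, -1, 2])
from_empty = take_checked([], [-1, -1], fill_value=0)
```

With this split, a non-empty take from an empty array is rejected by the general bounds check (any index `>= 0` fails when the size is 0), so no separate empty-array branch is needed in the checked variant.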

jorisvandenbossche · 2018-04-12T10:07:59Z

@jreback I repeated some of my comments above about take_1d / fixing a take implementation itself so it can be used in extension arrays, in #20640.
So maybe we can continue the discussion on that topic there? And then here only further discuss the actual changes in this PR (I removed everything related to Categorical, I am only adding some generic take tests and fixing it for the test example arrays).

jreback · 2018-04-14T13:54:03Z

lgtm. though seems to be failing the numpy master build?

jorisvandenbossche · 2018-04-16T07:03:59Z

though seems to be failing the numpy master build?

Seems to be unrelated clipboard failures

jorisvandenbossche · 2018-04-16T07:19:55Z

Added one more test (for reindex, which was the actual method where this difference in the take API surfaced).

jorisvandenbossche · 2018-04-17T07:53:26Z

Appveyor failure is unrelated and was succeeding before.

BUG: fix take() on empty arrays

4d7cc21

jorisvandenbossche added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and ExtensionArray (extending pandas with custom dtypes or arrays) labels Apr 2, 2018

jorisvandenbossche added this to the 0.23.0 milestone Apr 2, 2018

jorisvandenbossche commented Apr 2, 2018


jreback requested changes Apr 2, 2018


TomAugspurger reviewed Apr 2, 2018


simplify check in categorical

aba95d5

add docs

3d6fd2e

jorisvandenbossche commented Apr 2, 2018


jreback requested changes Apr 2, 2018


jreback requested changes Apr 5, 2018


simplify check

eacb3d5

jorisvandenbossche mentioned this pull request Apr 9, 2018

API: take interface for (Extension)Array-likes #20640

Closed

jorisvandenbossche added 2 commits April 12, 2018 11:26

Merge remote-tracking branch 'upstream/master' into test-ea-empty-take

ee66d61

expand take test + undo changes in categorical

c4faf8e

jorisvandenbossche mentioned this pull request Apr 12, 2018

BUG: Categorical.take and Series([Categorical]).take is inconsistent with other dtypes #20664

Closed

jreback approved these changes Apr 14, 2018


add reindex test

9257203

add failing test for Series.take

a3f88ee

jorisvandenbossche changed the title ~~BUG: fix take() on empty arrays~~ TST: add tests for take() on empty arrays Apr 16, 2018

Merge remote-tracking branch 'upstream/master' into test-ea-empty-take

b5d357f

jorisvandenbossche merged commit 6245e8c into pandas-dev:master Apr 17, 2018

jorisvandenbossche deleted the test-ea-empty-take branch April 17, 2018 07:53
