BUG: Accept Generic Array-Like for .where #15414

gfyoung · 2017-02-15T18:36:17Z

Setup:

>>> s = Series([1, 2, 3])
>>> cond = [False, True, True]

Before:

>>> s.where(cond)
...
ValueError: where requires an ndarray like object for its condition

After:

>>> s.where(cond)
0    NaN
1    2.0
2    3.0
dtype: float64

Note that Index objects can accept array-like objects as conditionals, and I wrote tests to confirm.

jreback · 2017-02-15T18:40:33Z

pandas/core/generic.py

-            The callable must not change input %(klass)s
-            (though pandas doesn't check it).
+            should return boolean %(klass)s or array. The callable must
+            not change input %(klass)s (though pandas doesn't check it).


maybe add 0.20.0 now supports array-like?

I don't consider it an enhancement. We generally accept array-like across the board.

jreback · 2017-02-15T18:40:47Z

pandas/tests/frame/test_indexing.py

@@ -2479,6 +2479,25 @@ def _check_set(df, cond, check_dtypes=True):
        expected = df[df['a'] == 1].reindex(df.index)
        assert_frame_equal(result, expected)

+    def test_where_array_like(self):
+        klasses = [list, tuple, np.array]


add this issue number

jreback · 2017-02-15T18:41:16Z

pandas/tests/frame/test_indexing.py

+        df['b'] = 2
+        expected['b'] = [2, np.nan, 2]
+        cond = [[False, True], [True, False], [True, True]]
+


can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)

@gfyoung everything looks lgtm. Can you add some tests for rejected things, ping on green.

jreback · 2017-02-15T18:41:28Z

pandas/tests/indexes/common.py

@@ -497,6 +497,18 @@ def test_where(self):
        result = i.where(cond)
        tm.assert_index_equal(result, expected)

+    def test_where_array_like(self):
+        i = self.create_index()


issue number

These are just confirmation tests. The behavior was never broken for indices.

jreback · 2017-02-15T18:42:18Z

don't suppose there is an issue about this outstanding?

gfyoung · 2017-02-15T18:43:18Z

don't suppose there is an issue about this outstanding?

No, there is not. I just stumbled across this while grappling with the NaN in integer Index.

gfyoung · 2017-02-15T18:44:34Z

@jreback : The Appveyor build is broken for everybody due to conda conflicts.

jreback · 2017-02-15T18:46:04Z

The Appveyor build is broken for everybody due to conda conflicts.

hmm, that's not friendly

jreback · 2017-02-15T18:50:34Z

I think the miniconda installer changed to 3.6.

gfyoung · 2017-02-15T19:20:50Z

I think the miniconda installer changed to 3.6.

Yikes...though the installer website page doesn't seem to reflect that if that's true...

jreback · 2017-02-15T19:24:12Z

neah its not that. I had fixed the conda installer a while back at 4.2.15. So was installing the latest miniconda, but an incompat version of conda :<

switching to latest fixes. pushing soon.

jreback · 2017-02-15T20:22:06Z

ok rebase and windows should be good.

codecov-io · 2017-02-17T08:55:00Z

Codecov Report

Merging #15414 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15414      +/-   ##
==========================================
+ Coverage   90.36%   90.36%   +<.01%     
==========================================
  Files         136      136              
  Lines       49532    49540       +8     
==========================================
+ Hits        44760    44768       +8     
  Misses       4772     4772

Impacted Files	Coverage Δ
pandas/indexes/base.py	`96.22% <ø> (ø)`	✅
pandas/core/generic.py	`96.31% <100%> (+0.01%)`	✅

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81c57e2...5037932. Read the comment docs.

jreback · 2017-02-17T14:56:28Z

pandas/core/generic.py

@@ -4717,24 +4717,41 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
        applied as a function even if callable. Used in __setitem__.
        """
        inplace = validate_bool_kwarg(inplace, 'inplace')
+        inplace = True if inplace else False


no this just makes things confusing, do this below in a condiationl, don't modify inplace

jreback · 2017-02-17T14:56:50Z

pandas/core/generic.py

+        #
+        # Conveniently, 'inplace' matches the boolean with
+        # which we want to fill.
+        cond = cond.fillna(inplace)


just do a conditional (or do bool(inplace) I guess).

jreback · 2017-02-17T14:57:53Z

pandas/core/generic.py

+        msg = "Boolean array expected for the condition, not {dtype}"
+
+        if not isinstance(cond, pd.DataFrame):
+            # This is a single-dimensional object.


none of this appears to be tested.

can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)

I saw your comment above, and I have yet to write those tests. In addition, this branch gets hit all the time for the current .where tests.

true, but not specifically for validation boolean-like inputs.

I'm not sure what you mean here. This branch gets hit whenever you call Series.where

these have not tests for the condition you added at all.

In [1]: s = Series([1,2,3])

In [2]: s.where(s+1) Out[2]: 0 1 1 2 2 3 dtype: int64

makes no sense and is not tested

sure this gets hit all the time with expected input. but I suspect is barely tests with unexpected input.

I'm getting to it with the tests for invalid input. That example should break if you pull down my changes at this point.

ok great, thanks @gfyoung ping when you want review

@jreback : Tests have been added. Have a look.

jreback

see comments

jreback · 2017-02-20T14:29:24Z

pandas/core/generic.py

+                if not is_bool_dtype(dt):
+                    raise ValueError(msg.format(dtype=dt))
+
+        cond = cond.astype(bool)


you can add copy=False

jreback · 2017-02-20T14:31:22Z

pandas/tests/frame/test_indexing.py

+        msg = "Boolean array expected for the condition"
+
+        conds = [
+            [[1], [0], [1]],


can you also add some that have an invalid shape? e.g. a scalar, an unalignable Series, a 1-dim ndarray (or is that broadcast)?

It depends in fact on the type of input you pass in. If you pass in an NDFrame instance for cond, we will "align" it with self, and there is no error. However, if the instance of that same input is not NDFrame, we will error due to misalignment.

ok so this would be good to test then :>

Will do once this recent commit has been tested on Travis and Appveyor.

@jreback : Tests have been added, and everything looks good. Have a look.

jreback · 2017-02-20T14:31:56Z

pandas/tests/series/test_indexing.py

+        msg = "Boolean array expected for the condition"
+
+        conds = [
+            [1, 0, 1],


maybe add a DataFrame as input (maybe that works if the names match?)

Works if names match. Fails if they do not. This is because we "align" the cond input with self before applying it, so if the names don't match cond becomes a DataFrame of NaN.

Yep, test was added just now. Just waiting for Travis and Appveyor to approve now.

jreback · 2017-02-21T13:15:23Z

pandas/core/generic.py

+        # If 'inplace' is True, we want to fill with True
+        # before inverting. If 'inplace' is False, we will
+        # fill with False and do nothing else.
+        #


this is really strange. Why are you not always filing with True? and if not please show me an example that depends on this.

Lots of tests break, even basic ones like this:

>>> s = Series([1, 2, 3]) >>> s.where([True, False, True]) >>> s 0 NaN 1 NaN 2 NaN dtype: float64

jreback · 2017-02-21T13:16:07Z

pandas/core/generic.py

+                if not is_bool_dtype(dt):
+                    raise ValueError(msg.format(dtype=dt))
+
+        cond = cond.astype(bool, copy=False)


and you are flipping this again here? (according to inplace). This is really convoluted

Perhaps, but this change is just a refactor of what was there before. Also, test break if we don't do that inversion, as you saw above.

I understand, but your refactor is now much more confusing. can you try to make this simpler?

Wait, are you saying the original code on master was simpler to understand!? I fail to see how that is the case.

well, your new code is VERY confusing.

That doesn't really answer my question.

jreback · 2017-02-21T13:17:10Z

pandas/tests/frame/test_indexing.py

+
+        conds = [
+            [[1], [0], [1]],
+            Series([[2], [5], [7]]),


why would you use a Series with a list-like? that is really odd here (even to test), just use a regular alignable, non-boolean series.

This is because a list-of-lists is already aligned with DataFrame, and I'm just wrapping it with Series. The only thing I want to be invalid about this input is that it has numeric (instead of boolean) values. I don't want to also test that the Series object gets aligned here.

jreback · 2017-02-21T13:17:17Z

pandas/tests/frame/test_indexing.py

+        df['b'] = 2
+        conds = [
+            [[0, 1], [1, 0], [1, 1]],
+            Series([[0, 2], [5, 0], [4, 7]]),


same answer as above

jreback · 2017-02-21T13:17:40Z

pandas/tests/frame/test_indexing.py

+            [[0, 1], [1, 0], [1, 1]],
+            Series([[0, 2], [5, 0], [4, 7]]),
+            [["False", "True"], ["True", "False"],
+             ["True", "True"]],


try at least a DataFrame as well

jreback · 2017-02-22T17:27:25Z

pandas/core/generic.py

+        # fill with False and do nothing else.
+        #
+        # Conveniently, 'inplace' matches the boolean with
+        # which we want to fill.


this comment in particular is very very odd. what does inplace have to do with the value you fill?

The code on master should illuminate how I refactored everything.

my point is that you are using the inplace variable in a completely odd way. It needs to be factored to avoid that. Its just not readable code.

Fair enough. I rewrote it slightly to hopefully improve readability.

@jreback : Created fill_value variable to make the code more readable. Have a look.

jreback · 2017-02-24T20:05:37Z

thanks. I looked at the previous code and realize that inplace was being used similarly to how you are using it.

jorisvandenbossche · 2017-02-25T22:08:54Z

pandas/tests/frame/test_indexing.py

+
+        expected = DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan]])
+
+        out = df.where(Series(cond))


This already worked like this before I think, but since we explicitly added a test for it: how can this result be explained? The Series aligned on the index and broadcasted along the columns? That is not how alignment otherwise works .. (alignment of cond is also not mentioned in the docstring)

For NDFrame conditionals, we call .align so that no error is raised since the cond becomes aligned with self.

But those do not just align (unless you specify the axis on which to align):

In [32]: df = DataFrame([[1, 2, 3], [4, 5, 6]]) In [33]: cond = [True] In [34]: df.align(Series(cond)) ... ValueError: Must specify axis=0 or 1

There is an axis keyword in where, but that does not seem to have any effect in this case.

So it seems that where aligns on the index, as opposed to eg other operators

I know that. The code does it internally.

Yes, I don't doubt that you know that, but my initial question was: how do we explain this (to users)? This is not documented, this is IMO inconsistent, and one would expect the axis keyword can influence this.

@jorisvandenbossche : Actually, no. Regardless of axis, we are able to align correctly and perform .where. I should add that this behavior existed before my patch and that @jreback just requested that I add tests to check existing behavior.

Whether or not this behavior is correct is another story. I can see your point regarding this inconsistency, but mind you that changing it to err would be an API change.

Yes, I know this behaviour was already like that before your patch, as I mentioned in my initial comment. I just raised this because we where explicitly adding a test for it (while the behaviour could also have a case of 'undefined/uncatched behaviour' that was not explicitly intended).

But probably better to discuss in separate issue.

Other question: what is the purpose of the axis keyword?

OK, I see in the code this is used to align other, not cond.
I think it is rather confusing that cond and other are aligned in a different way.

But probably better to discuss in separate issue.

Agreed.

jorisvandenbossche · 2017-02-25T22:09:02Z

pandas/core/generic.py


            .. versionadded:: 0.18.1

-            A callable can be used as cond.


This belonged to the versionadded directive, so I don't think it should be removed, as now the versionadded indication is not really correct (or remove both)

I personally found that confusing. I wouldn't have known that from reading the original doc.

In the html docs it is slightly clearer. But indeed, for a correct syntax, the empty line should not be there, like

.. versionadded:: 0.18.1 A callable can be used as cond.

@jorisvandenbossche : PR up to patch (#15509)

jreback · 2017-02-25T22:55:28Z

I think the alignment in .where is a mistake, but it probably is never used. We don't do alignment on boolean things anywhere else. So should remove this

Author: gfyoung <[email protected]> Closes pandas-dev#15414 from gfyoung/generic-where-gen-array and squashes the following commits: 5037932 [gfyoung] BUG: Accept generic array-like in .where

gfyoung force-pushed the generic-where-gen-array branch from 129561c to 7288245 Compare February 15, 2017 18:36

jreback reviewed Feb 15, 2017

View reviewed changes

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Indexing Related to indexing on series/frames, not to indexes themselves labels Feb 15, 2017

gfyoung force-pushed the generic-where-gen-array branch 2 times, most recently from bf3bacd to f821bab Compare February 15, 2017 18:50

gfyoung force-pushed the generic-where-gen-array branch from f821bab to a58e322 Compare February 15, 2017 21:12

jreback added this to the 0.20.0 milestone Feb 16, 2017

gfyoung force-pushed the generic-where-gen-array branch from a58e322 to 52c60d6 Compare February 16, 2017 23:17

jreback reviewed Feb 17, 2017

View reviewed changes

jreback requested changes Feb 17, 2017

View reviewed changes

gfyoung force-pushed the generic-where-gen-array branch 2 times, most recently from 94d3cee to 5f755a1 Compare February 18, 2017 21:03

jreback reviewed Feb 20, 2017

View reviewed changes

gfyoung force-pushed the generic-where-gen-array branch 4 times, most recently from fda5980 to 5cb249e Compare February 21, 2017 08:52

jreback reviewed Feb 21, 2017

View reviewed changes

gfyoung force-pushed the generic-where-gen-array branch 3 times, most recently from ae72c0a to c7634ae Compare February 22, 2017 16:49

jreback reviewed Feb 22, 2017

View reviewed changes

gfyoung force-pushed the generic-where-gen-array branch 3 times, most recently from d2b951d to fe4753a Compare February 23, 2017 20:27

BUG: Accept generic array-like in .where

5037932

gfyoung force-pushed the generic-where-gen-array branch from fe4753a to 5037932 Compare February 24, 2017 15:22

jreback closed this in 7e0a71b Feb 24, 2017

gfyoung deleted the generic-where-gen-array branch February 24, 2017 20:36

jorisvandenbossche reviewed Feb 25, 2017

View reviewed changes

gfyoung mentioned this pull request Feb 25, 2017

DOC: Fix versionadded for cond in .where #15509

Merged

jorisvandenbossche mentioned this pull request Mar 3, 2017

DOC/API: clarify usage of where, mask broadcast arguments #15558

Open


		expected = DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan]])

		out = df.where(Series(cond))

BUG: Accept Generic Array-Like for .where #15414

BUG: Accept Generic Array-Like for .where #15414

Conversation

gfyoung commented Feb 15, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Feb 15, 2017

gfyoung commented Feb 15, 2017 • edited Loading

gfyoung commented Feb 15, 2017

jreback commented Feb 15, 2017

jreback commented Feb 15, 2017

gfyoung commented Feb 15, 2017

jreback commented Feb 15, 2017

jreback commented Feb 15, 2017

codecov-io commented Feb 17, 2017 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Feb 17, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Feb 21, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Feb 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Feb 24, 2017

Choose a reason for hiding this comment

gfyoung Feb 25, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Feb 25, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung commented Feb 15, 2017 •

edited

Loading

gfyoung commented Feb 15, 2017 •

edited

Loading

codecov-io commented Feb 17, 2017 •

edited

Loading

gfyoung Feb 17, 2017 •

edited

Loading

gfyoung Feb 21, 2017 •

edited

Loading

gfyoung Feb 22, 2017 •

edited

Loading

gfyoung Feb 25, 2017 •

edited

Loading

gfyoung Feb 25, 2017 •

edited

Loading