Skip to content

BUG: Accept Generic Array-Like for .where #15414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -548,6 +548,7 @@ Bug Fixes



- Bug in ``Series.where()`` and ``DataFrame.where()`` where array-like conditionals were being rejected (:issue:`15414`)
- Bug in ``Series`` construction with a datetimetz (:issue:`14928`)

- Bug in compat for passing long integers to ``Timestamp.replace`` (:issue:`15030`)
Expand Down
37 changes: 21 additions & 16 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -4732,17 +4732,28 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
cond, _ = cond.align(self, join='right', broadcast_axis=1)
else:
if not hasattr(cond, 'shape'):
raise ValueError('where requires an ndarray like object for '
'its condition')
cond = np.asanyarray(cond)
if cond.shape != self.shape:
raise ValueError('Array conditional must be same shape as '
'self')
cond = self._constructor(cond, **self._construct_axes_dict())

if inplace:
cond = -(cond.fillna(True).astype(bool))
fill_value = True if inplace else False
cond = cond.fillna(fill_value)

msg = "Boolean array expected for the condition, not {dtype}"

if not isinstance(cond, pd.DataFrame):
# This is a single-dimensional object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of this appears to be tested.

Copy link
Member Author

@gfyoung gfyoung Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)

I saw your comment above, and I have yet to write those tests. In addition, this branch gets hit all the time for the current .where tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but not specifically for validation boolean-like inputs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here. This branch gets hit whenever you call Series.where

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have not tests for the condition you added at all.

In [1]: s = Series([1,2,3])

In [2]: s.where(s+1)
Out[2]: 
0    1
1    2
2    3
dtype: int64

makes no sense and is not tested

sure this gets hit all the time with expected input. but I suspect is barely tests with unexpected input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting to it with the tests for invalid input. That example should break if you pull down my changes at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great, thanks @gfyoung ping when you want review

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Tests have been added. Have a look.

if not is_bool_dtype(cond):
raise ValueError(msg.format(dtype=cond.dtype))
else:
cond = cond.fillna(False).astype(bool)
for dt in cond.dtypes:
if not is_bool_dtype(dt):
raise ValueError(msg.format(dtype=dt))

cond = cond.astype(bool, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and you are flipping this again here? (according to inplace). This is really convoluted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but this change is just a refactor of what was there before. Also, test break if we don't do that inversion, as you saw above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but your refactor is now much more confusing. can you try to make this simpler?

Copy link
Member Author

@gfyoung gfyoung Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, are you saying the original code on master was simpler to understand!? I fail to see how that is the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, your new code is VERY confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't really answer my question.

cond = -cond if inplace else cond

# try to align
try_quick = True
Expand Down Expand Up @@ -4891,26 +4902,20 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This belonged to the versionadded directive, so I don't think it should be removed, as now the versionadded indication is not really correct (or remove both)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally found that confusing. I wouldn't have known that from reading the original doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the html docs it is slightly clearer. But indeed, for a correct syntax, the empty line should not be there, like

.. versionadded:: 0.18.1
    A callable can be used as cond.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : PR up to patch (#15509)

Parameters
----------
cond : boolean %(klass)s, array or callable
cond : boolean %(klass)s, array-like, or callable
If cond is callable, it is computed on the %(klass)s and
should return boolean %(klass)s or array.
The callable must not change input %(klass)s
(though pandas doesn't check it).
should return boolean %(klass)s or array. The callable must
not change input %(klass)s (though pandas doesn't check it).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add 0.20.0 now supports array-like?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't consider it an enhancement. We generally accept array-like across the board.


.. versionadded:: 0.18.1

A callable can be used as cond.

other : scalar, %(klass)s, or callable
If other is callable, it is computed on the %(klass)s and
should return scalar or %(klass)s.
The callable must not change input %(klass)s
(though pandas doesn't check it).
should return scalar or %(klass)s. The callable must not
change input %(klass)s (though pandas doesn't check it).

.. versionadded:: 0.18.1

A callable can be used as other.

inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
Expand Down
2 changes: 1 addition & 1 deletion pandas/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -573,7 +573,7 @@ def repeat(self, repeats, *args, **kwargs):

Parameters
----------
cond : boolean same length as self
cond : boolean array-like with the same length as self
other : scalar, or array-like
"""

Expand Down
89 changes: 89 additions & 0 deletions pandas/tests/frame/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -2479,6 +2479,95 @@ def _check_set(df, cond, check_dtypes=True):
expected = df[df['a'] == 1].reindex(df.index)
assert_frame_equal(result, expected)

def test_where_array_like(self):
# see gh-15414
klasses = [list, tuple, np.array]

df = DataFrame({'a': [1, 2, 3]})
cond = [[False], [True], [True]]
expected = DataFrame({'a': [np.nan, 2, 3]})

for klass in klasses:
result = df.where(klass(cond))
assert_frame_equal(result, expected)

df['b'] = 2
expected['b'] = [2, np.nan, 2]
cond = [[False, True], [True, False], [True, True]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung everything looks lgtm. Can you add some tests for rejected things, ping on green.

for klass in klasses:
result = df.where(klass(cond))
assert_frame_equal(result, expected)

def test_where_invalid_input(self):
# see gh-15414: only boolean arrays accepted
df = DataFrame({'a': [1, 2, 3]})
msg = "Boolean array expected for the condition"

conds = [
[[1], [0], [1]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add some that have an invalid shape? e.g. a scalar, an unalignable Series, a 1-dim ndarray (or is that broadcast)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends in fact on the type of input you pass in. If you pass in an NDFrame instance for cond, we will "align" it with self, and there is no error. However, if the instance of that same input is not NDFrame, we will error due to misalignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so this would be good to test then :>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do once this recent commit has been tested on Travis and Appveyor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Tests have been added, and everything looks good. Have a look.

Series([[2], [5], [7]]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would you use a Series with a list-like? that is really odd here (even to test), just use a regular alignable, non-boolean series.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because a list-of-lists is already aligned with DataFrame, and I'm just wrapping it with Series. The only thing I want to be invalid about this input is that it has numeric (instead of boolean) values. I don't want to also test that the Series object gets aligned here.

DataFrame({'a': [2, 5, 7]}),
[["True"], ["False"], ["True"]],
[[Timestamp("2017-01-01")],
[pd.NaT], [Timestamp("2017-01-02")]]
]

for cond in conds:
with tm.assertRaisesRegexp(ValueError, msg):
df.where(cond)

df['b'] = 2
conds = [
[[0, 1], [1, 0], [1, 1]],
Series([[0, 2], [5, 0], [4, 7]]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer as above

[["False", "True"], ["True", "False"],
["True", "True"]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try at least a DataFrame as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

DataFrame({'a': [2, 5, 7], 'b': [4, 8, 9]}),
[[pd.NaT, Timestamp("2017-01-01")],
[Timestamp("2017-01-02"), pd.NaT],
[Timestamp("2017-01-03"), Timestamp("2017-01-03")]]
]

for cond in conds:
with tm.assertRaisesRegexp(ValueError, msg):
df.where(cond)

def test_where_dataframe_col_match(self):
df = DataFrame([[1, 2, 3], [4, 5, 6]])
cond = DataFrame([[True, False, True], [False, False, True]])

out = df.where(cond)
expected = DataFrame([[1.0, np.nan, 3], [np.nan, np.nan, 6]])
tm.assert_frame_equal(out, expected)

cond.columns = ["a", "b", "c"] # Columns no longer match.
msg = "Boolean array expected for the condition"
with tm.assertRaisesRegexp(ValueError, msg):
df.where(cond)

def test_where_ndframe_align(self):
msg = "Array conditional must be same shape as self"
df = DataFrame([[1, 2, 3], [4, 5, 6]])

cond = [True]
with tm.assertRaisesRegexp(ValueError, msg):
df.where(cond)

expected = DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan]])

out = df.where(Series(cond))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already worked like this before I think, but since we explicitly added a test for it: how can this result be explained? The Series aligned on the index and broadcasted along the columns? That is not how alignment otherwise works .. (alignment of cond is also not mentioned in the docstring)

Copy link
Member Author

@gfyoung gfyoung Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For NDFrame conditionals, we call .align so that no error is raised since the cond becomes aligned with self.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But those do not just align (unless you specify the axis on which to align):

In [32]: df = DataFrame([[1, 2, 3], [4, 5, 6]])

In [33]: cond = [True]

In [34]: df.align(Series(cond))
...
ValueError: Must specify axis=0 or 1

There is an axis keyword in where, but that does not seem to have any effect in this case.

So it seems that where aligns on the index, as opposed to eg other operators

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that. The code does it internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't doubt that you know that, but my initial question was: how do we explain this (to users)? This is not documented, this is IMO inconsistent, and one would expect the axis keyword can influence this.

Copy link
Member Author

@gfyoung gfyoung Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : Actually, no. Regardless of axis, we are able to align correctly and perform .where. I should add that this behavior existed before my patch and that @jreback just requested that I add tests to check existing behavior.

Whether or not this behavior is correct is another story. I can see your point regarding this inconsistency, but mind you that changing it to err would be an API change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know this behaviour was already like that before your patch, as I mentioned in my initial comment. I just raised this because we where explicitly adding a test for it (while the behaviour could also have a case of 'undefined/uncatched behaviour' that was not explicitly intended).

But probably better to discuss in separate issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other question: what is the purpose of the axis keyword?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see in the code this is used to align other, not cond.
I think it is rather confusing that cond and other are aligned in a different way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But probably better to discuss in separate issue.

Agreed.

tm.assert_frame_equal(out, expected)

cond = np.array([False, True, False, True])
with tm.assertRaisesRegexp(ValueError, msg):
df.where(cond)

expected = DataFrame([[np.nan, np.nan, np.nan], [4, 5, 6]])

out = df.where(Series(cond))
tm.assert_frame_equal(out, expected)

def test_where_bug(self):

# GH 2793
Expand Down
12 changes: 12 additions & 0 deletions pandas/tests/indexes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -497,6 +497,18 @@ def test_where(self):
result = i.where(cond)
tm.assert_index_equal(result, expected)

def test_where_array_like(self):
i = self.create_index()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just confirmation tests. The behavior was never broken for indices.


_nan = i._na_value
cond = [False] + [True] * (len(i) - 1)
klasses = [list, tuple, np.array, pd.Series]
expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype)

for klass in klasses:
result = i.where(klass(cond))
tm.assert_index_equal(result, expected)

def test_setops_errorcases(self):
for name, idx in compat.iteritems(self.indices):
# # non-iterable input
Expand Down
11 changes: 10 additions & 1 deletion pandas/tests/indexes/period/test_period.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,13 +89,22 @@ def test_where(self):
expected = i
tm.assert_index_equal(result, expected)

i2 = i.copy()
i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(),
freq='D')
result = i.where(notnull(i2))
expected = i2
tm.assert_index_equal(result, expected)

def test_where_array_like(self):
i = self.create_index()
cond = [False] + [True] * (len(i) - 1)
klasses = [list, tuple, np.array, Series]
expected = pd.PeriodIndex([pd.NaT] + i[1:].tolist(), freq='D')

for klass in klasses:
result = i.where(klass(cond))
tm.assert_index_equal(result, expected)

def test_where_other(self):

i = self.create_index()
Expand Down
12 changes: 11 additions & 1 deletion pandas/tests/indexes/test_category.py
Original file line number Diff line number Diff line change
Expand Up @@ -240,13 +240,23 @@ def test_where(self):
expected = i
tm.assert_index_equal(result, expected)

i2 = i.copy()
i2 = pd.CategoricalIndex([np.nan, np.nan] + i[2:].tolist(),
categories=i.categories)
result = i.where(notnull(i2))
expected = i2
tm.assert_index_equal(result, expected)

def test_where_array_like(self):
i = self.create_index()
cond = [False] + [True] * (len(i) - 1)
klasses = [list, tuple, np.array, pd.Series]
expected = pd.CategoricalIndex([np.nan] + i[1:].tolist(),
categories=i.categories)

for klass in klasses:
result = i.where(klass(cond))
tm.assert_index_equal(result, expected)

def test_append(self):

ci = self.create_index()
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/indexes/test_multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,15 @@ def f():

self.assertRaises(NotImplementedError, f)

def test_where_array_like(self):
i = MultiIndex.from_tuples([('A', 1), ('A', 2)])
klasses = [list, tuple, np.array, pd.Series]
cond = [False, True]

for klass in klasses:
f = lambda: i.where(klass(cond))
self.assertRaises(NotImplementedError, f)

def test_repeat(self):
reps = 2
numbers = [1, 2, 3]
Expand Down
54 changes: 54 additions & 0 deletions pandas/tests/series/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1193,6 +1193,60 @@ def f():
expected = Series(np.nan, index=[9])
assert_series_equal(result, expected)

def test_where_array_like(self):
# see gh-15414
s = Series([1, 2, 3])
cond = [False, True, True]
expected = Series([np.nan, 2, 3])
klasses = [list, tuple, np.array, Series]

for klass in klasses:
result = s.where(klass(cond))
assert_series_equal(result, expected)

def test_where_invalid_input(self):
# see gh-15414: only boolean arrays accepted
s = Series([1, 2, 3])
msg = "Boolean array expected for the condition"

conds = [
[1, 0, 1],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a DataFrame as input (maybe that works if the names match?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works if names match. Fails if they do not. This is because we "align" the cond input with self before applying it, so if the names don't match cond becomes a DataFrame of NaN.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, test was added just now. Just waiting for Travis and Appveyor to approve now.

Series([2, 5, 7]),
["True", "False", "True"],
[Timestamp("2017-01-01"),
pd.NaT, Timestamp("2017-01-02")]
]

for cond in conds:
with tm.assertRaisesRegexp(ValueError, msg):
s.where(cond)

msg = "Array conditional must be same shape as self"
with tm.assertRaisesRegexp(ValueError, msg):
s.where([True])

def test_where_ndframe_align(self):
msg = "Array conditional must be same shape as self"
s = Series([1, 2, 3])

cond = [True]
with tm.assertRaisesRegexp(ValueError, msg):
s.where(cond)

expected = Series([1, np.nan, np.nan])

out = s.where(Series(cond))
tm.assert_series_equal(out, expected)

cond = np.array([False, True, False, True])
with tm.assertRaisesRegexp(ValueError, msg):
s.where(cond)

expected = Series([np.nan, 2, np.nan])

out = s.where(Series(cond))
tm.assert_series_equal(out, expected)

def test_where_setitem_invalid(self):

# GH 2702
Expand Down