Skip to content

BUG: Accept Generic Array-Like for .where #15414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Feb 15, 2017

Setup:

>>> s = Series([1, 2, 3])
>>> cond = [False, True, True]

Before:

>>> s.where(cond)
...
ValueError: where requires an ndarray like object for its condition

After:

>>> s.where(cond)
0    NaN
1    2.0
2    3.0
dtype: float64

Note that Index objects can accept array-like objects as conditionals, and I wrote tests to confirm.

@gfyoung gfyoung force-pushed the generic-where-gen-array branch from 129561c to 7288245 Compare February 15, 2017 18:36
The callable must not change input %(klass)s
(though pandas doesn't check it).
should return boolean %(klass)s or array. The callable must
not change input %(klass)s (though pandas doesn't check it).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add 0.20.0 now supports array-like?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't consider it an enhancement. We generally accept array-like across the board.

@@ -2479,6 +2479,25 @@ def _check_set(df, cond, check_dtypes=True):
expected = df[df['a'] == 1].reindex(df.index)
assert_frame_equal(result, expected)

def test_where_array_like(self):
klasses = [list, tuple, np.array]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this issue number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

df['b'] = 2
expected['b'] = [2, np.nan, 2]
cond = [[False, True], [True, False], [True, True]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung everything looks lgtm. Can you add some tests for rejected things, ping on green.

@@ -497,6 +497,18 @@ def test_where(self):
result = i.where(cond)
tm.assert_index_equal(result, expected)

def test_where_array_like(self):
i = self.create_index()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just confirmation tests. The behavior was never broken for indices.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Indexing Related to indexing on series/frames, not to indexes themselves labels Feb 15, 2017
@jreback
Copy link
Contributor

jreback commented Feb 15, 2017

don't suppose there is an issue about this outstanding?

@gfyoung
Copy link
Member Author

gfyoung commented Feb 15, 2017

don't suppose there is an issue about this outstanding?

No, there is not. I just stumbled across this while grappling with the NaN in integer Index.

@gfyoung
Copy link
Member Author

gfyoung commented Feb 15, 2017

@jreback : The Appveyor build is broken for everybody due to conda conflicts.

@jreback
Copy link
Contributor

jreback commented Feb 15, 2017

The Appveyor build is broken for everybody due to conda conflicts.

hmm, that's not friendly

@gfyoung gfyoung force-pushed the generic-where-gen-array branch 2 times, most recently from bf3bacd to f821bab Compare February 15, 2017 18:50
@jreback
Copy link
Contributor

jreback commented Feb 15, 2017

I think the miniconda installer changed to 3.6.

@gfyoung
Copy link
Member Author

gfyoung commented Feb 15, 2017

I think the miniconda installer changed to 3.6.

Yikes...though the installer website page doesn't seem to reflect that if that's true...

@jreback
Copy link
Contributor

jreback commented Feb 15, 2017

neah its not that. I had fixed the conda installer a while back at 4.2.15. So was installing the latest miniconda, but an incompat version of conda :<

switching to latest fixes. pushing soon.

@jreback
Copy link
Contributor

jreback commented Feb 15, 2017

ok rebase and windows should be good.

@gfyoung gfyoung force-pushed the generic-where-gen-array branch from f821bab to a58e322 Compare February 15, 2017 21:12
@jreback jreback added this to the 0.20.0 milestone Feb 16, 2017
@gfyoung gfyoung force-pushed the generic-where-gen-array branch from a58e322 to 52c60d6 Compare February 16, 2017 23:17
@codecov-io
Copy link

codecov-io commented Feb 17, 2017

Codecov Report

Merging #15414 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15414      +/-   ##
==========================================
+ Coverage   90.36%   90.36%   +<.01%     
==========================================
  Files         136      136              
  Lines       49532    49540       +8     
==========================================
+ Hits        44760    44768       +8     
  Misses       4772     4772
Impacted Files Coverage Δ
pandas/indexes/base.py 96.22% <ø> (ø)
pandas/core/generic.py 96.31% <100%> (+0.01%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81c57e2...5037932. Read the comment docs.

@@ -4717,24 +4717,41 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
applied as a function even if callable. Used in __setitem__.
"""
inplace = validate_bool_kwarg(inplace, 'inplace')
inplace = True if inplace else False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this just makes things confusing, do this below in a condiationl, don't modify inplace

#
# Conveniently, 'inplace' matches the boolean with
# which we want to fill.
cond = cond.fillna(inplace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just do a conditional (or do bool(inplace) I guess).

msg = "Boolean array expected for the condition, not {dtype}"

if not isinstance(cond, pd.DataFrame):
# This is a single-dimensional object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of this appears to be tested.

Copy link
Member Author

@gfyoung gfyoung Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)

I saw your comment above, and I have yet to write those tests. In addition, this branch gets hit all the time for the current .where tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but not specifically for validation boolean-like inputs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here. This branch gets hit whenever you call Series.where

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have not tests for the condition you added at all.

In [1]: s = Series([1,2,3])

In [2]: s.where(s+1)
Out[2]: 
0    1
1    2
2    3
dtype: int64

makes no sense and is not tested

sure this gets hit all the time with expected input. but I suspect is barely tests with unexpected input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting to it with the tests for invalid input. That example should break if you pull down my changes at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great, thanks @gfyoung ping when you want review

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Tests have been added. Have a look.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@gfyoung gfyoung force-pushed the generic-where-gen-array branch 2 times, most recently from 94d3cee to 5f755a1 Compare February 18, 2017 21:03
if not is_bool_dtype(dt):
raise ValueError(msg.format(dtype=dt))

cond = cond.astype(bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add copy=False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

msg = "Boolean array expected for the condition"

conds = [
[[1], [0], [1]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add some that have an invalid shape? e.g. a scalar, an unalignable Series, a 1-dim ndarray (or is that broadcast)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends in fact on the type of input you pass in. If you pass in an NDFrame instance for cond, we will "align" it with self, and there is no error. However, if the instance of that same input is not NDFrame, we will error due to misalignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so this would be good to test then :>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do once this recent commit has been tested on Travis and Appveyor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Tests have been added, and everything looks good. Have a look.

msg = "Boolean array expected for the condition"

conds = [
[1, 0, 1],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a DataFrame as input (maybe that works if the names match?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works if names match. Fails if they do not. This is because we "align" the cond input with self before applying it, so if the names don't match cond becomes a DataFrame of NaN.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, test was added just now. Just waiting for Travis and Appveyor to approve now.

@gfyoung gfyoung force-pushed the generic-where-gen-array branch 4 times, most recently from fda5980 to 5cb249e Compare February 21, 2017 08:52
# If 'inplace' is True, we want to fill with True
# before inverting. If 'inplace' is False, we will
# fill with False and do nothing else.
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really strange. Why are you not always filing with True? and if not please show me an example that depends on this.

Copy link
Member Author

@gfyoung gfyoung Feb 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of tests break, even basic ones like this:

>>> s = Series([1, 2, 3])
>>> s.where([True, False, True])
>>> s
0    NaN
1    NaN
2    NaN
dtype: float64

if not is_bool_dtype(dt):
raise ValueError(msg.format(dtype=dt))

cond = cond.astype(bool, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and you are flipping this again here? (according to inplace). This is really convoluted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but this change is just a refactor of what was there before. Also, test break if we don't do that inversion, as you saw above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but your refactor is now much more confusing. can you try to make this simpler?

Copy link
Member Author

@gfyoung gfyoung Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, are you saying the original code on master was simpler to understand!? I fail to see how that is the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, your new code is VERY confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't really answer my question.


conds = [
[[1], [0], [1]],
Series([[2], [5], [7]]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would you use a Series with a list-like? that is really odd here (even to test), just use a regular alignable, non-boolean series.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because a list-of-lists is already aligned with DataFrame, and I'm just wrapping it with Series. The only thing I want to be invalid about this input is that it has numeric (instead of boolean) values. I don't want to also test that the Series object gets aligned here.

df['b'] = 2
conds = [
[[0, 1], [1, 0], [1, 1]],
Series([[0, 2], [5, 0], [4, 7]]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer as above

[[0, 1], [1, 0], [1, 1]],
Series([[0, 2], [5, 0], [4, 7]]),
[["False", "True"], ["True", "False"],
["True", "True"]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try at least a DataFrame as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the generic-where-gen-array branch 3 times, most recently from ae72c0a to c7634ae Compare February 22, 2017 16:49
# fill with False and do nothing else.
#
# Conveniently, 'inplace' matches the boolean with
# which we want to fill.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment in particular is very very odd. what does inplace have to do with the value you fill?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code on master should illuminate how I refactored everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is that you are using the inplace variable in a completely odd way. It needs to be factored to avoid that. Its just not readable code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I rewrote it slightly to hopefully improve readability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Created fill_value variable to make the code more readable. Have a look.

@gfyoung gfyoung force-pushed the generic-where-gen-array branch 3 times, most recently from d2b951d to fe4753a Compare February 23, 2017 20:27
@gfyoung gfyoung force-pushed the generic-where-gen-array branch from fe4753a to 5037932 Compare February 24, 2017 15:22
@jreback jreback closed this in 7e0a71b Feb 24, 2017
@jreback
Copy link
Contributor

jreback commented Feb 24, 2017

thanks. I looked at the previous code and realize that inplace was being used similarly to how you are using it.

@gfyoung gfyoung deleted the generic-where-gen-array branch February 24, 2017 20:36

expected = DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan]])

out = df.where(Series(cond))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already worked like this before I think, but since we explicitly added a test for it: how can this result be explained? The Series aligned on the index and broadcasted along the columns? That is not how alignment otherwise works .. (alignment of cond is also not mentioned in the docstring)

Copy link
Member Author

@gfyoung gfyoung Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For NDFrame conditionals, we call .align so that no error is raised since the cond becomes aligned with self.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But those do not just align (unless you specify the axis on which to align):

In [32]: df = DataFrame([[1, 2, 3], [4, 5, 6]])

In [33]: cond = [True]

In [34]: df.align(Series(cond))
...
ValueError: Must specify axis=0 or 1

There is an axis keyword in where, but that does not seem to have any effect in this case.

So it seems that where aligns on the index, as opposed to eg other operators

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that. The code does it internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't doubt that you know that, but my initial question was: how do we explain this (to users)? This is not documented, this is IMO inconsistent, and one would expect the axis keyword can influence this.

Copy link
Member Author

@gfyoung gfyoung Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : Actually, no. Regardless of axis, we are able to align correctly and perform .where. I should add that this behavior existed before my patch and that @jreback just requested that I add tests to check existing behavior.

Whether or not this behavior is correct is another story. I can see your point regarding this inconsistency, but mind you that changing it to err would be an API change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know this behaviour was already like that before your patch, as I mentioned in my initial comment. I just raised this because we where explicitly adding a test for it (while the behaviour could also have a case of 'undefined/uncatched behaviour' that was not explicitly intended).

But probably better to discuss in separate issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other question: what is the purpose of the axis keyword?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see in the code this is used to align other, not cond.
I think it is rather confusing that cond and other are aligned in a different way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But probably better to discuss in separate issue.

Agreed.


.. versionadded:: 0.18.1

A callable can be used as cond.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This belonged to the versionadded directive, so I don't think it should be removed, as now the versionadded indication is not really correct (or remove both)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally found that confusing. I wouldn't have known that from reading the original doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the html docs it is slightly clearer. But indeed, for a correct syntax, the empty line should not be there, like

.. versionadded:: 0.18.1
    A callable can be used as cond.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : PR up to patch (#15509)

@jreback
Copy link
Contributor

jreback commented Feb 25, 2017

I think the alignment in .where is a mistake, but it probably is never used. We don't do alignment on boolean things anywhere else. So should remove this

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Author: gfyoung <[email protected]>

Closes pandas-dev#15414 from gfyoung/generic-where-gen-array and squashes the following commits:

5037932 [gfyoung] BUG: Accept generic array-like in .where
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants