-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Accept Generic Array-Like for .where #15414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
129561c
to
7288245
Compare
The callable must not change input %(klass)s | ||
(though pandas doesn't check it). | ||
should return boolean %(klass)s or array. The callable must | ||
not change input %(klass)s (though pandas doesn't check it). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add 0.20.0 now supports array-like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't consider it an enhancement. We generally accept array-like across the board.
pandas/tests/frame/test_indexing.py
Outdated
@@ -2479,6 +2479,25 @@ def _check_set(df, cond, check_dtypes=True): | |||
expected = df[df['a'] == 1].reindex(df.index) | |||
assert_frame_equal(result, expected) | |||
|
|||
def test_where_array_like(self): | |||
klasses = [list, tuple, np.array] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
df['b'] = 2 | ||
expected['b'] = [2, np.nan, 2] | ||
cond = [[False, True], [True, False], [True, True]] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung everything looks lgtm. Can you add some tests for rejected things, ping on green.
@@ -497,6 +497,18 @@ def test_where(self): | |||
result = i.where(cond) | |||
tm.assert_index_equal(result, expected) | |||
|
|||
def test_where_array_like(self): | |||
i = self.create_index() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are just confirmation tests. The behavior was never broken for indices.
don't suppose there is an issue about this outstanding? |
No, there is not. I just stumbled across this while grappling with the |
@jreback : The Appveyor build is broken for everybody due to |
hmm, that's not friendly |
bf3bacd
to
f821bab
Compare
I think the miniconda installer changed to 3.6. |
Yikes...though the installer website page doesn't seem to reflect that if that's true... |
neah its not that. I had fixed the conda installer a while back at 4.2.15. So was installing the latest miniconda, but an incompat version of conda :< switching to latest fixes. pushing soon. |
ok rebase and windows should be good. |
f821bab
to
a58e322
Compare
a58e322
to
52c60d6
Compare
Codecov Report
@@ Coverage Diff @@
## master #15414 +/- ##
==========================================
+ Coverage 90.36% 90.36% +<.01%
==========================================
Files 136 136
Lines 49532 49540 +8
==========================================
+ Hits 44760 44768 +8
Misses 4772 4772
Continue to review full report at Codecov.
|
pandas/core/generic.py
Outdated
@@ -4717,24 +4717,41 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None, | |||
applied as a function even if callable. Used in __setitem__. | |||
""" | |||
inplace = validate_bool_kwarg(inplace, 'inplace') | |||
inplace = True if inplace else False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no this just makes things confusing, do this below in a condiationl, don't modify inplace
pandas/core/generic.py
Outdated
# | ||
# Conveniently, 'inplace' matches the boolean with | ||
# which we want to fill. | ||
cond = cond.fillna(inplace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just do a conditional (or do bool(inplace)
I guess).
msg = "Boolean array expected for the condition, not {dtype}" | ||
|
||
if not isinstance(cond, pd.DataFrame): | ||
# This is a single-dimensional object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of this appears to be tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test for rejected things (e.g. non-boolean list-like, Timestamp, string, etcc)
I saw your comment above, and I have yet to write those tests. In addition, this branch gets hit all the time for the current .where
tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, but not specifically for validation boolean-like inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean here. This branch gets hit whenever you call Series.where
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these have not tests for the condition you added at all.
In [1]: s = Series([1,2,3])
In [2]: s.where(s+1)
Out[2]:
0 1
1 2
2 3
dtype: int64
makes no sense and is not tested
sure this gets hit all the time with expected input. but I suspect is barely tests with unexpected input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting to it with the tests for invalid input. That example should break if you pull down my changes at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok great, thanks @gfyoung ping when you want review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Tests have been added. Have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
94d3cee
to
5f755a1
Compare
pandas/core/generic.py
Outdated
if not is_bool_dtype(dt): | ||
raise ValueError(msg.format(dtype=dt)) | ||
|
||
cond = cond.astype(bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add copy=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
msg = "Boolean array expected for the condition" | ||
|
||
conds = [ | ||
[[1], [0], [1]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add some that have an invalid shape? e.g. a scalar, an unalignable Series, a 1-dim ndarray (or is that broadcast)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends in fact on the type of input you pass in. If you pass in an NDFrame
instance for cond
, we will "align" it with self
, and there is no error. However, if the instance of that same input is not NDFrame
, we will error due to misalignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so this would be good to test then :>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do once this recent commit has been tested on Travis and Appveyor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Tests have been added, and everything looks good. Have a look.
msg = "Boolean array expected for the condition" | ||
|
||
conds = [ | ||
[1, 0, 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a DataFrame as input (maybe that works if the names match?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works if names match. Fails if they do not. This is because we "align" the cond
input with self
before applying it, so if the names don't match cond
becomes a DataFrame
of NaN
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, test was added just now. Just waiting for Travis and Appveyor to approve now.
fda5980
to
5cb249e
Compare
pandas/core/generic.py
Outdated
# If 'inplace' is True, we want to fill with True | ||
# before inverting. If 'inplace' is False, we will | ||
# fill with False and do nothing else. | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really strange. Why are you not always filing with True? and if not please show me an example that depends on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of tests break, even basic ones like this:
>>> s = Series([1, 2, 3])
>>> s.where([True, False, True])
>>> s
0 NaN
1 NaN
2 NaN
dtype: float64
if not is_bool_dtype(dt): | ||
raise ValueError(msg.format(dtype=dt)) | ||
|
||
cond = cond.astype(bool, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you are flipping this again here? (according to inplace). This is really convoluted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, but this change is just a refactor of what was there before. Also, test break if we don't do that inversion, as you saw above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, but your refactor is now much more confusing. can you try to make this simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, are you saying the original code on master
was simpler to understand!? I fail to see how that is the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, your new code is VERY confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't really answer my question.
|
||
conds = [ | ||
[[1], [0], [1]], | ||
Series([[2], [5], [7]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would you use a Series with a list-like? that is really odd here (even to test), just use a regular alignable, non-boolean series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because a list-of-lists is already aligned with DataFrame
, and I'm just wrapping it with Series
. The only thing I want to be invalid about this input is that it has numeric (instead of boolean) values. I don't want to also test that the Series
object gets aligned here.
df['b'] = 2 | ||
conds = [ | ||
[[0, 1], [1, 0], [1, 1]], | ||
Series([[0, 2], [5, 0], [4, 7]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same answer as above
[[0, 1], [1, 0], [1, 1]], | ||
Series([[0, 2], [5, 0], [4, 7]]), | ||
[["False", "True"], ["True", "False"], | ||
["True", "True"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try at least a DataFrame as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
ae72c0a
to
c7634ae
Compare
pandas/core/generic.py
Outdated
# fill with False and do nothing else. | ||
# | ||
# Conveniently, 'inplace' matches the boolean with | ||
# which we want to fill. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment in particular is very very odd. what does inplace have to do with the value you fill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code on master
should illuminate how I refactored everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my point is that you are using the inplace variable in a completely odd way. It needs to be factored to avoid that. Its just not readable code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I rewrote it slightly to hopefully improve readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Created fill_value
variable to make the code more readable. Have a look.
d2b951d
to
fe4753a
Compare
fe4753a
to
5037932
Compare
thanks. I looked at the previous code and realize that |
|
||
expected = DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan]]) | ||
|
||
out = df.where(Series(cond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This already worked like this before I think, but since we explicitly added a test for it: how can this result be explained? The Series aligned on the index and broadcasted along the columns? That is not how alignment otherwise works .. (alignment of cond
is also not mentioned in the docstring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For NDFrame
conditionals, we call .align
so that no error is raised since the cond
becomes aligned with self
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But those do not just align (unless you specify the axis on which to align):
In [32]: df = DataFrame([[1, 2, 3], [4, 5, 6]])
In [33]: cond = [True]
In [34]: df.align(Series(cond))
...
ValueError: Must specify axis=0 or 1
There is an axis keyword in where
, but that does not seem to have any effect in this case.
So it seems that where
aligns on the index, as opposed to eg other operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that. The code does it internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't doubt that you know that, but my initial question was: how do we explain this (to users)? This is not documented, this is IMO inconsistent, and one would expect the axis
keyword can influence this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche : Actually, no. Regardless of axis
, we are able to align correctly and perform .where
. I should add that this behavior existed before my patch and that @jreback just requested that I add tests to check existing behavior.
Whether or not this behavior is correct is another story. I can see your point regarding this inconsistency, but mind you that changing it to err would be an API change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know this behaviour was already like that before your patch, as I mentioned in my initial comment. I just raised this because we where explicitly adding a test for it (while the behaviour could also have a case of 'undefined/uncatched behaviour' that was not explicitly intended).
But probably better to discuss in separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other question: what is the purpose of the axis
keyword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see in the code this is used to align other
, not cond
.
I think it is rather confusing that cond
and other
are aligned in a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But probably better to discuss in separate issue.
Agreed.
|
||
.. versionadded:: 0.18.1 | ||
|
||
A callable can be used as cond. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belonged to the versionadded
directive, so I don't think it should be removed, as now the versionadded indication is not really correct (or remove both)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally found that confusing. I wouldn't have known that from reading the original doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the html docs it is slightly clearer. But indeed, for a correct syntax, the empty line should not be there, like
.. versionadded:: 0.18.1
A callable can be used as cond.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche : PR up to patch (#15509)
I think the alignment in |
Author: gfyoung <[email protected]> Closes pandas-dev#15414 from gfyoung/generic-where-gen-array and squashes the following commits: 5037932 [gfyoung] BUG: Accept generic array-like in .where
Setup:
Before:
After:
Note that
Index
objects can accept array-like objects as conditionals, and I wrote tests to confirm.