-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Add support for multiple conditions assign statement #50343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3b775e3
to
7b98fb5
Compare
7b98fb5
to
199c947
Compare
I am not sure if I would've implemented it with an anonymous function (eg. lambda). Since |
@erfannariman Thanks for taking a look.
Indeed, here is a fixup: 7ad6b10
I didn't understand your statement fully, can you please elaborate? |
pandas/core/case_when.py
Outdated
default : Any, default is `None`. | ||
The default value to be used if all conditions evaluate False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When default is not specified, I believe the current implementation will convert e.g. int to object dtypes. That seems undesirable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. For now, we can use np.nan
. Pushed a fixup: be5a3b8
Using np.nan
also preserves date related dtypes properly. To illustrate:
>>> df = pd.DataFrame(dict(a=[1, 2, 3]))
>>> result = df.assign(
... new_column=pd.case_when(
... lambda x: x.a == 1, pd.Timestamp("today")
... )
... )
>>> result.new_column
0 2023-01-07 18:58:28.654455
1 NaT
2 NaT
Name: new_column, dtype: datetime64[ns]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the default of np.nan will work for all dtypes, e.g. in my experience it's common (but not universal) to have None instead of np.nan be the NA value for string. Leaving it as np.nan might unexpectedly coerce the dtype on some inputs as well, surprising users.
With this, it seems better to me to have default
be a required argument.
Also, how does the performance compare between:
df['a'].mask(cond1, value1).mask(~cond1 & cond2, value2)
pd.case_when(cond1, value1, cond2, value2, default=df['a'])(df)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this, it seems better to me to have default be a required argument.
Good idea, done.
Also, how does the performance compare between:
Is there a need to check this performance specially since case_when
uses Series.mask
now?
pandas/core/case_when.py
Outdated
import pandas.core.common as com | ||
|
||
|
||
def case_when(*args, default: Any = lib.no_default) -> Callable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users wanting to use this without assign would be calling something like
pd.case_when(cond1, val1, cond2, val2)(df)
That seems odd, and it's much more natural to have the DataFrame as an argument. On the other hand, that wouldn't allow as much usage with assign via method chaining if the DataFrame is changed as part of the chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, that wouldn't allow as much usage with assign via method chaining if the DataFrame is changed as part of the chain.
Exactly, what do you suggest in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should start adding functions to the API just to support method chaining in assign. Users can still use this with a simple wrapper, e.g.
def chain_case_when(df, *args, **kwargs):
return lambda df: pd.case_when(df, *args, **kwargs)
result = (
...
.assign(new_col=chain_case_when(lambda x: x.a == 1, 'first', lambda x: (x.a > 1), 'second', default='default'))
)
A few other thoughts on the API. A few of these repeat other comments here, just adding so it's all in one place.
- I think the
default
argument should have no default (must be user-specified) ENH: Add support for multiple conditions assign statement #50343 (comment) - Return value should be pandas object (Series)
- The top level method is necessary for constructing a Series where the default is a scalar. However it also seems natural to also have
Series.case_when
wheredefault
is the provided Series (so not an argument in the signature). We could also haveDataFrame.case_when
be an alias forpd.case_when(df, ...)
but that doesn't seem necessary. - Needs to support EAs (ENH: Add support for multiple conditions assign statement #50343 (comment))
- Does the current implementation support replacements (values) being Series? I think the answer is yes, but wanted to be sure. This should be added to docs and tests.
The tests should also be expanded:
- condition being the Boolean EA
- dtypes for values including bool, string, and EAs (numeric, categorical, datetimelike, period)
- Malformed arguments. E.g. what happens when a condition is something unexpected like a scalar or a list / ndarray / Series that is the wrong length or not aligned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally feel pd.Series.mask should be used instead of np.select. the implementation of pd.Series.mask takes care of a lot of things, including dtypes and alignment.
Having case_when as a function makes it versatile, one can use it within assign or on its own. Agreed that it should return a pandas object, preferably a series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on using mask (or mask-like internals if there is unnecessary logic that is wasting cycles)
Having case_when as a function makes it versatile
Just for clarification, is this opposition to also having Series.case_when
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since pd.case_when returns a pandas object I don't see a need for series.case_when( although it is convenient).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed it's not logically necessary, but it seems to be a common use case and using the a top level function is unnecessarily redundant:
# I'm assuming case_when supports Series
pd.case_when(ser, cond1, val1, cond2, val2, default=ser)
as opposed to
ser.case_when(cond1, val1, cond2, val2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kindly see #50343 (comment)
be5a3b8
to
216a36f
Compare
pandas/core/case_when.py
Outdated
|
||
for index, value in enumerate(args): | ||
if not index % 2: | ||
if callable(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can skip the callable check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the switch to Series.mask
(see the switch here), checking for callables is necessary to support passing conditions to case_when
as a callable.
pandas/core/case_when.py
Outdated
else: | ||
replacements.append(value) | ||
|
||
return np.select(booleans, replacements, default=default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there is an extension dtype? I don't think numpy select handles that well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kindly see #50343 (comment)
216a36f
to
54d92f6
Compare
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@ELHoussineT how's it going? |
Sorry I was away. Will push commits this weekend. @samukweku |
* Used to support conditional assignment operation.
54d92f6
to
45352da
Compare
@rhshadrach How is it looking, shall I proceed and write the tests and docs to wrap up this PR? |
pandas/core/case_when.py
Outdated
if is_array_like(default): | ||
series = pd.Series(default).reset_index(drop=True) | ||
else: | ||
series = pd.Series([default] * obj.shape[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of reseting the index, can you do pd.Series(default, index=self.index)
. This can be done in the else
clause as well. Then I think you can remove the resetting of the index below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use is_array_like over is_list_like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use is_array_like over is_list_like?
You are right, updated. fd12a30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of reseting the index, can you do pd.Series(default, index=self.index) . This can be done in the else clause as well. Then I think you can remove the resetting of the index below.
If we do so (see b897b8e), then the following case will have an undesired behavior as illustrated in this example:
>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]), index=['index 1', 'index 2', 'index 3'])
>>> df
a b
index 1 1 4
index 2 2 5
index 3 3 6
>>> pd.case_when(
... df,
... lambda x: (x.a == 1) & (x.b == 4),
... pd.Series([-1,-1,-1], index=['other index 1', 'other index 1', 'other index 1']),
... default=0,
... )
index 1 NaN
index 2 0.0
index 3 0.0
dtype: float64
As a result, I overwrote the index and produced a warning to not shock the user with this index overriding. See d01d6e8. This behavior is similar to what was done in other places in Pandas. For example:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the logic looks good, kindly indicate so as I will proceed then to write the tests and docs. Otherwise, we can also continue to iterate on the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is similar to what was done in other places in Pandas.
The behavior you're highlighting here is pandas aligning the index to self
's index. The logic in this PR entirely ignores the index. I would expect pandas to align the index, so that the NaN value in the first row is expected.
As an aside, I also don't think we should be warning in the link above, but that's a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. I see. What do you suggest in this case?
pandas/core/case_when.py
Outdated
if is_list_like(default): | ||
series = pd.Series(default.values, index=obj.index) | ||
else: | ||
series = pd.Series([default] * obj.shape[0], index=obj.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think use pd.Series(default, index=obj.index)
in both cases here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Updated eb70eaf.
pandas/core/case_when.py
Outdated
if isinstance(replacements, pd.Series) and not replacements.index.equals( | ||
obj.index | ||
): | ||
replacements = warn_and_override_index( | ||
replacements, f"(in args[{i+1}])", obj.index | ||
) | ||
|
||
if isinstance(conditions, pd.Series) and not conditions.index.equals(obj.index): | ||
conditions = warn_and_override_index( | ||
conditions, f"(in args[{i}])", obj.index | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this entirely, mask will align.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, updated. da79aa6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take another look, if it looks good, I will proceed to tests and docs. Thanks.
@rhshadrach PTAL. |
pandas/core/case_when.py
Outdated
replacements = args[i + 1] | ||
|
||
# `Series.mask` call | ||
series = series.mask(conditions, replacements) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we want to only modify the value on the first conditions
that evaluates to True (correct me if you think this is wrong!); this will modify it for all conditions. One approach is to maintain a modified
that starts as pd.Series(False, index=self.index)
. Then this line can become series = series.mask(~modified & conditions, replacements)
. After this line, we also need to update modified |= conditions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Updated (f2bf4d8).
Now, if multiple conditions are met, the value of the first one is used. For example:
>>> df
a b
index 1 1 4
index 2 2 5
index 3 3 6
>>> pd.case_when(
... df,
... lambda x: x.a > 0,
... 1,
... lambda x: x.a == 1,
... -1,
... default='default'
)
index 1 1
index 2 1
index 3 1
dtype: object
@rhshadrach I also had to add This is necessary since the default always determines the first dtype of the series even if the final series does not have any default value. To illustrate, see [1] which does not have [1] :
[2] :
|
Thanks for identifying this, it is indeed something I had not considered. I do not think Int64 is the right answer in the example you gave. As values of the input change, the dtype of the result can change, which can then lead to radically different answers later on. It is an example of "values-dependent behavior" which we try to avoid if at all possible. I think we should be using something like |
May I ask why? It is still not super clear to me why can't we proceed with Thanks |
I included why I think to be the case in my initial response. |
…ment operation." This reverts commit a5d678f.
Understood and I agree. In this case, I would suggest to remove The reason why I suggested the above is that this issue is built in To illustrate how the issue is in
Keeping that behavior as is in I am also happy to tackle this issue in Looking forward to your reply. If you agree with the above, please indicate that so I proceed with tests & docs. Best, |
@rhshadrach FYI #52662 |
@rhshadrach PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good, but this needs many additional tests. In particular, we should be testing the behavior when invalid inputs are given.
Assignment based on multiple conditions | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The ``pd.case_when`` API has now been added to support assignment based on multiple conditions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a proper link to the API documentation: :func:`case_when`
Also, can you refer to this as a function rather than an API.
|
||
from pandas.util._exceptions import find_stack_level | ||
|
||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't import all of pandas, only what you need from various modules.
|
||
def case_when(obj: pd.DataFrame | pd.Series, *args, default: Any) -> pd.Series: | ||
""" | ||
Returns a Series based on multiple conditions assignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we say "Construct" instead of Returns. Also, "multiple conditions assignment" sounds off to me, I would recommend just "multiple conditions".
""" | ||
Returns a Series based on multiple conditions assignment. | ||
|
||
This is useful when you want to assign a column based on multiple conditions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be used independently of assigning a column, I suggest this be removed.
Returns a Series based on multiple conditions assignment. | ||
|
||
This is useful when you want to assign a column based on multiple conditions. | ||
Uses `Series.mask` to perform the assignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an implementation detail.
else: | ||
conditions = args[i] | ||
|
||
# get replacements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
# get replacements | ||
replacements = args[i + 1] | ||
|
||
# `Series.mask` call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
# `Series.mask` call | ||
series = series.mask(~modified & conditions, replacements) | ||
|
||
# keeping track of which row got modified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
@@ -0,0 +1,61 @@ | |||
import numpy as np | |||
import pytest # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Include the error code when using noqa.
import pandas._testing as tm | ||
|
||
|
||
class TestCaseWhen: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not making use of the class, then use test functions instead of methods.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
Used to support conditional assignment operation.
Usage example:
Continued discussion here.
Credit:
Some of the logic is inspired by this implementation which at the time of constructing the PR was contributed to by @samukweku @Zeroto521 @thatlittleboy and @ericmjl
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.