ENH: Add support for multiple conditions assign statement #50343


Closed
wants to merge 18 commits

Conversation

ELHoussineT

@ELHoussineT ELHoussineT commented Dec 19, 2022

Adds support for a conditional assignment operation.

Usage example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,            # first condition, result   
                           (other_df.z > 4) | (other_df.x.isna()), 1, # second condition, result 
                           "some_value"))                             # default (optional)

Continued discussion here.

Credit:
Some of the logic is inspired by this implementation which at the time of constructing the PR was contributed to by @samukweku @Zeroto521 @thatlittleboy and @ericmjl

@ELHoussineT ELHoussineT marked this pull request as draft December 19, 2022 14:23
@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch 5 times, most recently from 3b775e3 to 7b98fb5 Compare January 6, 2023 11:10
@ELHoussineT ELHoussineT marked this pull request as ready for review January 6, 2023 11:10
@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch from 7b98fb5 to 199c947 Compare January 6, 2023 11:10
@erfannariman
Member

I am not sure I would have implemented it with an anonymous function (e.g. a lambda). Since np.select expects boolean arrays, we could just use a standard pandas condition like df["a"] == "some value".
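For illustration, the boolean-array style described here might look like the following with plain np.select (a sketch; the DataFrame and column names are made up for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]})

# Plain boolean Series as conditions -- no lambdas needed when the
# DataFrame is already in scope.
conditions = [df["a"] == "x", df["b"] > 2]
choices = [10, 20]
result = np.select(conditions, choices, default=0)
# -> array([10,  0, 20])
```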

@ELHoussineT
Author

ELHoussineT commented Jan 6, 2023

@erfannariman Thanks for taking a look.

I am not sure I would have implemented it with an anonymous function (e.g. a lambda).

Indeed, here is a fixup: 7ad6b10

Since np.select expects boolean arrays, we could just use a standard pandas condition like df["a"] == "some value".

I didn't understand your statement fully, can you please elaborate?

Comment on lines 29 to 30
default : Any, default None
    The default value to be used if all conditions evaluate to False.
Member

@rhshadrach rhshadrach Jan 6, 2023

When default is not specified, I believe the current implementation will convert e.g. int to object dtypes. That seems undesirable to me.

Author

@ELHoussineT ELHoussineT Jan 7, 2023

Indeed. For now, we can use np.nan. Pushed a fixup: be5a3b8

Using np.nan also preserves date related dtypes properly. To illustrate:

>>> df = pd.DataFrame(dict(a=[1, 2, 3]))
>>> result = df.assign(
...     new_column=pd.case_when(
...         lambda x: x.a == 1, pd.Timestamp("today")
...     )
... )
>>> result.new_column
0   2023-01-07 18:58:28.654455
1                          NaT
2                          NaT
Name: new_column, dtype: datetime64[ns]

Member

@rhshadrach rhshadrach Jan 8, 2023

I don't think a default of np.nan will work for all dtypes; e.g., in my experience it's common (but not universal) to have None rather than np.nan as the NA value for string dtypes. Leaving it as np.nan might also unexpectedly coerce the dtype on some inputs, surprising users.

With this, it seems better to me to have default be a required argument.

Also, how does the performance compare between:

df['a'].mask(cond1, value1).mask(~cond1 & cond2, value2)

pd.case_when(cond1, value1, cond2, value2, default=df['a'])(df)
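A rough way to measure the comparison asked about (a sketch; since the PR's pd.case_when is not importable here, an equivalent np.select expression stands in for it, and the data size is arbitrary):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.default_rng(0).integers(0, 10, 100_000)})
cond1 = df["a"] > 5
cond2 = df["a"] > 2


def chained_mask():
    # first condition wins; later masks skip already-modified rows
    return df["a"].mask(cond1, 1).mask(~cond1 & cond2, 2)


def select_based():
    # stand-in for pd.case_when(cond1, 1, cond2, 2, default=df["a"])
    return pd.Series(
        np.select([cond1, cond2], [1, 2], default=df["a"]), index=df.index
    )


print("chained mask:", timeit.timeit(chained_mask, number=5))
print("np.select   :", timeit.timeit(select_based, number=5))
```

Both expressions produce the same result; only the timing differs.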

Author

@ELHoussineT ELHoussineT Mar 8, 2023

With this, it seems better to me to have default be a required argument.

Good idea, done.

Also, how does the performance compare between:

Is there still a need to check this performance, especially since case_when uses Series.mask now?

import pandas.core.common as com


def case_when(*args, default: Any = lib.no_default) -> Callable:
Member

Users wanting to use this without assign would be calling something like

pd.case_when(cond1, val1, cond2, val2)(df)

That seems odd, and it's much more natural to have the DataFrame as an argument. On the other hand, that wouldn't allow as much usage with assign via method chaining if the DataFrame is changed as part of the chain.

Author

On the other hand, that wouldn't allow as much usage with assign via method chaining if the DataFrame is changed as part of the chain.

Exactly, what do you suggest in this case?

Member

@rhshadrach rhshadrach Jan 8, 2023

I don't think we should start adding functions to the API just to support method chaining in assign. Users can still use this with a simple wrapper, e.g.

def chain_case_when(*args, **kwargs):
    # the wrapper itself takes no DataFrame; it returns a callable for assign
    return lambda df: pd.case_when(df, *args, **kwargs)

result = (
    ...
    .assign(new_col=chain_case_when(lambda x: x.a == 1, 'first', lambda x: (x.a > 1), 'second', default='default'))
)

A few other thoughts on the API. A few of these repeat other comments here, just adding so it's all in one place.

  • I think the default argument should have no default (must be user-specified): #50343 (comment)
  • Return value should be a pandas object (Series)
  • The top-level function is necessary for constructing a Series where the default is a scalar. However, it also seems natural to have Series.case_when, where the default is the provided Series (so not an argument in the signature). We could also have DataFrame.case_when as an alias for pd.case_when(df, ...), but that doesn't seem necessary.
  • Needs to support EAs: #50343 (comment)
  • Does the current implementation support replacements (values) being Series? I think the answer is yes, but wanted to be sure. This should be added to the docs and tests.

The tests should also be expanded:

  • condition being the Boolean EA
  • dtypes for values including bool, string, and EAs (numeric, categorical, datetimelike, period)
  • Malformed arguments. E.g. what happens when a condition is something unexpected like a scalar or a list / ndarray / Series that is the wrong length or not aligned?

Contributor

I personally feel pd.Series.mask should be used instead of np.select. The implementation of pd.Series.mask takes care of a lot of things, including dtypes and alignment.

Having case_when as a function makes it versatile; one can use it within assign or on its own. Agreed that it should return a pandas object, preferably a Series.

Member

Agreed on using mask (or mask-like internals if there is unnecessary logic that is wasting cycles)

Having case_when as a function makes it versatile

Just for clarification, is this opposition to also having Series.case_when?

Contributor

Since pd.case_when returns a pandas object, I don't see a need for Series.case_when (although it is convenient).

Member

@rhshadrach rhshadrach Jan 8, 2023

Agreed it's not logically necessary, but it seems to be a common use case, and using the top-level function is needlessly verbose:

# I'm assuming case_when supports Series
pd.case_when(ser, cond1, val1, cond2, val2, default=ser)

as opposed to

ser.case_when(cond1, val1, cond2, val2)

Author

Kindly see #50343 (comment)

@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch 3 times, most recently from be5a3b8 to 216a36f Compare January 7, 2023 21:37

for index, value in enumerate(args):
    if not index % 2:
        if callable(value):
Contributor

I think you can skip the callable check

Author

@ELHoussineT ELHoussineT Mar 8, 2023

With the switch to Series.mask (see the switch here), checking for callables is necessary to support passing conditions to case_when as callables.

    else:
        replacements.append(value)

return np.select(booleans, replacements, default=default)
Contributor

What happens if there is an extension dtype? I don't think numpy select handles that well
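The reviewer's concern can be reproduced directly (a sketch): np.select coerces its choices to numpy arrays, so a nullable extension array loses its extension dtype in the result.

```python
import numpy as np
import pandas as pd

ea = pd.array([1, 2, 3], dtype="Int64")  # nullable extension array
cond = np.array([True, False, True])

result = np.select([cond], [ea], default=0)

# The output is a plain ndarray; the nullable Int64 extension dtype is gone.
print(type(result), result.dtype)
```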

Author

Kindly see #50343 (comment)

@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch from 216a36f to 54d92f6 Compare January 22, 2023 18:56
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 22, 2023
@samukweku
Contributor

@ELHoussineT how's it going?

@ELHoussineT
Author

Sorry I was away. Will push commits this weekend. @samukweku

@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch from 54d92f6 to 45352da Compare March 8, 2023 14:17
@ELHoussineT
Author

@rhshadrach How is it looking? Shall I proceed with the tests and docs to wrap up this PR?

Comment on lines 142 to 145
if is_array_like(default):
    series = pd.Series(default).reset_index(drop=True)
else:
    series = pd.Series([default] * obj.shape[0])
Member

@rhshadrach rhshadrach Mar 16, 2023

Instead of resetting the index, can you do pd.Series(default, index=self.index)? This can be done in the else clause as well. Then I think you can remove the resetting of the index below.

Member

Why use is_array_like over is_list_like?

Author

Why use is_array_like over is_list_like?

You are right, updated. fd12a30

Author

Instead of resetting the index, can you do pd.Series(default, index=self.index)? This can be done in the else clause as well. Then I think you can remove the resetting of the index below.

If we do so (see b897b8e), then the following case will have an undesired behavior as illustrated in this example:

>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]), index=['index 1', 'index 2', 'index 3'])
>>> df
         a  b
index 1  1  4
index 2  2  5
index 3  3  6
>>> pd.case_when(
...     df,
...     lambda x: (x.a == 1) & (x.b == 4),
...     pd.Series([-1,-1,-1], index=['other index 1', 'other index 1', 'other index 1']),
...     default=0,
... )
index 1    NaN
index 2    0.0
index 3    0.0
dtype: float64

As a result, I overwrote the index and emitted a warning so the user is not surprised by the index being overridden. See d01d6e8. This behavior is similar to what is done in other places in pandas. For example:

https://github.com/ELHoussineT/pandas/blob/d01d6e897e0a567f9d9c8e3f4bc5fc3aa18d6fe8/pandas/core/frame.py#L3800-L3805

Author

If the logic looks good, kindly say so and I will then proceed to write the tests and docs. Otherwise, we can continue to iterate on the logic.

Member

@rhshadrach rhshadrach Mar 16, 2023

This behavior is similar to what was done in other places in Pandas.

The behavior you're highlighting here is pandas aligning the index to self's index. The logic in this PR entirely ignores the index. I would expect pandas to align the index, so that the NaN value in the first row is expected.

As an aside, I also don't think we should be warning in the link above, but that's a separate issue.
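The alignment behavior being described can be seen with Series.mask alone (a sketch with made-up labels): the replacement Series is aligned on the index before substitution, so missing labels surface as NaN rather than being matched positionally.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["i1", "i2", "i3"])
# Replacement deliberately covers only some labels, in a different order.
repl = pd.Series([-1, -2], index=["i3", "i1"])

# mask aligns `repl` to s.index before substituting where the condition holds.
out = s.mask(s > 15, repl)
# "i2" matched the condition but has no label in `repl`, so it becomes NaN.
print(out)
```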

Author

Hmm.. I see. What do you suggest in this case?

Comment on lines 154 to 157
if is_list_like(default):
    series = pd.Series(default.values, index=obj.index)
else:
    series = pd.Series([default] * obj.shape[0], index=obj.index)
Member

I think you can use pd.Series(default, index=obj.index) in both cases here.

Author

Makes sense. Updated eb70eaf.

Comment on lines 169 to 179
if isinstance(replacements, pd.Series) and not replacements.index.equals(
    obj.index
):
    replacements = warn_and_override_index(
        replacements, f"(in args[{i+1}])", obj.index
    )

if isinstance(conditions, pd.Series) and not conditions.index.equals(obj.index):
    conditions = warn_and_override_index(
        conditions, f"(in args[{i}])", obj.index
    )
Member

Remove this entirely, mask will align.

Author

Sure, updated. da79aa6

Author

Please take another look, if it looks good, I will proceed to tests and docs. Thanks.

@ELHoussineT
Author

@rhshadrach PTAL.

replacements = args[i + 1]

# `Series.mask` call
series = series.mask(conditions, replacements)
Member

I believe we want to modify the value only on the first condition that evaluates to True (correct me if you think this is wrong!); this will modify it for all conditions. One approach is to maintain a modified Series that starts as pd.Series(False, index=self.index). Then this line can become series = series.mask(~modified & conditions, replacements). After this line, we also need to update modified |= conditions.
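The bookkeeping described above can be sketched as a standalone function (not the PR's actual code; the names are illustrative):

```python
import pandas as pd


def case_when_sketch(df, *args, default):
    # Pairs of (condition, replacement); the first matching condition wins.
    series = pd.Series(default, index=df.index)
    modified = pd.Series(False, index=df.index)
    for i in range(0, len(args), 2):
        cond, repl = args[i], args[i + 1]
        if callable(cond):
            cond = cond(df)
        # Only touch rows no earlier condition has claimed.
        series = series.mask(~modified & cond, repl)
        modified |= cond
    return series


df = pd.DataFrame({"a": [1, 2, 3]})
out = case_when_sketch(df, lambda x: x.a > 0, 1, lambda x: x.a == 1, -1, default=0)
# All rows satisfy the first condition, so the second never fires: [1, 1, 1]
```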

Author

Good point. Updated (f2bf4d8).

Now, if multiple conditions are met, the value of the first one is used. For example:

>>> df 
         a  b
index 1  1  4
index 2  2  5
index 3  3  6

>>> pd.case_when(
...     df,
...     lambda x: x.a > 0,
...     1,
...     lambda x: x.a == 1,
...     -1,
...     default='default',
... )
index 1    1
index 2    1
index 3    1
dtype: object

@ELHoussineT
Author

@rhshadrach I also had to add .convert_dtypes() a5d678f.

This is necessary since the default always determines the initial dtype of the series, even if the final series does not contain the default value.

To illustrate, see [1] which does not have .convert_dtypes() while [2] has it. Notice how [2] has the correct dtype in the result series.

[1] :

>>> df 
         a  b
index 1  1  4
index 2  2  5
index 3  3  6

>>> pd.case_when(
...     df,
...     lambda x: x.a > 0,
...     1,
...     lambda x: x.a == 1,
...     -1,
...     default='default',
... )
index 1    1
index 2    1
index 3    1
dtype: object

[2] :

>>> df 
         a  b
index 1  1  4
index 2  2  5
index 3  3  6

>>> pd.case_when(
...     df,
...     lambda x: x.a > 0,
...     1,
...     lambda x: x.a == 1,
...     -1,
...     default='default',
... )
index 1    1
index 2    1
index 3    1
dtype: Int64

@rhshadrach
Member

@rhshadrach I also had to add .convert_dtypes() a5d678f.

This is necessary since the default always determines the first dtype of the series even if the final series does not have any default value.

Thanks for identifying this, it is indeed something I had not considered. I do not think Int64 is the right answer in the example you gave. As values of the input change, the dtype of the result can change, which can then lead to radically different answers later on. It is an example of "values-dependent behavior" which we try to avoid if at all possible.

I think we should be using something like find_common_type to determine the output dtype based on the input dtypes alone (and not the values!). However, I don't know if it is readily usable here, since this function accepts scalars. In short, I'm not sure what a good solution looks like here.
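As a sketch of the dtype-based (rather than value-based) approach, numpy's promotion rules can stand in for pandas' internal find_common_type: the result dtype is computed from the operand dtypes alone, never from the values.

```python
import numpy as np

# The promoted dtype depends only on the input dtypes, not on any values.
dtypes = [np.dtype("int64"), np.dtype("float64")]
out_dtype = np.result_type(*dtypes)
print(out_dtype)  # float64
```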

@ELHoussineT
Author

@rhshadrach

I do not think Int64 is the right answer in the example you gave.

May I ask why? It is still not clear to me why we can't proceed with .convert_dtypes().

Thanks

@rhshadrach
Member

May I ask why?

I included my reasoning in my initial response.

@ELHoussineT
Author

@rhshadrach

Understood and I agree.

In this case, I would suggest removing .convert_dtypes() (see 072bc0c) and mentioning this limitation in the docstring (see 747b379).

The reason I suggest this is that the issue is built into Series.mask(), and keeping the behavior consistent for now might be beneficial.

To illustrate how the issue is in Series.mask(), see the dtype of the result Series below:

>>> s = pd.Series(['a','b'])
0    a
1    b
dtype: object

>>> s.mask([True, True], 1)
0    1
1    1
dtype: object

Keeping that behavior as is in pd.case_when maintains consistency (even though one could argue it is undesired behavior), and if this gets fixed in Series.mask, the changes will propagate to pd.case_when, as it uses Series.mask under the hood.

I am also happy to tackle this issue in Series.mask after we are done with this PR. I am optimistic that we can find a solution.

Looking forward to your reply. If you agree with the above, please indicate so, and I will proceed with tests & docs.

Best,

@ELHoussineT
Author

@rhshadrach FYI #52662

@ELHoussineT
Author

@rhshadrach PTAL.

Member

@rhshadrach rhshadrach left a comment

The implementation looks good, but this needs many additional tests. In particular, we should be testing the behavior when invalid inputs are given.

Assignment based on multiple conditions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``pd.case_when`` API has now been added to support assignment based on multiple conditions.
Member

Add a proper link to the API documentation: :func:`case_when`

Also, can you refer to this as a function rather than an API.


from pandas.util._exceptions import find_stack_level

import pandas as pd
Member

Don't import all of pandas, only what you need from various modules.


def case_when(obj: pd.DataFrame | pd.Series, *args, default: Any) -> pd.Series:
"""
Returns a Series based on multiple conditions assignment.
Member

Can we say "Construct" instead of Returns. Also, "multiple conditions assignment" sounds off to me, I would recommend just "multiple conditions".

"""
Returns a Series based on multiple conditions assignment.

This is useful when you want to assign a column based on multiple conditions.
Member

This can be used independently of assigning a column, I suggest this be removed.

Returns a Series based on multiple conditions assignment.

This is useful when you want to assign a column based on multiple conditions.
Uses `Series.mask` to perform the assignment.
Member

I think this is an implementation detail.

else:
    conditions = args[i]

# get replacements
Member

Same.

# get replacements
replacements = args[i + 1]

# `Series.mask` call
Member

Same

# `Series.mask` call
series = series.mask(~modified & conditions, replacements)

# keeping track of which row got modified
Member

Same

@@ -0,0 +1,61 @@
import numpy as np
import pytest # noqa
Member

Why is this needed? Include the error code when using noqa.

import pandas._testing as tm


class TestCaseWhen:
Member

If not making use of the class, then use test functions instead of methods.

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Aug 25, 2023
This was referenced Sep 27, 2023
@samukweku samukweku mentioned this pull request Nov 19, 2023
Successfully merging this pull request may close these issues.

ENH: Dedicated method for creating conditional columns
6 participants