ENH: Add support for multiple conditions assign statement #50343


Closed
wants to merge 18 commits

Conversation

ELHoussineT

@ELHoussineT ELHoussineT commented Dec 19, 2022

Adds support for a conditional assignment operation.

Usage example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,            # first condition, result   
                           (other_df.z > 4) | (other_df.x.isna()), 1, # second condition, result 
                           "some_value"))                             # default (optional)

Continued discussion here.

Credit:
Some of the logic is inspired by this implementation which at the time of constructing the PR was contributed to by @samukweku @Zeroto521 @thatlittleboy and @ericmjl

@ELHoussineT ELHoussineT marked this pull request as draft December 19, 2022 14:23
@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch 5 times, most recently from 3b775e3 to 7b98fb5 Compare January 6, 2023 11:10
@ELHoussineT ELHoussineT marked this pull request as ready for review January 6, 2023 11:10
@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch from 7b98fb5 to 199c947 Compare January 6, 2023 11:10
@erfannariman
Member

I am not sure I would have implemented it with an anonymous function (e.g. a lambda). Since np.select expects boolean arrays, we could just use a standard pandas condition like df["a"] == "some value".
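For illustration, the boolean-array style described here might look like the following with plain np.select (a sketch; the DataFrame and column names are made up for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]})

# Plain boolean Series as conditions -- no lambdas needed when the
# DataFrame is already in scope.
conditions = [df["a"] == "x", df["b"] > 2]
choices = [10, 20]
result = np.select(conditions, choices, default=0)
# -> array([10,  0, 20])
```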

@ELHoussineT
Author

ELHoussineT commented Jan 6, 2023

@erfannariman Thanks for taking a look.

I am not sure I would have implemented it with an anonymous function (e.g. a lambda).

Indeed, here is a fixup: 7ad6b10

Since np.select expects boolean arrays, we could just use a standard pandas condition like df["a"] == "some value".

I didn't understand your statement fully, can you please elaborate?

Comment on lines 29 to 30
default : Any, default None
    The default value to be used if all conditions evaluate to False.
Member

@rhshadrach rhshadrach Jan 6, 2023

When default is not specified, I believe the current implementation will convert e.g. int to object dtypes. That seems undesirable to me.

Author

@ELHoussineT ELHoussineT Jan 7, 2023

Indeed. For now, we can use np.nan. Pushed a fixup: be5a3b8

Using np.nan also preserves date related dtypes properly. To illustrate:

>>> df = pd.DataFrame(dict(a=[1, 2, 3]))
>>> result = df.assign(
...     new_column=pd.case_when(
...         lambda x: x.a == 1, pd.Timestamp("today")
...     )
... )
>>> result.new_column
0   2023-01-07 18:58:28.654455
1                          NaT
2                          NaT
Name: new_column, dtype: datetime64[ns]

Member

@rhshadrach rhshadrach Jan 8, 2023

I don't think a default of np.nan will work for all dtypes; e.g., in my experience it's common (but not universal) to have None rather than np.nan as the NA value for string dtypes. Leaving it as np.nan might also unexpectedly coerce the dtype on some inputs, surprising users.

With this, it seems better to me to have default be a required argument.

Also, how does the performance compare between:

df['a'].mask(cond1, value1).mask(~cond1 & cond2, value2)

pd.case_when(cond1, value1, cond2, value2, default=df['a'])(df)
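A rough way to measure the comparison asked about (a sketch; since the PR's pd.case_when is not importable here, an equivalent np.select expression stands in for it, and the data size is arbitrary):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.default_rng(0).integers(0, 10, 100_000)})
cond1 = df["a"] > 5
cond2 = df["a"] > 2


def chained_mask():
    # first condition wins; later masks skip already-modified rows
    return df["a"].mask(cond1, 1).mask(~cond1 & cond2, 2)


def select_based():
    # stand-in for pd.case_when(cond1, 1, cond2, 2, default=df["a"])
    return pd.Series(
        np.select([cond1, cond2], [1, 2], default=df["a"]), index=df.index
    )


print("chained mask:", timeit.timeit(chained_mask, number=5))
print("np.select   :", timeit.timeit(select_based, number=5))
```

Both expressions produce the same result; only the timing differs.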

Author

@ELHoussineT ELHoussineT Mar 8, 2023

With this, it seems better to me to have default be a required argument.

Good idea, done.

Also, how does the performance compare between:

Is there still a need to check this performance, especially since case_when uses Series.mask now?

import pandas.core.common as com


def case_when(*args, default: Any = lib.no_default) -> Callable:
Member

Users wanting to use this without assign would be calling something like

pd.case_when(cond1, val1, cond2, val2)(df)

That seems odd, and it's much more natural to have the DataFrame as an argument. On the other hand, that wouldn't allow as much usage with assign via method chaining if the DataFrame is changed as part of the chain.

Author

On the other hand, that wouldn't allow as much usage with assign via method chaining if the DataFrame is changed as part of the chain.

Exactly, what do you suggest in this case?

Member

@rhshadrach rhshadrach Jan 8, 2023

I don't think we should start adding functions to the API just to support method chaining in assign. Users can still use this with a simple wrapper, e.g.

def chain_case_when(*args, **kwargs):
    # the wrapper itself takes no DataFrame; it returns a callable for assign
    return lambda df: pd.case_when(df, *args, **kwargs)

result = (
    ...
    .assign(new_col=chain_case_when(lambda x: x.a == 1, 'first', lambda x: (x.a > 1), 'second', default='default'))
)

A few other thoughts on the API. A few of these repeat other comments here, just adding so it's all in one place.

  • I think the default argument should have no default (must be user-specified): #50343 (comment)
  • Return value should be a pandas object (Series)
  • The top-level function is necessary for constructing a Series where the default is a scalar. However, it also seems natural to have Series.case_when, where the default is the provided Series (so not an argument in the signature). We could also have DataFrame.case_when as an alias for pd.case_when(df, ...), but that doesn't seem necessary.
  • Needs to support EAs: #50343 (comment)
  • Does the current implementation support replacements (values) being Series? I think the answer is yes, but wanted to be sure. This should be added to the docs and tests.

The tests should also be expanded:

  • condition being the Boolean EA
  • dtypes for values including bool, string, and EAs (numeric, categorical, datetimelike, period)
  • Malformed arguments. E.g. what happens when a condition is something unexpected like a scalar or a list / ndarray / Series that is the wrong length or not aligned?

Contributor

I personally feel pd.Series.mask should be used instead of np.select. The implementation of pd.Series.mask takes care of a lot of things, including dtypes and alignment.

Having case_when as a function makes it versatile; one can use it within assign or on its own. Agreed that it should return a pandas object, preferably a Series.

Member

Agreed on using mask (or mask-like internals if there is unnecessary logic that is wasting cycles)

Having case_when as a function makes it versatile

Just for clarification, is this opposition to also having Series.case_when?

Contributor

Since pd.case_when returns a pandas object, I don't see a need for Series.case_when (although it is convenient).

Member

@rhshadrach rhshadrach Jan 8, 2023

Agreed it's not logically necessary, but it seems to be a common use case, and using the top-level function is needlessly verbose:

# I'm assuming case_when supports Series
pd.case_when(ser, cond1, val1, cond2, val2, default=ser)

as opposed to

ser.case_when(cond1, val1, cond2, val2)

Author

Kindly see #50343 (comment)

@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch 3 times, most recently from be5a3b8 to 216a36f Compare January 7, 2023 21:37

for index, value in enumerate(args):
    if not index % 2:
        if callable(value):
Contributor

I think you can skip the callable check

Author

@ELHoussineT ELHoussineT Mar 8, 2023

With the switch to Series.mask (see the switch here), checking for callables is necessary to support passing conditions to case_when as callables.

    else:
        replacements.append(value)

return np.select(booleans, replacements, default=default)
Contributor

What happens if there is an extension dtype? I don't think numpy select handles that well
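The reviewer's concern can be reproduced directly (a sketch): np.select coerces its choices to numpy arrays, so a nullable extension array loses its extension dtype in the result.

```python
import numpy as np
import pandas as pd

ea = pd.array([1, 2, 3], dtype="Int64")  # nullable extension array
cond = np.array([True, False, True])

result = np.select([cond], [ea], default=0)

# The output is a plain ndarray; the nullable Int64 extension dtype is gone.
print(type(result), result.dtype)
```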

Author

Kindly see #50343 (comment)

@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch from 216a36f to 54d92f6 Compare January 22, 2023 18:56
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 22, 2023
@samukweku
Contributor

@ELHoussineT how's it going?

@ELHoussineT
Author

Sorry I was away. Will push commits this weekend. @samukweku

@ELHoussineT ELHoussineT force-pushed the conditional-assignment branch from 54d92f6 to 45352da Compare March 8, 2023 14:17
@ELHoussineT
Author

@rhshadrach How is it looking? Shall I proceed with the tests and docs to wrap up this PR?

Comment on lines 142 to 145
if is_array_like(default):
    series = pd.Series(default).reset_index(drop=True)
else:
    series = pd.Series([default] * obj.shape[0])
Member

@rhshadrach rhshadrach Mar 16, 2023

Instead of resetting the index, can you do pd.Series(default, index=self.index)? This can be done in the else clause as well. Then I think you can remove the resetting of the index below.

Member

Why use is_array_like over is_list_like?

Author

Why use is_array_like over is_list_like?

You are right, updated. fd12a30

Author

Instead of resetting the index, can you do pd.Series(default, index=self.index)? This can be done in the else clause as well. Then I think you can remove the resetting of the index below.

If we do so (see b897b8e), then the following case will have an undesired behavior as illustrated in this example:

>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]), index=['index 1', 'index 2', 'index 3'])
>>> df
         a  b
index 1  1  4
index 2  2  5
index 3  3  6
>>> pd.case_when(
...     df,
...     lambda x: (x.a == 1) & (x.b == 4),
...     pd.Series([-1,-1,-1], index=['other index 1', 'other index 1', 'other index 1']),
...     default=0,
... )
index 1    NaN
index 2    0.0
index 3    0.0
dtype: float64

As a result, I overwrote the index and emitted a warning so the user is not surprised by the index being overridden. See d01d6e8. This behavior is similar to what is done in other places in pandas. For example:

https://github.com/ELHoussineT/pandas/blob/d01d6e897e0a567f9d9c8e3f4bc5fc3aa18d6fe8/pandas/core/frame.py#L3800-L3805

Author

If the logic looks good, kindly say so and I will then proceed to write the tests and docs. Otherwise, we can continue to iterate on the logic.

Member

@rhshadrach rhshadrach Mar 16, 2023

This behavior is similar to what was done in other places in Pandas.

The behavior you're highlighting here is pandas aligning the index to self's index. The logic in this PR entirely ignores the index. I would expect pandas to align the index, so that the NaN value in the first row is expected.

As an aside, I also don't think we should be warning in the link above, but that's a separate issue.
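The alignment behavior being described can be seen with Series.mask alone (a sketch with made-up labels): the replacement Series is aligned on the index before substitution, so missing labels surface as NaN rather than being matched positionally.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["i1", "i2", "i3"])
# Replacement deliberately covers only some labels, in a different order.
repl = pd.Series([-1, -2], index=["i3", "i1"])

# mask aligns `repl` to s.index before substituting where the condition holds.
out = s.mask(s > 15, repl)
# "i2" matched the condition but has no label in `repl`, so it becomes NaN.
print(out)
```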

Author

Hmm.. I see. What do you suggest in this case?

Comment on lines 154 to 157
if is_list_like(default):
    series = pd.Series(default.values, index=obj.index)
else:
    series = pd.Series([default] * obj.shape[0], index=obj.index)
Member

I think you can use pd.Series(default, index=obj.index) in both cases here.

Author

Makes sense. Updated eb70eaf.

Comment on lines 169 to 179
if isinstance(replacements, pd.Series) and not replacements.index.equals(
    obj.index
):
    replacements = warn_and_override_index(
        replacements, f"(in args[{i+1}])", obj.index
    )

if isinstance(conditions, pd.Series) and not conditions.index.equals(obj.index):
    conditions = warn_and_override_index(
        conditions, f"(in args[{i}])", obj.index
    )
Member

Remove this entirely, mask will align.

Author

Sure, updated. da79aa6

Author

Please take another look, if it looks good, I will proceed to tests and docs. Thanks.

@ELHoussineT
Author

@rhshadrach PTAL.

replacements = args[i + 1]

# `Series.mask` call
series = series.mask(conditions, replacements)
Member

I believe we want to modify the value only on the first condition that evaluates to True (correct me if you think this is wrong!); this will modify it for all conditions. One approach is to maintain a modified Series that starts as pd.Series(False, index=self.index). Then this line can become series = series.mask(~modified & conditions, replacements). After this line, we also need to update modified |= conditions.
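The bookkeeping described above can be sketched as a standalone function (not the PR's actual code; the names are illustrative):

```python
import pandas as pd


def case_when_sketch(df, *args, default):
    # Pairs of (condition, replacement); the first matching condition wins.
    series = pd.Series(default, index=df.index)
    modified = pd.Series(False, index=df.index)
    for i in range(0, len(args), 2):
        cond, repl = args[i], args[i + 1]
        if callable(cond):
            cond = cond(df)
        # Only touch rows no earlier condition has claimed.
        series = series.mask(~modified & cond, repl)
        modified |= cond
    return series


df = pd.DataFrame({"a": [1, 2, 3]})
out = case_when_sketch(df, lambda x: x.a > 0, 1, lambda x: x.a == 1, -1, default=0)
# All rows satisfy the first condition, so the second never fires: [1, 1, 1]
```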

Author

Good point. Updated (f2bf4d8).

Now, if multiple conditions are met, the value of the first one is used. For example:

>>> df 
         a  b
index 1  1  4
index 2  2  5
index 3  3  6

>>> pd.case_when(
...     df,
...     lambda x: x.a > 0,
...     1,
...     lambda x: x.a == 1,
...     -1,
...     default='default',
... )
index 1    1
index 2    1
index 3    1
dtype: object

@ELHoussineT
Author

@rhshadrach I also had to add .convert_dtypes() a5d678f.

This is necessary since the default always determines the initial dtype of the series, even if the final series does not contain the default value.

To illustrate, see [1] which does not have .convert_dtypes() while [2] has it. Notice how [2] has the correct dtype in the result series.

[1] :

>>> df 
         a  b
index 1  1  4
index 2  2  5
index 3  3  6

>>> pd.case_when(
...     df,
...     lambda x: x.a > 0,
...     1,
...     lambda x: x.a == 1,
...     -1,
...     default='default',
... )
index 1    1
index 2    1
index 3    1
dtype: object

[2] :

>>> df 
         a  b
index 1  1  4
index 2  2  5
index 3  3  6

>>> pd.case_when(
...     df,
...     lambda x: x.a > 0,
...     1,
...     lambda x: x.a == 1,
...     -1,
...     default='default',
... )
index 1    1
index 2    1
index 3    1
dtype: Int64

@rhshadrach
Member

@rhshadrach I also had to add .convert_dtypes() a5d678f.

This is necessary since the default always determines the first dtype of the series even if the final series does not have any default value.

Thanks for identifying this, it is indeed something I had not considered. I do not think Int64 is the right answer in the example you gave. As values of the input change, the dtype of the result can change, which can then lead to radically different answers later on. It is an example of "values-dependent behavior" which we try to avoid if at all possible.

I think we should be using something like find_common_type to determine the output dtype based on the input dtypes alone (and not the values!). However, I don't know if it is readily usable here, since this function accepts scalars. In short, I'm not sure what a good solution looks like here.
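As a sketch of the dtype-based (rather than value-based) approach, numpy's promotion rules can stand in for pandas' internal find_common_type: the result dtype is computed from the operand dtypes alone, never from the values.

```python
import numpy as np

# The promoted dtype depends only on the input dtypes, not on any values.
dtypes = [np.dtype("int64"), np.dtype("float64")]
out_dtype = np.result_type(*dtypes)
print(out_dtype)  # float64
```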

@ELHoussineT
Author

@rhshadrach

I do not think Int64 is the right answer in the example you gave.

May I ask why? It is still not clear to me why we can't proceed with .convert_dtypes().

Thanks

@rhshadrach
Member

May I ask why?

I included my reasoning in my initial response.

@ELHoussineT
Author

@rhshadrach

Understood and I agree.

In this case, I would suggest removing .convert_dtypes() (see 072bc0c) and mentioning this limitation in the docstring (see 747b379).

The reason I suggest this is that the issue is built into Series.mask(), and keeping the behavior consistent for now might be beneficial.

To illustrate how the issue is in Series.mask(), see the dtype of the result Series below:

>>> s = pd.Series(['a','b'])
0    a
1    b
dtype: object

>>> s.mask([True, True], 1)
0    1
1    1
dtype: object

Keeping that behavior as is in pd.case_when maintains consistency (even though one could argue it is undesired behavior), and if this gets fixed in Series.mask, the changes will propagate to pd.case_when, as it uses Series.mask under the hood.

I am also happy to tackle this issue in Series.mask after we are done with this PR. I am optimistic that we can find a solution.

Looking forward to your reply. If you agree with the above, please indicate so, and I will proceed with tests & docs.

Best,

@ELHoussineT
Author

@rhshadrach FYI #52662

@ELHoussineT
Author

@rhshadrach PTAL.

Member

@rhshadrach rhshadrach left a comment

The implementation looks good, but this needs many additional tests. In particular, we should be testing the behavior when invalid inputs are given.

Assignment based on multiple conditions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``pd.case_when`` API has now been added to support assignment based on multiple conditions.
Member

Add a proper link to the API documentation: :func:`case_when`

Also, can you refer to this as a function rather than an API.


from pandas.util._exceptions import find_stack_level

import pandas as pd
Member

Don't import all of pandas, only what you need from various modules.


def case_when(obj: pd.DataFrame | pd.Series, *args, default: Any) -> pd.Series:
"""
Returns a Series based on multiple conditions assignment.
Member

Can we say "Construct" instead of Returns. Also, "multiple conditions assignment" sounds off to me, I would recommend just "multiple conditions".

"""
Returns a Series based on multiple conditions assignment.

This is useful when you want to assign a column based on multiple conditions.
Member

This can be used independently of assigning a column, I suggest this be removed.

Returns a Series based on multiple conditions assignment.

This is useful when you want to assign a column based on multiple conditions.
Uses `Series.mask` to perform the assignment.
Member

I think this is an implementation detail.

else:
    conditions = args[i]

# get replacements
Member

Same.

# get replacements
replacements = args[i + 1]

# `Series.mask` call
Member

Same

# `Series.mask` call
series = series.mask(~modified & conditions, replacements)

# keeping track of which row got modified
Member

Same

@@ -0,0 +1,61 @@
import numpy as np
import pytest # noqa
Member

Why is this needed? Include the error code when using noqa.

import pandas._testing as tm


class TestCaseWhen:
Member

If not making use of the class, then use test functions instead of methods.

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Aug 25, 2023
This was referenced Sep 27, 2023
@samukweku samukweku mentioned this pull request Nov 19, 2023
Successfully merging this pull request may close these issues.

ENH: Dedicated method for creating conditional columns
6 participants