API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation #59414

jorisvandenbossche · 2024-08-05T09:43:33Z

While working on adding any/all support for the object-dtype version of the StringDtype in #58451, I bumped into some issues with the current testing and implementation.

It seemed that our current testing is a bit limited, so I expanded the existing test to cover more cases and to be parametrized over all string dtype variants (including plain object dtype).

But that uncovered some issues with the current implementation of the pyarrow-backed version:

any could return pd.NA with skipna=False, whereas for this version of the string dtype (which does not use Kleene logic) the result should always be True or False. This is an easy fix: missing values were previously filled only for all, and are now filled for any as well.
How should missing values be treated: as truthy or falsey? The current pyarrow-based implementation explicitly filled the missing values with False, but the documentation says "If skipna is False, then NA are treated as True, because these are not equal to zero."
I find that a bit strange (my first expectation would be to treat missing values as False), but it is explicitly documented and follows the behaviour of the current object dtype (e.g. pd.Series(["a", np.nan], dtype=object).all(skipna=False) gives True, not False).

So for now this PR updated the pyarrow implementation to follow the documented behaviour.
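
To make the two candidate policies concrete, here is a minimal pure-Python sketch of an any/all reduction over strings with missing values. The helper names (`is_missing`, `to_bool`, `reduce_any`, `reduce_all`) and the `na_truthy` switch are hypothetical and only model the semantics discussed above; they are not the pandas implementation.

```python
import math


def is_missing(value):
    # Hypothetical helper: treat None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_bool(value, na_truthy):
    # A missing value maps to the chosen policy; a string is truthy unless empty.
    if is_missing(value):
        return na_truthy
    return value != ""


def reduce_any(values, skipna=True, na_truthy=True):
    if skipna:
        values = [v for v in values if not is_missing(v)]
    return any(to_bool(v, na_truthy) for v in values)


def reduce_all(values, skipna=True, na_truthy=True):
    if skipna:
        values = [v for v in values if not is_missing(v)]
    return all(to_bool(v, na_truthy) for v in values)


nan = float("nan")

# Documented policy: NA is treated as True when skipna=False.
all_documented = reduce_all(["a", nan], skipna=False, na_truthy=True)
# Alternative policy: NA is treated as False when skipna=False.
all_alternative = reduce_all(["a", nan], skipna=False, na_truthy=False)

# With only an empty string and a NaN, `any` also flips between the policies.
any_documented = reduce_any(["", nan], skipna=False, na_truthy=True)
any_alternative = reduce_any(["", nan], skipna=False, na_truthy=False)
```

Under either policy the result is always a plain True or False, which is the property the non-Kleene string dtype needs.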

Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jorisvandenbossche · 2024-08-05T09:46:12Z

pandas/core/arrays/string_arrow.py

-                nas = pc.invert(pc.is_null(self._pa_array))
-                arr = pc.and_kleene(nas, pc.not_equal(self._pa_array, ""))
+            if not skipna:
+                nas = pc.is_null(self._pa_array)
+                arr = pc.or_kleene(nas, pc.not_equal(self._pa_array, ""))


Before, when converting the string array to a boolean array, a value counted as True if it was not "" and not null; this diff changes that to: not "" or null.

jorisvandenbossche · 2024-08-05T09:47:14Z

pandas/tests/reductions/test_reductions.py

        assert ser.any()
        assert ser.all()
-        assert not ser.all(skipna=False)
+        assert ser.any(skipna=False)
+        assert ser.all(skipna=False)  # NaN is considered truthy


The test cases where I added the comment # NaN is considered truthy are those where the result would change from True to False if NaN were considered falsey instead.

mroeschke · 2024-08-05T17:59:27Z

On a related note, in 3.0 .idxmax/.idxmin will raise a ValueError if skipna=False and an NA value is encountered. There could be an API-consistency argument for aligning with that behavior in the future.

Besides that point, I would also agree with your initial expectation that NA values are falsey.

jorisvandenbossche · 2024-08-06T15:37:56Z

> On a related note, in 3.0 .idxmax/.idxmin will raise a ValueError if skipna=False and an NA value is encountered. There could be an API-consistency argument for aligning with that behavior in the future.

For any/all, I don't think we will have to raise on missing values in the future; instead we can use Kleene logic and return pd.NA in some cases (although that might then just cause an error downstream in your code).

That's the behaviour currently implemented for the nullable BooleanDtype, and if we plan to stick with it, I would say it's not worth changing the current NaN-based dtypes to start raising an error for NaNs (and we can preserve the strange "NaN is considered True because it is not equal to 0" behaviour for now).
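
For comparison, the Kleene-logic behaviour of the nullable boolean dtype mentioned here can be sketched as follows; this is an illustration on small arrays and assumes a pandas version with the nullable "boolean" dtype:

```python
import pandas as pd

with_true = pd.array([True, pd.NA], dtype="boolean")
with_false = pd.array([False, pd.NA], dtype="boolean")

# A known True decides `any`, and a known False decides `all`.
any_with_true = with_true.any(skipna=False)
all_with_false = with_false.all(skipna=False)

# Otherwise the missing value could change the answer, so the result is pd.NA.
all_with_true = with_true.all(skipna=False)
any_with_false = with_false.any(skipna=False)
```

A pd.NA result cannot be used directly in an `if` statement (pandas raises a TypeError because the truth value of NA is ambiguous), which is the kind of downstream error referred to above.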

mroeschke · 2024-08-06T17:06:31Z

Thanks @jorisvandenbossche

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation

83a8059

jorisvandenbossche added the API Design, Strings (String extension data type and string data), and Reduction Operations (sum, mean, min, max, etc.) labels Aug 5, 2024

jorisvandenbossche requested a review from phofl August 5, 2024 09:43

Merge remote-tracking branch 'upstream/main' into string-dtype-reductions-any-all

868ba40

mroeschke approved these changes Aug 6, 2024

mroeschke merged commit ac69522 into pandas-dev:main Aug 6, 2024
39 of 45 checks passed

jorisvandenbossche deleted the string-dtype-reductions-any-all branch August 7, 2024 07:20

jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

67caf2d

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

7195f10

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

3cddd04

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 27, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

337ef04

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

1071bea

jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

8e40d6b

jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

974773d

jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

a9dc596

jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (pandas-dev#59414)

35ebe68

jorisvandenbossche added a commit that referenced this pull request Oct 9, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation (#59414)

6fad5c9

jorisvandenbossche added the backported label Oct 10, 2024
