Skip to content

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation #59414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

While working on adding any/all support for the object-dtype version of the StringDtype in #58451, I bumped into some issues with the current testing and implementation.

It seemed that our current testing is a bit limited, so I expanded the existing test to cover more cases and to be parametrized over all string dtype variants (including plain object dtype).

But that uncovered some issues with the current implementation of the pyarrow-backed version:

  • any could return pd.NA in case of skipna=False, while for this version of the string dtype (not using Kleene logic), the result should always be True or False. This is an easy fix (we currently were only filling missing values in case of all, also do this for any)
  • How to treat missing values? As truthy or falsey? The current pyarrow-based implementation explicitly filled the missing values with False, but the documentation says "If skipna is False, then NA are treated as True, because these are not equal to zero."
    I find that a bit strange (my first expectation would be to treat missing values as False), but it is explicitly documented, and follows the behaviour of current object dtype (eg pd.Series(["a". np.nan], dtype=object).all(skipna=False) gives True, not False.

So for now this PR updated the pyarrow implementation to follow the documented behaviour.

  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jorisvandenbossche jorisvandenbossche added API Design Strings String extension data type and string data Reduction Operations sum, mean, min, max, etc. labels Aug 5, 2024
Comment on lines -701 to +702
nas = pc.invert(pc.is_null(self._pa_array))
arr = pc.and_kleene(nas, pc.not_equal(self._pa_array, ""))
if not skipna:
nas = pc.is_null(self._pa_array)
arr = pc.or_kleene(nas, pc.not_equal(self._pa_array, ""))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, for converting the string array to a boolean array, True values were considered as "not "", and not null", and this diff changed that to "not "", or null"

assert ser.any()
assert ser.all()
assert not ser.all(skipna=False)
assert ser.any(skipna=False)
assert ser.all(skipna=False) # NaN is considered truthy
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The places where I currently commented this # NaN is considered truthy are the test cases where the result would change from True to False if NaN would be considered as falsey instead.

@mroeschke
Copy link
Member

In 3.0, adjacently .idxmax/min will raise a ValueError if skipna=False and a NA value is encountered. There could be an API behavior argument to align with that behavior in the future.

Besides that point, I would also agree with your initial expectation that NA values are falsey.

@jorisvandenbossche
Copy link
Member Author

In 3.0, adjacently .idxmax/min will raise a ValueError if skipna=False and a NA value is encountered. There could be an API behavior argument to align with that behavior in the future.

For any/all, I think in the future we don't have to raise about missing values, but we can use Kleene logic and return pd.NA in some cases (although that might then just give an error downstream in your code ..).

That's the behaviour that is currently implemented for the nullable BooleanDtype, and if we plan to stick to that behaviour, I would say it's not worth to change the current NaN-based dtypes to start raise an error for NaNs (and we can preserve the strange "NaN is considered as True because it is not equal to 0" for now).

@mroeschke mroeschke merged commit ac69522 into pandas-dev:main Aug 6, 2024
39 of 45 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-reductions-any-all branch August 7, 2024 07:20
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 27, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024
jorisvandenbossche added a commit that referenced this pull request Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design backported Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants