-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation #59414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation #59414
Conversation
nas = pc.invert(pc.is_null(self._pa_array)) | ||
arr = pc.and_kleene(nas, pc.not_equal(self._pa_array, "")) | ||
if not skipna: | ||
nas = pc.is_null(self._pa_array) | ||
arr = pc.or_kleene(nas, pc.not_equal(self._pa_array, "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, for converting the string array to a boolean array, True values were considered as "not ""
, and not null", and this diff changed that to "not ""
, or null"
assert ser.any() | ||
assert ser.all() | ||
assert not ser.all(skipna=False) | ||
assert ser.any(skipna=False) | ||
assert ser.all(skipna=False) # NaN is considered truthy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The places where I currently commented this # NaN is considered truthy
are the test cases where the result would change from True to False if NaN would be considered as falsey instead.
In 3.0, adjacently Besides that point, I would also agree with your initial expectation that NA values are falsey. |
For That's the behaviour that is currently implemented for the nullable BooleanDtype, and if we plan to stick to that behaviour, I would say it's not worth to change the current NaN-based dtypes to start raise an error for NaNs (and we can preserve the strange "NaN is considered as True because it is not equal to 0" for now). |
Thanks @jorisvandenbossche |
…d implementation (#59414)
While working on adding any/all support for the object-dtype version of the StringDtype in #58451, I bumped into some issues with the current testing and implementation.
It seemed that our current testing is a bit limited, so I expanded the existing test to cover more cases and to be parametrized over all string dtype variants (including plain object dtype).
But that uncovered some issues with the current implementation of the pyarrow-backed version:
any
could returnpd.NA
in case ofskipna=False
, while for this version of the string dtype (not using Kleene logic), the result should always be True or False. This is an easy fix (we currently were only filling missing values in case ofall
, also do this forany
)I find that a bit strange (my first expectation would be to treat missing values as False), but it is explicitly documented, and follows the behaviour of current object dtype (eg
pd.Series(["a". np.nan], dtype=object).all(skipna=False)
gives True, not False.So for now this PR updated the pyarrow implementation to follow the documented behaviour.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.