any/all reductions on boolean object-typed Series #27709

Closed
xhochy opened this issue Aug 2, 2019 · 5 comments · Fixed by #41102
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Reduction Operations sum, mean, min, max, etc.
@xhochy
Contributor

xhochy commented Aug 2, 2019

While implementing a boolean-based ExtensionArray, I stumbled on the fact that boolean arrays with missing values (which can only be object-typed in pandas) show more or less undefined behaviour in pandas reductions with skipna=False.

According to the docstring of Series.any, each of the following calls with skipna=False should return True:

import numpy as np
import pandas as pd

pd.Series([False, None]).any(skipna=False)
# None
pd.Series([None, False]).any(skipna=False)
# False
pd.Series([False, np.nan]).any(skipna=False)
# nan
pd.Series([np.nan, False]).any(skipna=False)
# nan

The same operation on float columns, by contrast, behaves as documented:

pd.Series([np.nan, 0.]).any(skipna=False)
# True
pd.Series([0, np.nan]).any(skipna=False)
# True

As I have not found a unit test for the above case with a boolean object column, I suspect that this is undefined rather than intended behaviour.

Three solutions come to my mind:

  1. Document this behaviour in the Series.any() docstring.
  2. Align the behaviour of pd.Series(booleans, dtype=object).any(…) with pd.Series(booleans, dtype=object).astype(float).any(…).
  3. Raise an error when calling any/all on a mixed-type boolean series.
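
Option 2 can be sketched as follows. This is only an illustration of the proposed semantics, not how pandas implements anything; the helper name `any_like_float` is hypothetical, and the sketch assumes the object-typed Series holds nothing but booleans and missing values (`None` or `np.nan`).

```python
import numpy as np
import pandas as pd


def any_like_float(ser, skipna=True):
    """Hypothetical helper: reduce an object-typed boolean Series like a float Series."""
    # Casting to float maps False -> 0.0, True -> 1.0 and None/np.nan -> NaN.
    as_float = ser.astype(float)
    # With skipna=False, NaN counts as truthy because it is not equal to zero.
    return bool(as_float.any(skipna=skipna))


cases = [[False, None], [None, False], [False, np.nan], [np.nan, False]]
results = [any_like_float(pd.Series(c, dtype=object), skipna=False) for c in cases]
print(results)
```

Under these assumptions all four orderings from the report give the same answer, which is the consistency option 2 is after.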
@jorisvandenbossche
Member

There is an open issue about this already, I think (will try to look it up). IIRC, the bottom line is that this is numpy behaviour.
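
One way to see where the reported results could come from: if the object-dtype reduction follows the semantics of Python's `or` (an assumption here, not something confirmed in this thread), the result is one of the operands rather than a bool. The pure-Python sketch below, with the hypothetical helper `or_reduce`, reproduces the four outputs from the report.

```python
import math
from functools import reduce


def or_reduce(values):
    """Hypothetical helper: chain Python's `or` over a sequence."""
    # `a or b` returns `a` if it is truthy, otherwise `b`; it never coerces to bool.
    return reduce(lambda a, b: a or b, values)


nan = float("nan")
print(or_reduce([False, None]))  # None
print(or_reduce([None, False]))  # False
print(or_reduce([False, nan]))   # nan
print(or_reduce([nan, False]))   # nan
```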

@xhochy
Contributor Author

xhochy commented Aug 2, 2019

@jorisvandenbossche Thanks! I wasn't able to find that one. So the best solution is to add an example to the docs then?

@jorisvandenbossche
Member

See e.g. #12863 (I seem to remember another issue where I participated, but can't find anything).

@jorisvandenbossche
Member

So the best solution is to add an example to the docs then?

Not fully sure. From quickly reading #12863, it seems the idea is that this could be fixed in pandas. And there are also open PRs on the numpy side.

@xhochy
Contributor Author

xhochy commented Aug 2, 2019

I've read through the open and closed PRs and issues and am still confused. The issues were in general more about supporting any/all on object columns of any type, not only bool. I'm being a bit more specific here and only talking about booleans.

# These return True
any([np.nan, False]) 
any([False, np.nan])
# These return False 
any([None, False])
any([False, None])

The above operations yield different results, whereas according to our documentation we would want True as the result for all of them. Most issues argue that pandas/numpy should align with the built-in Python behaviour, which would then no longer be the case.
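
The difference between the two groups comes down to truthiness: `np.nan` is a non-zero float and therefore truthy, while `None` is falsy. A small check of this (the variable names are only for illustration):

```python
import numpy as np

# NaN is not equal to zero, so Python treats it as truthy; None is always falsy.
nan_is_truthy = bool(np.nan)
none_is_truthy = bool(None)
print(nan_is_truthy, none_is_truthy)

# The built-in any() only looks at truthiness and always returns a bool.
builtin_results = [
    any([np.nan, False]),
    any([False, np.nan]),
    any([None, False]),
    any([False, None]),
]
print(builtin_results)
```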

I would therefore actually go with option 2, although this would be a breaking change in behaviour:

Align the behaviour of pd.Series(booleans, dtype=object).any(…) with pd.Series(booleans, dtype=object).astype(float).any(…).
