BUG: DataFrame.any() not returning Boolean series with skipna=False across columns with numeric and string types #38962

Closed
2 tasks done
rmwenzel opened this issue Jan 5, 2021 · 11 comments
Labels
Bug Duplicate Report Duplicate issue or pull request Upstream issue Issue related to pandas dependency

Comments

@rmwenzel

rmwenzel commented Jan 5, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

If the DataFrame's columns have different numeric types

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 0], [np.nan, 1]], columns=list('ab'))
df.any(axis=1, skipna=False)

the behavior is as expected

0    True
1    True
dtype: bool

But if the columns have numeric and string types

df = pd.DataFrame([[1, 's'], [np.nan, 't']], columns=list('ab'))
df.any(axis=1, skipna=False)

it looks like the first column is returned

0      1
1    NaN
dtype: object

Mixed types across rows don't seem to be a problem. If I do

df = pd.DataFrame([['s', 1], [np.nan, 't']], columns=list('ab'))
df.any(axis=0, skipna=False)

I get

a    True
b    True
dtype: bool

I also don't have any issues if skipna=True.

Expected behavior is to get a series of booleans either way.
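
For readers who need a boolean result in the meantime, here is a minimal workaround sketch. The helper name `rowwise_any_bool` is hypothetical and not part of pandas. It converts each value to its Python truthiness before combining, and it counts NaN as truthy (which is how `bool(np.nan)` evaluates), as an approximation of skipna=False.

```python
import numpy as np
import pandas as pd


def rowwise_any_bool(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: logical OR of the truthiness of each row's values.

    NaN counts as truthy here because bool(np.nan) is True.
    """
    # any() over explicit bool() calls always yields a Python bool,
    # so the resulting Series has dtype bool regardless of column dtypes.
    return df.apply(lambda row: any(bool(v) for v in row), axis=1)


df = pd.DataFrame([[1, 's'], [np.nan, 't']], columns=list('ab'))
result = rowwise_any_bool(df)
print(result)
```

This trades speed for predictability, since it evaluates truthiness element by element in Python.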

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.9.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.0.0.post20201207
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@rmwenzel rmwenzel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 5, 2021
@mzeitlin11
Member

Thanks for the report @rmwenzel! This looks to be caused by the following numpy behavior:

arr = np.array([[1, 's'], [np.nan, 't']]).astype(object)
arr.any(axis=1)

gives

['1' 'nan']

In your second example, which works as you expected (assuming you meant axis=0 instead of axis=1?), the numpy .any() step still does not give a boolean result; it just ends up being coerced to a boolean result on the pandas end.

@mzeitlin11
Member

In the 1d case too:

arr = np.array(['s', 1], dtype=object)
arr.any()

gives s, which goes against numpy docs that say .any() should return a boolean if axis is None. Seems like an upstream issue then?
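
One way to picture why an operand comes back instead of a boolean: Python's `or` returns one of its operands, and a reduction built on it does the same. The sketch below uses only the standard library (the name `or_reduce` is made up for illustration) and is an analogy for the object-dtype reduction, not numpy's actual implementation. Newer numpy releases may return a boolean here, so the behavior reported above is version dependent.

```python
from functools import reduce


def or_reduce(values):
    """Fold a sequence with Python's `or`, which returns an operand, not a bool."""
    return reduce(lambda a, b: a or b, values)


first_truthy = or_reduce(['s', 1])   # 's' is truthy, so `or` stops there
as_bool = bool(first_truthy)         # explicit coercion gives a real boolean
print(repr(first_truthy), as_bool)
```

Wrapping the reduction in `bool(...)` is what the pandas side would need to do to guarantee a boolean.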

@rmwenzel
Author

rmwenzel commented Jan 5, 2021

@mzeitlin11 thanks for the speedy response! I did mean axis=0 in the second example; I edited to reflect that.

By upstream I take it you mean a numpy issue? If so I can close this and go file a report there.

@mzeitlin11
Member

@mzeitlin11 thanks for the speedy response! I did mean axis=0 in the second example; I edited to reflect that.

By upstream I take it you mean a numpy issue? If so I can close this and go file a report there.

That would be great! I think we should still leave this open, though, pending resolution of the numpy issue. Tests should also be added for these cases, even if they have to be xfailed until the issue is fixed.
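
A rough sketch of what such a test might look like, assuming pytest is available. The test name, the parametrized cases, and the xfail reason are illustrative assumptions, not taken from the pandas test suite. `strict=False` lets the test pass quietly on versions where the result is already boolean.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "data",
    [
        [[1, "s"], [np.nan, "t"]],
        [["s", 1], [np.nan, "t"]],
    ],
)
@pytest.mark.xfail(
    reason="object-dtype any() may return an operand; see numpy/numpy#18129",
    strict=False,
)
def test_any_mixed_dtypes_returns_bool(data):
    # Hypothetical test: the row-wise reduction should yield a bool Series.
    df = pd.DataFrame(data, columns=list("ab"))
    result = df.any(axis=1, skipna=False)
    assert result.dtype == bool
```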

@mzeitlin11 mzeitlin11 added Upstream issue Issue related to pandas dependency and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 5, 2021
@rmwenzel
Author

rmwenzel commented Jan 6, 2021

Done! See numpy/numpy#18129

@rmwenzel
Author

rmwenzel commented Jan 7, 2021

@mzeitlin11 after some discussion, it looks like np.any() is behaving as expected. See the above referenced issue for details.

@mzeitlin11
Member

Thanks @rmwenzel! I guess the question becomes whether pandas should match that behavior or always ensure returning something boolean. Your examples above show an inconsistency in how this is handled when reducing on different axes.

@mzeitlin11 mzeitlin11 added the Needs Discussion Requires discussion from core team before further action label Jan 8, 2021
@rmwenzel
Author

rmwenzel commented Jan 9, 2021

@mzeitlin11 I was wondering the same thing. Consistency with the behavior for rows (axis=0) seems important. It seems to me (e.g. from the documentation) that this was the original intent, and the inconsistency could be pretty confusing. Maybe more important, it seems more useful to have the boolean. In my (somewhat limited) experience, one uses any() like a logical or. To me it makes sense to have truthy values register as true, non-truthy as false, and return the logical or of that. It's harder to see the use of returning the first truthy value.

@mzeitlin11
Member

def nanany(
    values: np.ndarray,
    *,
    axis: Optional[int] = None,
    skipna: bool = True,
    mask: Optional[np.ndarray] = None,
) -> bool:

Probably just want to ensure this returns only bools (typing here is also suspect because it can return an array of bools in addition to just a single boolean value).
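
To make that suggestion concrete, here is a minimal sketch of a reduction that always returns booleans and whose annotation admits both a scalar and an array. The function `any_as_bool` is a hypothetical stand-in, not the pandas `nanany`; it ignores `skipna` and `mask` entirely and treats NaN as truthy.

```python
from typing import Optional, Union

import numpy as np


def any_as_bool(
    values: np.ndarray, axis: Optional[int] = None
) -> Union[bool, np.ndarray]:
    """Hypothetical sketch: reduce truthiness so the result is always boolean."""
    obj = np.asarray(values, dtype=object)
    # Evaluate Python truthiness per element, then restore the original shape.
    truthy = np.fromiter(
        (bool(v) for v in obj.ravel()), dtype=bool, count=obj.size
    ).reshape(obj.shape)
    result = truthy.any(axis=axis)
    # With axis=None numpy returns a scalar; convert it to a plain Python bool.
    return bool(result) if axis is None else result


arr = np.array([[1, 's'], [np.nan, 't']], dtype=object)
print(any_as_bool(arr, axis=1))
print(any_as_bool(arr))
```

The `Union[bool, np.ndarray]` return type reflects the point about typing: with an axis the result is an array of bools, and without one it is a single boolean.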

@mzeitlin11
Member

Closing as a duplicate in favor of #12863, feel free to chime in there if interested in working on this. Also looks like an issue on the numpy side was previously raised that gives a different answer than you got @rmwenzel (numpy/numpy#4352)

@mzeitlin11 mzeitlin11 added Duplicate Report Duplicate issue or pull request and removed Needs Discussion Requires discussion from core team before further action labels Jan 10, 2021
@rmwenzel
Author

@mzeitlin11 Thanks for catching that other issue. Apologies for not digging deep enough to find it!
