BUG: DataFrame.any() not returning Boolean series with skipna=False across columns with numeric and string types #38962

rmwenzel · 2021-01-05T03:33:00Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.

If the DataFrame's columns all have numeric types

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 0], [np.nan, 1]], columns=list('ab'))
df.any(axis=1, skipna=False)

the behavior is as expected

0    True
1    True
dtype: bool

But if the columns have numeric and string types

df = pd.DataFrame([[1, 's'], [np.nan, 't']], columns=list('ab'))
df.any(axis=1, skipna=False)

it looks like the first column is returned

0      1
1    NaN
dtype: object

Mixed types within a column don't seem to be a problem when reducing along axis=0. If I do

df = pd.DataFrame([['s', 1], [np.nan, 't']], columns=list('ab'))
df.any(axis=0, skipna=False)

I get

a    True
b    True
dtype: bool

I also don't have any issues if skipna=True.

Expected behavior is to get a series of booleans either way.
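As a stopgap for the expected behavior, here is a minimal sketch that computes the row-wise result with plain Python truthiness. The helper name `row_any_bool` is hypothetical and not part of pandas. It treats NaN as truthy, which is how the pandas documentation describes `skipna=False`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, "s"], [np.nan, "t"]], columns=list("ab"))


def row_any_bool(frame):
    """Hypothetical helper: row-wise any() that always yields booleans."""
    # Convert to an object array so each cell keeps its Python value,
    # then apply ordinary truthiness to every cell in the row.
    flags = [any(bool(v) for v in row) for row in frame.to_numpy(dtype=object)]
    return pd.Series(flags, index=frame.index, dtype=bool)


result = row_any_bool(df)
print(result)
```

This loops in Python, so it is meant as an illustration of the expected semantics rather than a fast replacement.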

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : b5958ee
python : 3.9.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.0.0.post20201207
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

mzeitlin11 · 2021-01-05T04:47:06Z

Thanks for the report @rmwenzel! This looks to be caused by the following numpy behavior:

arr = np.array([[1, 's'], [np.nan, 't']]).astype(object)
arr.any(axis=1)

gives

['1' 'nan']

In your second example, which works as you expected (assuming you meant axis=0 instead of axis=1?), the numpy .any() step still does not give a boolean result; it just ends up being coerced to a boolean result on the pandas end.

mzeitlin11 · 2021-01-05T04:54:30Z

In the 1d case too:

arr = np.array(['s', 1], dtype=object)
arr.any()

gives 's', which goes against the numpy docs, which say .any() should return a boolean if axis is None. Seems like an upstream issue then?

rmwenzel · 2021-01-05T05:49:28Z

@mzeitlin11 thanks for the speedy response! I did mean axis=0 in the second example; I've edited the post to reflect that.

By upstream I take it you mean a numpy issue? If so I can close this and go file a report there.

mzeitlin11 · 2021-01-05T16:00:01Z

> @mzeitlin11 thanks for the speedy response! I did mean axis=0 in the second example; I've edited the post to reflect that.
>
> By upstream I take it you mean a numpy issue? If so I can close this and go file a report there.

That would be great! I think we should still leave this open, though, pending resolution of the numpy issue. Tests should also be added for these cases, even if they have to be xfailed until the issue is fixed.

rmwenzel · 2021-01-06T04:43:02Z

Done! See numpy/numpy#18129

rmwenzel · 2021-01-07T20:52:41Z

@mzeitlin11 after some discussion, it looks like np.any() is behaving as expected. See the above referenced issue for details.

mzeitlin11 · 2021-01-08T17:02:19Z

Thanks @rmwenzel! I guess the question becomes whether pandas should match that behavior or always ensure it returns something boolean. Your examples above show an inconsistency in how this is handled when reducing on different axes.

rmwenzel · 2021-01-09T03:57:26Z

@mzeitlin11 I was wondering the same thing. Consistency with the behavior for rows (axis=0) seems important. To me it seems (e.g. from the documentation) that this was the original intent, and the inconsistency could be pretty confusing. Maybe more important, it seems more useful to have the boolean. In my (somewhat limited) experience, one uses any() like a logical or. To me it makes sense to have truthy values register as true and non-truthy values as false, and to return the logical or of those. It's harder to see the use of returning the first truthy value.
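The difference between the two semantics can be sketched in plain Python. This assumes the object-dtype reduction behaves like chaining Python's `or` across the values, which matches the outputs reported earlier in this thread. The function names are made up for illustration.

```python
from functools import reduce


def first_truthy_or(values):
    """Chain Python's `or`: the result is one of the operands, not a bool."""
    return reduce(lambda a, b: a or b, values)


def logical_any(values):
    """Map each value to its truthiness first, then take the logical or."""
    return any(bool(v) for v in values)


nan = float("nan")
rows = [[1, "s"], [nan, "t"]]

# `or` chaining yields the first truthy operand of each row (NaN is truthy).
or_results = [first_truthy_or(row) for row in rows]
# Truthiness-based reduction always yields real booleans.
bool_results = [logical_any(row) for row in rows]

print(or_results)
print(bool_results)
```

The first list mirrors the "first column is returned" output from the report, while the second is the boolean result argued for here.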

mzeitlin11 · 2021-01-09T22:16:55Z

pandas/pandas/core/nanops.py

Lines 440 to 446 in eead40e

    
def nanany(
    values: np.ndarray,
    *,
    axis: Optional[int] = None,
    skipna: bool = True,
    mask: Optional[np.ndarray] = None,
) -> bool:

We probably just want to ensure this returns only bools. The typing here is also suspect, because it can return an array of bools in addition to a single boolean value.
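One way to picture "returns only bools" is a thin wrapper that coerces whatever numpy's reduction returns. This is only a sketch under that assumption, not the actual pandas fix; `any_as_bool` is a hypothetical name and it ignores the `skipna` and `mask` handling that `nanany` performs.

```python
import numpy as np


def any_as_bool(values, axis=None):
    """Hypothetical wrapper: reduce with .any() and coerce the result to bool."""
    result = np.asarray(values, dtype=object).any(axis=axis)
    if isinstance(result, np.ndarray):
        # Reduction along an axis: convert each element by truthiness.
        return result.astype(bool)
    # Full reduction: a single value, which may be an arbitrary object.
    return bool(result)


print(any_as_bool([[1, "s"], [np.nan, "t"]], axis=1))
print(any_as_bool(["s", 1]))
```

The return type is either a plain `bool` or a boolean array, which is also why a bare `-> bool` annotation does not describe it fully.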

mzeitlin11 · 2021-01-10T04:02:17Z

Closing as a duplicate in favor of #12863. Feel free to chime in there if interested in working on this. It also looks like an issue was previously raised on the numpy side that gives a different answer than you got @rmwenzel (numpy/numpy#4352).

rmwenzel · 2021-01-10T04:42:03Z

@mzeitlin11 Thanks for catching that other issue. Apologies for not digging deep enough to find it!

rmwenzel added the Bug and Needs Triage labels Jan 5, 2021

mzeitlin11 added the Upstream issue label and removed the Needs Triage label Jan 5, 2021

mzeitlin11 added the Needs Discussion label Jan 8, 2021

mzeitlin11 closed this as completed Jan 10, 2021

mzeitlin11 added the Duplicate Report label and removed the Needs Discussion label Jan 10, 2021
