pandas.isnull() has poor behavior for lists #20675

Closed
bfollinprm opened this issue Apr 13, 2018 · 4 comments · Fixed by #20971
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

@bfollinprm

bfollinprm commented Apr 13, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
pd.isnull([np.NaN, 'world'])

# returns:
# array([False, False], dtype=bool)

Problem description

The output of pd.isnull/pd.isna on a list depends on the dtype that numpy infers when the list is converted to an array.
When the list mixes strings and float('nan') values, numpy infers a string dtype and converts np.NaN to the string "nan", which pd.isnull() does not recognize as a null value. Any of the following would address the underlying problem:

  • explicitly convert to object arrays instead of string arrays, as is done in pd.Series construction
  • convert to a pd.Series object instead of a numpy array (leverages the above)
  • apply pd.isna() element-wise in a list comprehension for list objects, e.g.
def isna(a):  # a is a list
    return np.array([pd.isna(el) for el in a])
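
To make the three options above concrete, here is a minimal sketch that first shows the string coercion and then applies each workaround to the same list. The variable names are illustrative only, and the behavior of pd.isnull() on a raw list is deliberately not shown because it differs between pandas versions.

```python
import numpy as np
import pandas as pd

values = [np.nan, "world"]

# Plain conversion: numpy infers a string dtype and stringifies the NaN,
# so the missing value is no longer distinguishable from ordinary text.
coerced = np.asarray(values)
print(coerced.dtype.kind)  # "U" (unicode string) on Python 3

# Option 1: request an object array so the float NaN is kept as-is.
as_object = np.asarray(values, dtype=object)
mask_object = pd.isna(as_object)

# Option 2: build a Series, which does not stringify the NaN.
mask_series = pd.Series(values).isna().to_numpy()

# Option 3: test each element individually.
mask_elementwise = np.array([pd.isna(el) for el in values])

print(mask_object, mask_series, mask_elementwise)
```

All three masks flag only the first element, whereas a mask computed on `coerced` flags nothing.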

Expected Output

array([True, False], dtype=bool)

or

TypeError

if it is undesirable to support lists with mixed float/string types with pd.isnull()

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1048-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.2
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

As you mention, this is due to np.asarray(...) coercing everything to a string once there is a string in it. I would say this is a bug (or design issue) in numpy, but since it is a well-known one, we should work around it, as we do in other places.

From a quick look, this might be used instead of asarray:

In [58]: pd.core.dtypes.cast.maybe_convert_platform([np.nan, 'world'])
Out[58]: array([nan, 'world'], dtype=object)
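
For readers who want the same effect using only public API, the idea can be sketched as below. This is not the patch that closed the issue, and `isna_listlike` is a hypothetical helper name. It assumes that forcing an object dtype for list and tuple inputs is an acceptable stand-in for the internal maybe_convert_platform call.

```python
import numpy as np
import pandas as pd


def isna_listlike(values):
    """Hypothetical helper: null mask for a list without string coercion."""
    if isinstance(values, (list, tuple)):
        # Force object dtype so numpy keeps each element unchanged
        # instead of inferring a string dtype for mixed input.
        values = np.asarray(values, dtype=object)
    return pd.isna(values)


print(isna_listlike([np.nan, "world"]))
```

Inputs other than lists and tuples are passed through to pd.isna() unchanged, so arrays that already have a numeric dtype keep their existing code path.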

PR welcome!

@jorisvandenbossche jorisvandenbossche added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 13, 2018
@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Apr 13, 2018
@bfollinprm
Author

I'll work through this hopefully early next week.

@bfollinprm
Author

I want to update so I'm not thought of as a raise-and-dump kind of person: awaiting approval to contribute from work, but have a fix ready if/when that happens.

@jorisvandenbossche
Member

OK, hopefully that will be no problem, and looking forward to the PR

@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 7, 2018

3 participants