BUG: where(...) unexpectedly casts None to pd.NA for series with pd.Int64Dtype #34536

Closed
michellejzh opened this issue Jun 2, 2020 · 8 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@michellejzh

michellejzh commented Jun 2, 2020

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.

Problem description

When using where(...) to replace all pd.NA values with None on a pd.Series with dtype pd.Int64Dtype, the None values are automatically coerced back to pd.NA, which is inconsistent with other dtypes. Additionally, the try_cast flag doesn't control this behavior.

Expected Output

I expect that when try_cast=False, None will not be coerced into pd.NA, and the dtype of the series will instead be cast to object to accommodate the newly introduced None value, as occurs for other dtypes.

Example

In the following:

s = pd.Series([1, 2, 3, pd.NA], dtype=pd.Int64Dtype())
s.where(pd.notnull(s), None)

the None I'm trying to replace pd.NA with is automatically cast back into pd.NA. This happens even though try_cast defaults to False, which I would expect to control this casting behavior.

0       1
1       2
2       3
3    <NA>
dtype: Int64

This is in contrast to the behavior of other dtypes, like numpy's float dtype:

s = pd.Series([1, 2, 3, np.nan], dtype=np.dtype("float64"))
s.where(pd.notnull(s), None)

which outputs the series with the dtype coerced to object after incorporating None:

0       1
1       2
2       3
3    None
dtype: object
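For reference, the Int64 behaviour above can be reproduced with a short script (a sketch against pandas >= 1.0; the float64 result may differ on newer pandas versions, so only the Int64 case is checked here):

```python
import pandas as pd

# Nullable Int64: None passed as the replacement value is coerced
# back to pd.NA and the dtype is preserved.
s = pd.Series([1, 2, 3, pd.NA], dtype="Int64")
out = s.where(pd.notnull(s), None)

print(out.dtype)    # Int64
print(out.iloc[3])  # <NA> -- the None did not survive
```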

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@michellejzh michellejzh added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2020
@dsaxton dsaxton added Dtype Conversions Unexpected or buggy dtype conversions NA - MaskedArrays Related to pd.NA and nullable extension arrays Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2020
@jorisvandenbossche
Member

If you convert it to object dtype first, then it actually replaces NA with None:

>>> s.astype(object).where(pd.notnull(s), None)   
0       1
1       2
2       3
3    None
dtype: object
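A minimal check of this workaround (a sketch; note that the mask is computed on the original nullable series before the astype call, exactly as above):

```python
import pandas as pd

s = pd.Series([1, 2, 3, pd.NA], dtype="Int64")

# Cast to object first so the masked Int64 array no longer
# coerces None back to pd.NA.
replaced = s.astype(object).where(pd.notnull(s), None)

print(replaced.dtype)    # object
print(replaced.iloc[3])  # None
```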

Personally, I might consider it good behaviour that it tries to preserve the dtype by default. But it's certainly inconsistent with how this works for other dtypes.

@jorisvandenbossche
Member

The question is a bit: which values are considered "valid" for the current dtype, and which are not?

For the numpy integer dtype, for example, we also coerce the other value where possible.
Here I pass a float, but the result is still an int series:

In [51]: s = pd.Series([1, 2, 3])

In [52]: s.where(s == 1, 2.0)
Out[52]: 
0    1
1    2
2    2
dtype: int64

Only when the value cannot be interpreted as an int does the series get upcast to float or object dtype.
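Both cases can be sketched side by side (a sketch with hypothetical replacement values; the lossless 2.0 keeps the dtype, while 2.5 forces an upcast):

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # numpy-backed int64

# 2.0 can be cast to int without loss, so the dtype is preserved.
kept = s.where(s == 1, 2.0)
print(kept.dtype)    # int64

# 2.5 cannot be represented as an int, so the result is upcast.
upcast = s.where(s == 1, 2.5)
print(upcast.dtype)  # float64
```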

Now, for the nullable integer dtype, we actually raise an error if it doesn't fit in integers:

In [62]: s = pd.Series([1, 2, 3], dtype="Int64")

In [63]: s.where(s == 1, 2.5) 
...
TypeError: cannot safely cast non-equivalent object to int64

(so in that light coercing None to NA certainly makes sense)

@jorisvandenbossche
Member

Related where issue: #24144

@jbrockmendel
Member

@jorisvandenbossche is right; the solution here is to cast to object dtype first.

@phofl
Member

phofl commented Jan 19, 2023

Closing as expected behavior: masked dtypes don't upcast when setting incompatible values.

@phofl phofl closed this as completed Jan 19, 2023
@jbrockmendel
Member

I think this is worth keeping open, as the lack of consistency with other dtypes is a pretty common complaint.

@phofl
Member

phofl commented Jan 19, 2023

I'd rather open an issue about documenting this. I was planning to do that anyway once the RC for 2.0 is out.

@jbrockmendel
Member

The issue isn't the docs, it's the desired behavior. I'd be open to changing/deprecating the non-nullable behavior so it doesn't cast, but I think we should be consistent throughout.

5 participants