BUG: where(...) unexpectedly casts None to pd.NA for series with pd.Int64Dtype #34536

Closed
michellejzh opened this issue Jun 2, 2020 · 8 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@michellejzh

michellejzh commented Jun 2, 2020

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.

Problem description

When using where(...) to replace all pd.NA values with None on a pd.Series with dtype pd.Int64Dtype, the None values are automatically coerced back to pd.NA, which is inconsistent with other dtypes. Additionally, the try_cast flag doesn't control this behavior.

Expected Output

I expect that when try_cast=False, None will not be coerced into pd.NA, and the dtype of the series will instead be cast to object to accommodate the newly introduced None value, as occurs for other dtypes.

Example

In the following:

s = pd.Series([1, 2, 3, pd.NA], dtype=pd.Int64Dtype())
s.where(pd.notnull(s), None)

the None I'm trying to replace pd.NA with is automatically cast back into pd.NA. This happens even though try_cast defaults to False, which I would expect to control this casting behavior.

0       1
1       2
2       3
3    <NA>
dtype: Int64

This is in contrast to the behavior of other dtypes, like numpy's float dtype:

s = pd.Series([1, 2, 3, np.nan], dtype=np.dtype("float64"))
s.where(pd.notnull(s), None)

which outputs the series with the dtype coerced to object after incorporating None:

0       1
1       2
2       3
3    None
dtype: object
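For reference, the Int64 behaviour above can be reproduced with a short script (a sketch against pandas >= 1.0; the float64 result may differ on newer pandas versions, so only the Int64 case is checked here):

```python
import pandas as pd

# Nullable Int64: None passed as the replacement value is coerced
# back to pd.NA and the dtype is preserved.
s = pd.Series([1, 2, 3, pd.NA], dtype="Int64")
out = s.where(pd.notnull(s), None)

print(out.dtype)    # Int64
print(out.iloc[3])  # <NA> -- the None did not survive
```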

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@michellejzh michellejzh added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2020
@dsaxton dsaxton added Dtype Conversions Unexpected or buggy dtype conversions NA - MaskedArrays Related to pd.NA and nullable extension arrays Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2020
@jorisvandenbossche
Member

If you convert it to object dtype first, then it actually replaces NA with None:

>>> s.astype(object).where(pd.notnull(s), None)   
0       1
1       2
2       3
3    None
dtype: object
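A minimal check of this workaround (a sketch; note that the mask is computed on the original nullable series before the astype call, exactly as above):

```python
import pandas as pd

s = pd.Series([1, 2, 3, pd.NA], dtype="Int64")

# Cast to object first so the masked Int64 array no longer
# coerces None back to pd.NA.
replaced = s.astype(object).where(pd.notnull(s), None)

print(replaced.dtype)    # object
print(replaced.iloc[3])  # None
```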

Personally, I might consider it good behaviour that it tries to preserve the dtype by default. But it's certainly inconsistent with how this works for other dtypes.

@jorisvandenbossche
Member

The question is a bit: which values are considered "valid" for the current dtype, and which are not?

For the numpy integer dtype, for example, we also coerce the other value where possible.
Here I pass a float, but the result is still an int series:

In [51]: s = pd.Series([1, 2, 3])

In [52]: s.where(s == 1, 2.0)
Out[52]: 
0    1
1    2
2    2
dtype: int64

Only when the value cannot be interpreted as an int does the series get upcast to float or object dtype.
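Both cases can be sketched side by side (a sketch with hypothetical replacement values; the lossless 2.0 keeps the dtype, while 2.5 forces an upcast):

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # numpy-backed int64

# 2.0 can be cast to int without loss, so the dtype is preserved.
kept = s.where(s == 1, 2.0)
print(kept.dtype)    # int64

# 2.5 cannot be represented as an int, so the result is upcast.
upcast = s.where(s == 1, 2.5)
print(upcast.dtype)  # float64
```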

Now, for the nullable integer dtype, we actually raise an error if it doesn't fit in integers:

In [62]: s = pd.Series([1, 2, 3], dtype="Int64")

In [63]: s.where(s == 1, 2.5) 
...
TypeError: cannot safely cast non-equivalent object to int64

(so in that light coercing None to NA certainly makes sense)

@jorisvandenbossche
Member

Related where issue: #24144

@jbrockmendel
Member

@jorisvandenbossche is right; the solution here is to cast to object dtype first.

@phofl
Member

phofl commented Jan 19, 2023

Closing as expected behavior: masked dtypes don't upcast when setting incompatible values.

@phofl phofl closed this as completed Jan 19, 2023
@jbrockmendel
Member

I think this is worth keeping open, as the lack of consistency with other dtypes is a pretty common complaint.

@phofl
Member

phofl commented Jan 19, 2023

I'd rather open an issue about documenting this. I was planning to do that anyway once the RC for 2.0 is out.

@jbrockmendel
Member

The issue isn't the docs, it's the desired behavior. I'd be open to changing/deprecating the non-nullable behavior so it doesn't cast, but I think we should be consistent throughout.

5 participants