-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Series.combine_first
replaces None
with NaN
at index not provided by other
#58977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I would like to take this up. Working on a fix now |
I can reproduce the issue locally and the problem seems to originate from: Line 3149 in bbe0e53
The root cause of the issue:
My fix is going to be: |
@mroeschke I am tagging you here because you have previously reviewed the code for the combine_first function and I assume you might be maintaining it. I read quite a bit about Pandas convention of representing NA values for this issue and I stumbled upon this StackOverflow which is inspired by the official docs. The problem arises because Pandas considers None and NaN the same, take for example the isna() or isnull() utility. Based on the resources I read above, Pandas has decided to use NaN as its default NA representation. Based on these points, I feel the best course of action is to convert the None values to NaN and then invoke the combine_first function. This will add O(N) computational overhead given the fillna() operation. @michaelpradel If you have a better alternative please let me know! |
@pandyah5: Your proposed fix sounds good to me, at least for series with (inferred) dtype "object". Note that I'm not a pandas developers, just a user; so please take my opinion on this with a grain of salt. |
Hi @michaelpradel thanks for your feedback! Its been a while since this issue has been opened so I'll bring this up in the community and get some maintainers attention here to have their final verdict :) Update: I have notified the pandas-dev slack channel for some help in reviewing this PR |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The documentation of
Series.combine_first
states:In the above example,
s2
doesn't provide any value at index'a'
, socombine_first
should not affect this index.However, the result with pandas 2.2.2 is the following, which unexpectedly changes the value at
'a'
fromNone
toNaN
:Pandas 2.1.4 behaves as expected, so this looks like a regression.
The behavior was changed with this PR: #57034
Expected Behavior
The expected output is:
Installed Versions
pandas : 2.2.0rc0+28.g6dbeeb4009
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 63.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.0
hypothesis : 6.100.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 16.0.0
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: