combine_first not retaining dtypes - unmatched indexes #24357
Comments
Thanks for the report. Investigation and PRs are always welcome |
I encountered this when trying to upgrade to 0.25dev from 0.21.0dev. The old code, with its more convoluted handling of i8_conversion, had the side effect of retaining the datetime dtype in cases similar to the above example. I have a fix for my case. Looking at the code and the tests, I think a few changes would solve the above problem as well (and a few of the test cases with "FIXME" comments), but that is a little more invasive. |
My exploration into this issue leads me to believe that pandas is altering to Not a very elegant solution but it did work. |
I think I've managed to reproduce and track down this issue:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame(
{
"a": [np.nan, 2, 3]
}
)
df_2 = pd.DataFrame(
{
"a": [1, 2, np.nan],
"b": ["0123", "0123", "0123"]
}
)
df_1.combine_first(df_2)
As you can see, column "b" was transformed from string(/object) to float by Here's what happens:
I believe my case would be fixed by only calling |
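One step that can be inspected directly is label alignment. The sketch below assumes that combine_first works on the union of both frames' row and column labels (which its documentation describes for the result) and uses the public DataFrame.align method to show what that alignment produces. It is only an illustration and not necessarily the exact code path inside pandas.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"a": [np.nan, 2, 3]})
df_2 = pd.DataFrame({"a": [1, 2, np.nan], "b": ["0123", "0123", "0123"]})

# Align both frames on the union of their row and column labels.
aligned_1, aligned_2 = df_1.align(df_2, join="outer")

# df_1 never had a "b" column, so its aligned copy gets one filled with NaN.
print(aligned_1.dtypes)
# df_2 keeps its original string values in "b".
print(aligned_2.dtypes)
```

After alignment, the "b" column exists on both sides, but on the df_1 side it holds only missing values, so any dtype logic that looks at both sides sees a non-string column there.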
The same happens with unaligned series without nans:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
t = pd.Series([10, 11, 12], index=range(3, 6))
t.combine_first(s)
The series shouldn't have any floats [sorry for i-v, I didn't know how to render a pandas Series in a markdown document] |
Your t Series has NaNs at index 0, 1 & 2 once it is aligned with s. That's why you're having the same result as me. |
Yes, it was to point out that it happens with both Series and DataFrames (not sure if it's useful information) |
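To make the exchange above concrete, here is a small sketch of what outer alignment does to the two Series from the example. It uses Series.align as a stand-in for whatever combine_first does internally, which is an assumption; the point is only that reindexing an integer Series onto labels it lacks introduces NaN and forces float64.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
t = pd.Series([10, 11, 12], index=range(3, 6))

# Outer alignment reindexes both Series to the union index 0..5.
t_aligned, s_aligned = t.align(s, join="outer")

print(t_aligned)  # labels 0, 1 and 2 are missing from t and become NaN
print(s_aligned)  # label 5 is missing from s and becomes NaN
```

Neither input contains a NaN, and neither does the combined result, yet both intermediate Series are float64 because int64 cannot hold a missing value.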
I think this is the same root issue as #7509, so closing in favor of that. |
Not sure whether this is intended behavior or not, but as you can see from the output, the dtype of the col b is changed to float64 when combine_first is called.

I've seen an old open issue: combine_first not retaining dtypes

That issue is from 2014 and explains why data types are coerced to float64 when there is a resulting nan; however, in the example above there is no "resulting" nan. It could be because "under the hood" there are nans where the first index doesn't match the second index. Still, this sort of leaves combine_first in a weird state, because if I can't trust dtypes to not be coerced when appending data, then I need to guarantee matching indexes beforehand. If I have to do that, I sort of have to do half of the work of combine_first manually, making it far less useful.

Expected Output
Output of pd.show_versions()
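As a stopgap for the manual work described above, one option is to wrap combine_first and cast back afterwards. The helper below, combine_first_keep_dtype, is hypothetical (it is not part of pandas) and only handles two Series that share a dtype. It restores that dtype when the combined result contains no missing values.

```python
import pandas as pd


def combine_first_keep_dtype(first, second):
    """Hypothetical helper: combine_first, then restore the shared dtype if possible."""
    combined = first.combine_first(second)
    same_dtype = first.dtype == second.dtype
    if same_dtype and combined.dtype != first.dtype and not combined.isna().any():
        # No missing values survived, so the original dtype can hold the result.
        # Caveat: very large integers may already have lost precision as floats.
        combined = combined.astype(first.dtype)
    return combined


s = pd.Series([1, 2, 3, 4, 5])
t = pd.Series([10, 11, 12], index=range(3, 6))

result = combine_first_keep_dtype(t, s)
print(result.dtype)
```

On pandas versions that already preserve the dtype, the cast is skipped and the helper returns the combine_first result unchanged.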