
combine_first not retaining dtypes - unmatched indexes #24357

Closed
jmarshall9120 opened this issue Dec 19, 2018 · 8 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@jmarshall9120

jmarshall9120 commented Dec 19, 2018

df1 = pd.DataFrame([["one", i] for i in range(3)], columns=["a","b"])
df2 = pd.DataFrame([["one", i] for i in range(5)], columns=["a","b"])
df3 = df1.combine_first(df2)
# all below statements should show a dtype of int64 for column b
df1.dtypes
df2.dtypes
df3.dtypes

# Actual Output
>>> df1.dtypes
a    object
b     int64
dtype: object
>>> df2.dtypes
a    object
b     int64
dtype: object
>>> df3.dtypes
a     object
b    float64
dtype: object

I'm not sure whether this is intended behavior, but as you can see from the output, the dtype of column b is changed to float64 when combine_first is called.

I've seen an old open issue: combine_first not retaining dtypes

That issue is from 2014 and explains why data types are coerced to float64 when there is a resulting NaN; however, in the example above there is no "resulting" NaN. It could be that, under the hood, there are NaNs where the first index doesn't match the second. Still, this leaves combine_first in a weird state: if I can't trust dtypes not to be coerced when appending data, then I need to guarantee matching indexes beforehand. If I have to do that, I'm doing half the work of combine_first manually, which makes it far less useful.
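
A minimal sketch of what I think happens under the hood (the align-to-the-union-of-indexes step is my assumption about the implementation, not something I verified in the source):

import pandas as pd

df1 = pd.DataFrame([["one", i] for i in range(3)], columns=["a", "b"])
df2 = pd.DataFrame([["one", i] for i in range(5)], columns=["a", "b"])

# Reindexing df1 to the union of both indexes adds NaN rows at indexes 3 and 4,
# which upcasts column "b" from int64 to float64 before any combining happens.
union = df1.index.union(df2.index)
df1.reindex(union).dtypes
# a     object
# b    float64
# dtype: object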

Expected Output

>>> df1.dtypes
a    object
b     int64
dtype: object
>>> df2.dtypes
a    object
b     int64
dtype: object
>>> df3.dtypes
a     object
b    int64
dtype: object

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@WillAyd added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Dtype Conversions (Unexpected or buggy dtype conversions) labels Dec 27, 2018
@WillAyd
Member

WillAyd commented Dec 27, 2018

Thanks for the report. Investigation and PRs are always welcome.

@WillAyd added this to the Contributions Welcome milestone Dec 27, 2018
@MJuddBooth

I encountered this when trying to upgrade to 0.25dev from a 0.21.0dev. The old code, with its more convoluted handling of i8_conversion, had the side effect of retaining the datetime dtype in cases similar to the above example. I have a fix for my case. Looking at the code and the tests, I think a few changes would solve the above problem as well (and a few of the test cases with "FIXME" comments), but that is a little more invasive.

@jmarshall9120
Author

jmarshall9120 commented Mar 8, 2019

My exploration into this issue leads me to believe that pandas is casting to a float type to avoid the eventuality of combining an int64 with a null value (None, np.nan, etc.). The only way I could solve this was to remove the use of 'int64' in my code, which unfortunately means coercing every Python int to a float64 and avoiding pandas' native type detection. :(

Not a very elegant solution but it did work.
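
In cases like the original example, where the combined result contains no actual NaNs, a lighter workaround is to cast the affected column back after the call (a sketch using the frames from the report above; it only works because no NaNs survive in the result):

df3 = df1.combine_first(df2)
# Safe here only because df3["b"] has no missing values after combining.
df3["b"] = df3["b"].astype("int64")
df3.dtypes
# a    object
# b     int64
# dtype: object

From pandas 0.24 onwards, the nullable "Int64" extension dtype can also hold missing values without falling back to float64, which may help when NaNs do remain.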

@OliverHofkens

I think I've managed to reproduce and track down this issue:

import pandas as pd
import numpy as np

df_1 = pd.DataFrame(
    {
        "a": [np.nan, 2, 3]
    }
)
df_2 = pd.DataFrame(
    {
        "a": [1, 2, np.nan],
        "b": ["0123", "0123", "0123"]
    }
)

df_1.combine_first(df_2)
     a      b
0  1.0  123.0
1  2.0  123.0
2  3.0  123.0

As you can see, column "b" was transformed from string(/object) to float by combine_first.

Here's what happens:

I believe my case would be fixed by only calling maybe_downcast_to_dtype if the column was present in the original DataFrame.
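
As a side note, the coercion is more than a dtype change: the string values themselves are converted to numbers, so information such as the leading zero in "0123" is lost (a quick check with the df_1 and df_2 defined above):

combined = df_1.combine_first(df_2)
combined["b"].tolist()
# [123.0, 123.0, 123.0] -- the original "0123" strings are gone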

@giuliobeseghi

The same happens with unaligned Series without NaNs:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
t = pd.Series([10, 11, 12], index=range(3, 6))

t.combine_first(s)
0     1.0
1     2.0
2     3.0
3    10.0
4    11.0
5    12.0

The Series shouldn't have any floats.

[sorry for the formatting, I didn't know how to render a pandas Series in a markdown document]

@jmarshall9120
Author

@giuliobeseghi

Your t Series has NaNs at indexes 0, 1 and 2 once it is aligned with s. That's why you're having the same result as me.
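
This can be checked directly by reindexing t to the union of both indexes, which is roughly what the alignment inside combine_first has to do (reusing s and t from the example above):

t.reindex(s.index.union(t.index))
# 0     NaN
# 1     NaN
# 2     NaN
# 3    10.0
# 4    11.0
# 5    12.0
# dtype: float64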

@giuliobeseghi

> @giuliobeseghi
>
> Your t Series has NaNs at indexes 0, 1 and 2 once it is aligned with s. That's why you're having the same result as me.

Yes, it was to point out that it happens with both Series and DataFrames (not sure if it's useful information).

@TomAugspurger
Contributor

I think this is the same root issue as #7509, so closing in favor of that.
