REGR: outer merge resulting in missing datetime values converts to int #31157

Closed
vadella opened this issue Jan 20, 2020 · 5 comments

@vadella
vadella commented Jan 20, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    {
        "x": [0, 1],
        "date": pd.to_datetime(["2020-01-05T12:00", "2020-01-05T13:00"]),
    }
)
df2 = pd.DataFrame({"x": [0, 2], "y": [0, 1]})

merged = pd.merge(df, df2, on="x", how="outer")
merged.info()
print(merged)
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   x       3 non-null      int64  
     1   date    3 non-null      object 
     2   y       2 non-null      float64
    dtypes: float64(1), int64(1), object(1)
    memory usage: 96.0+ bytes
   x   date                 y
0  0   1578225600000000000  0.0
1  1   1578229200000000000  NaN
2  2  -9223372036854775808  1.0
merged["date"].map(type)
    0    <class 'int'>
    1    <class 'int'>
    2    <class 'int'>
    Name: date, dtype: object
print(pd.merge(df, df2, on="x", how="inner"))
   x                date  y
0  0 2020-01-05 12:00:00  0

Problem description

Merging two DataFrames such that the result contains missing datetime values silently converts the column to integers with object dtype, instead of keeping datetime64[ns]. Keeping datetime64[ns] was the behaviour in pandas 0.25.3, and it is the behaviour I would expect.

Expected Output

merged = pd.merge(df, df2, on="x", how="outer")
merged.info()
print(merged)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
x       3 non-null int64
date    2 non-null datetime64[ns]
y       2 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 96.0 bytes
   x                date    y
0  0 2020-01-05 12:00:00  0.0
1  1 2020-01-05 13:00:00  NaN
2  2                 NaT  1.0
merged["date"].map(type)
    0    <class 'pandas._libs.tslibs.timestamps.Timesta...
    1    <class 'pandas._libs.tslibs.timestamps.Timesta...
    2        <class 'pandas._libs.tslibs.nattype.NaTType'>
    Name: date, dtype: object

This is the output from pandas 0.25.3.

Whether the y-column should be converted to float or Int64 is open for debate, but not the issue at hand here.
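The expected behaviour above can be written as a small self-checking script. This is only a sketch of a regression check, not the test that was added to pandas; the variable names `left` and `right` are illustrative, and the dtype check uses `is_datetime64_any_dtype` so it does not depend on the exact datetime resolution.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Same frames as in the report: "x" is the join key, "date" exists only on the left.
left = pd.DataFrame(
    {
        "x": [0, 1],
        "date": pd.to_datetime(["2020-01-05T12:00", "2020-01-05T13:00"]),
    }
)
right = pd.DataFrame({"x": [0, 2], "y": [0, 1]})

merged = pd.merge(left, right, on="x", how="outer")

# The column should stay a datetime column rather than become object-dtype ints.
assert is_datetime64_any_dtype(merged["date"]), merged["date"].dtype

# The row that exists only in `right` (x == 2) should hold NaT, not an integer.
assert merged["date"].isna().tolist() == [False, False, True]
assert merged.loc[merged["x"] == 2, "date"].iloc[0] is pd.NaT
```

On an affected build (1.0.0rc0 in this report) the first assertion would be expected to fail, because the column comes back with object dtype.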

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0rc0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

@MarcoGorelli
Member

Thanks for the report, will take a look.

Are you interested in investigating / submitting a pull request?

@vadella
Author

vadella commented Jan 21, 2020

Interested, yes, capable, I fear not.

I've started looking into the merge code, but quickly lost track of all the moving parts (block managers, concat plans, ..). If I have more time, I will try to delve deeper into this.

@dsaxton
Member

dsaxton commented Jan 22, 2020

FWIW, I think it's related to this issue: #25014.

Changing tslibs.iNaT to tslibs.NaT here seems to fix this merge, but apparently breaks a lot of other code: https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/concat.py#L353
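To illustrate why the integers in the report look the way they do: the value printed for the missing row, -9223372036854775808, is the smallest int64, which is the integer representation of NaT in a datetime64[ns] array, and the other two values are nanoseconds since the Unix epoch. The snippet below is a sketch using only public NumPy and pandas calls; it does not touch the internal `tslibs.iNaT` name mentioned above, and `INT64_MIN`, `raw` and `recovered` are illustrative names.

```python
import numpy as np
import pandas as pd

# NaT is stored as the minimum int64 value inside a datetime64[ns] array.
INT64_MIN = np.iinfo(np.int64).min
nat_as_int = np.array(["NaT"], dtype="datetime64[ns]").view("int64")[0]
assert nat_as_int == INT64_MIN == -9223372036854775808

# The other integers from the report are nanoseconds since the epoch.
ts = pd.Timestamp(1578225600000000000)
assert ts == pd.Timestamp("2020-01-05 12:00:00")

# Reinterpreting the raw integers as datetime64[ns] recovers the expected column.
raw = np.array(
    [1578225600000000000, 1578229200000000000, INT64_MIN], dtype="int64"
)
recovered = pd.Series(raw.view("datetime64[ns]"))
assert recovered.isna().tolist() == [False, False, True]
```

This only shows how the sentinel maps back to NaT; whether reinterpreting the integers is an acceptable workaround for an affected column is an assumption, and the real fix is the one referenced later in the thread.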

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jan 22, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.0 milestone Jan 22, 2020
@jorisvandenbossche jorisvandenbossche changed the title outer merge resulting in missing datetime values converts to int REGR: outer merge resulting in missing datetime values converts to int Jan 22, 2020
@MarcoGorelli
Member

> Interested, yes, capable, I fear not.

That's OK, we'll take a look. If you want something simpler to get started with, you can look for issues labelled "good first issue".

@TomAugspurger
Contributor

Sorry, I missed this. #31208 is a duplicate of it, and #31211 is fixing things.

Closing in favor of #31208.
