REGR: outer merge resulting in missing datetime values converts to int #31157

vadella · 2020-01-20T17:42:39Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    {
        "x": [0, 1],
        "date": pd.to_datetime(["2020-01-05T12:00", "2020-01-05T13:00"]),
    }
)
df2 = pd.DataFrame({"x": [0, 2], "y": [0, 1]})

merged = pd.merge(df, df2, on="x", how="outer")
merged.info()
print(merged)

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   x       3 non-null      int64  
     1   date    3 non-null      object 
     2   y       2 non-null      float64
    dtypes: float64(1), int64(1), object(1)
    memory usage: 96.0+ bytes

   x   date                 y
0  0   1578225600000000000  0.0
1  1   1578229200000000000  NaN
2  2  -9223372036854775808  1.0

merged["date"].map(type)

    0    <class 'int'>
    1    <class 'int'>
    2    <class 'int'>
    Name: date, dtype: object

print(pd.merge(df, df2, on="x", how="inner"))

   x                date  y
0  0 2020-01-05 12:00:00  0

Problem description

Merging two DataFrames so that the result has missing datetime values silently converts the column to integers with object dtype, instead of keeping datetime64[ns]. Keeping datetime64[ns] was the behaviour in pandas 0.25.3, and it is the behaviour I would expect.

Expected Output

merged = pd.merge(df, df2, on="x", how="outer")
merged.info()
print(merged)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
x       3 non-null int64
date    2 non-null datetime64[ns]
y       2 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 96.0 bytes
   x                date    y
0  0 2020-01-05 12:00:00  0.0
1  1 2020-01-05 13:00:00  NaN
2  2                 NaT  1.0

merged["date"].map(type)

    0    <class 'pandas._libs.tslibs.timestamps.Timesta...
    1    <class 'pandas._libs.tslibs.timestamps.Timesta...
    2        <class 'pandas._libs.tslibs.nattype.NaTType'>
    Name: date, dtype: object

This is the output from pandas 0.25.3.

Whether the y-column should be converted to float or Int64 is open for debate, but not the issue at hand here.
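
For anyone stuck on an affected version, here is a minimal sketch of a possible workaround. It assumes the faulty column holds plain nanoseconds-since-epoch integers, as in the output above. The helper name `repair_datetime_column` is hypothetical and not part of pandas. The example builds the faulty column by hand, so it does not depend on which pandas version is installed.

```python
import numpy as np
import pandas as pd


def repair_datetime_column(col: pd.Series) -> pd.Series:
    """Reinterpret int64 nanoseconds-since-epoch values as datetime64[ns].

    Hypothetical helper for illustration only; not a pandas API.
    """
    # Convert the object column of Python ints to a real int64 array.
    as_int = np.asarray(col, dtype="int64")
    # Viewing int64 as datetime64[ns] reuses the same bits; the int64
    # minimum is the bit pattern of NaT, so it comes back as NaT.
    return pd.Series(as_int.view("datetime64[ns]"), index=col.index, name=col.name)


# Hand-built copy of the faulty "date" column from the report.
broken = pd.Series(
    [1578225600000000000, 1578229200000000000, -9223372036854775808],
    dtype=object,
    name="date",
)

fixed = repair_datetime_column(broken)
print(fixed)
```

This only patches up the result after the merge; it does not address the cause of the regression.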

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0rc0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None


MarcoGorelli · 2020-01-21T12:46:12Z

Thanks for the report, will take a look.

Are you interested in investigating / submitting a pull request?

vadella · 2020-01-21T17:04:56Z

Interested, yes, capable, I fear not.

I've started looking into the merge code, but quickly lost track of all the moving parts (block managers, concat plans, ...). If I have more time, I will try to delve deeper into this.

dsaxton · 2020-01-22T01:54:17Z

FWIW, I think it's related to this issue: #25014.

Changing tslibs.iNaT to tslibs.NaT here seems to fix this merge, but apparently breaks a lot of other code: https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/concat.py#L353
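
To illustrate why the integer sentinel matters, here is a small standard-library sketch that decodes the three integers from the reported output. It assumes they are nanoseconds since the Unix epoch, and it treats the timestamps as UTC purely for display (the original column is timezone-naive). The names `INT64_MIN` and `decode_ns` are made up for this example. The third value is the int64 minimum, which is the value pandas uses as iNaT.

```python
from datetime import datetime, timedelta, timezone

# The int64 minimum; pandas reserves this value as its integer NaT sentinel (iNaT).
INT64_MIN = -(2**63)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_ns(value):
    """Return a UTC datetime for nanoseconds since the epoch, or None for the sentinel."""
    if value == INT64_MIN:
        return None
    # datetime only has microsecond resolution, so drop the last three digits.
    return EPOCH + timedelta(microseconds=value // 1000)


for raw in (1578225600000000000, 1578229200000000000, -9223372036854775808):
    print(raw, "->", decode_ns(raw))
```

The first two integers decode to the two original timestamps, and the third is the sentinel, so the merged column appears to hold the raw integer representation rather than Timestamp and NaT objects.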

MarcoGorelli · 2020-01-22T09:02:38Z

> Interested, yes, capable, I fear not.

That's OK, we'll take a look. If you want something simpler to get started with, you can look for issues labelled "good first issue"

TomAugspurger · 2020-01-22T19:36:21Z

Sorry, I missed this. #31208 is a duplicate of it, and #31211 is fixing things.

Closing in favor of #31208.

jorisvandenbossche added the "Regression" label (functionality that used to work in a prior pandas version) Jan 22, 2020

jorisvandenbossche added this to the 1.0.0 milestone Jan 22, 2020

jorisvandenbossche changed the title ~~outer merge resulting in missing datetime values converts to int~~ REGR: outer merge resulting in missing datetime values converts to int Jan 22, 2020

TomAugspurger mentioned this issue Jan 22, 2020

RLS: 1.0.0 #27492

Closed

TomAugspurger closed this as completed Jan 22, 2020
