BUG: Change of behavior in casting of datetime-like types in MultiIndex #43091

Closed
2 of 3 tasks
jgmarcel opened this issue Aug 18, 2021 · 17 comments
Labels
Bug, datetime.date

Comments

@jgmarcel

jgmarcel commented Aug 18, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

[Edited to provide a much simpler example.]

import datetime
import pandas as pd

print(f"Pandas version:\t{pd.__version__}\n")

df = pd.DataFrame({'date': [datetime.date(2021, 8, 1),
                            datetime.date(2021, 8, 2),
                            datetime.date(2021, 8, 3)],
                   'ticker': ['aapl', 'goog', 'yhoo'],
                   'value': [5.63269, 4.45609, 2.74843]})

df.set_index(['date', 'ticker'], inplace=True)

print(df.index.get_level_values(0))

Output

The output below has been generated with pandas 1.3.0 or higher.

Pandas version:	1.3.0

Index([2021-08-01, 2021-08-02, 2021-08-03], dtype='object', name='date')

Expected Output

The output below has been generated with pandas 1.2.5.

Pandas version:	1.2.5

DatetimeIndex(['2021-08-01', '2021-08-02', '2021-08-03'], dtype='datetime64[ns]', name='date', freq=None)

Problem description

Starting from pandas 1.3.0, the observed behavior changed: in a MultiIndex creation, datetime.date objects are not cast to datetime64 anymore. I fail to find in the What’s new page the reason for that change of behavior. Is it by design or a bug?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python           : 3.9.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.2
numpy            : 1.21.2
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 57.4.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.1 (dt dec pq3 ext lo64)
jinja2           : None
IPython          : 7.26.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : 1.4.22
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@jgmarcel added the Bug and Needs Triage labels Aug 18, 2021
@jbrockmendel added the datetime.date label Aug 18, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 19, 2021
@simonjayhawkins
Member

Thanks @jgmarcel for the report.

first bad commit: [545a942] BUG: Index([date]).astype("category").astype(object) roundtrip (#38552)

I'll mark as a regression for now pending further investigation.

Note that the set_index followed by a reset_index still creates a datetime64[ns] column from the original object column of date objects.
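The round trip described here can be checked on whichever pandas version is installed. The sketch below reuses the DataFrame from the original report; the variable name `roundtrip` is only illustrative, and no expected output is shown because the resulting dtype depends on the pandas version.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "date": [datetime.date(2021, 8, 1),
             datetime.date(2021, 8, 2),
             datetime.date(2021, 8, 3)],
    "ticker": ["aapl", "goog", "yhoo"],
    "value": [5.63269, 4.45609, 2.74843],
})

# Move the columns into a MultiIndex and straight back out again.
roundtrip = df.set_index(["date", "ticker"]).reset_index()

# Compare the dtype of the original column with the round-tripped one.
print(f"pandas {pd.__version__}")
print("before:", df["date"].dtype)
print("after: ", roundtrip["date"].dtype)
```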

cc @jbrockmendel

@simonjayhawkins added the Regression label and removed the Needs Triage label Aug 19, 2021
@simonjayhawkins added this to the 1.3.3 milestone Aug 19, 2021
@jbrockmendel
Member

Note that the set_index followed by a reset_index still creates a datetime64[ns] column from the original object column of date objects.

#38552 was about round-tripping faithfully, so I'm inclined to think that reset_index should also round-trip.

@jgmarcel support for date objects in general is spotty. Your best bet is to use datetime or Timestamp objects.

@jgmarcel
Author

@jgmarcel support for date objects in general is spotty. Your best bet is to use datetime or Timestamp objects.

Agreed. Thank you for the good advice. The problem is that date objects are all over our code, since any time a DATE column is read via pandas.read_sql(), it is translated to a series of date objects. Hence, casting everything to datetime or Timestamp objects would be a lot of effort… For now, we reverted to version 1.2.5, but I would like to avoid the work if possible.
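One possible way to limit that effort is to request the conversion at read time through the `parse_dates` parameter of `pandas.read_sql()`. The sketch below is an assumption-laden illustration, not something proposed in this thread: it uses an in-memory SQLite database and a hypothetical `prices` table as a stand-in. SQLite hands dates back as strings rather than `datetime.date` objects, so this only shows the parameter in use; whether it suits a given driver would need checking.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (date DATE, ticker TEXT, value REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("2021-08-01", "aapl", 5.63269),
        ("2021-08-02", "goog", 4.45609),
        ("2021-08-03", "yhoo", 2.74843),
    ],
)

# parse_dates asks pandas to convert the named column to datetime64 on read.
prices = pd.read_sql(
    "SELECT date, ticker, value FROM prices",
    conn,
    parse_dates=["date"],
)
conn.close()

print(prices.dtypes)
```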

@simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@jgmarcel
Author

jgmarcel commented Oct 8, 2021

Hi guys,

Do you believe it will be possible to get this fixed on version 1.3.4? I ask so I can better plan the effort of adapting our whole code base.

Thank you.

@jreback
Contributor

jreback commented Oct 8, 2021

@jgmarcel would likely take a community pull request

core can provide review

might be quite tricky as date have very little support

@simonjayhawkins
Member

changing milestone to 1.3.5

@jreback
Contributor

jreback commented Oct 28, 2021

pls look at the 1.3.x release notes. IIRC this was changed on purpose in response to unforced conversion (e.g. to datetime times) it was desirable to keep datetime.date.

@jgmarcel
Author

jgmarcel commented Oct 28, 2021

Hi @jreback,

pls look at the 1.3.x release notes. IIRC this was changed on purpose in response to unforced conversion (e.g. to datetime times) it was desirable to keep datetime.date.

As initially stated, «I fail to find in the What’s new page the reason for that change of behavior». Would you mind pointing it out to me?

Anyway, the version policy clearly states that «API breaking changes should only occur in major releases», and «a deprecation path will be provided rather than an outright breaking change». Wouldn’t that be the case here, since we are talking about a breaking change that occurred between versions 1.2.5 and 1.3.0?

cc @simonjayhawkins

@simonjayhawkins
Member

in 1.2.5..

pd.Index(
    [
        datetime.date(2021, 8, 1),
        datetime.date(2021, 8, 2),
        datetime.date(2021, 8, 3),
    ]
)

gives

Index([2021-08-01, 2021-08-02, 2021-08-03], dtype='object')

and using the DataFrame from the OP

df.set_index(["date"]).index

gives

Index([2021-08-01, 2021-08-02, 2021-08-03], dtype='object', name='date')

whereas for a MultiIndex

arr = [
    datetime.date(2021, 8, 1),
    datetime.date(2021, 8, 2),
    datetime.date(2021, 8, 3),
]
pd.MultiIndex.from_arrays([arr, arr]).levels[0]

gives

DatetimeIndex(['2021-08-01', '2021-08-02', '2021-08-03'], dtype='datetime64[ns]', freq=None)

So the Index and MultiIndex constructors were inconsistent in the handling of object dtype arrays containing datetime.date objects in pandas 1.2.5.

As initially stated, «I fail to find in the What’s new page the reason for that change of behavior». Would you mind pointing it out to me?

The change of behavior in casting of datetime-like types in MultiIndex was done in #38552. Looking at the code changes in that PR, it is clear from the changed tests and comments added that this change was intentional. Unfortunately the release note added did not refer to changes in MultiIndex construction.

Anyway, the version policy clearly states that «API breaking changes should only occur in major releases», and «a deprecation path will be provided rather than an outright breaking change». Wouldn’t that be the case here, since we are talking about a breaking change that occurred between versions 1.2.5 and 1.3.0?

The policy also states

pandas will sometimes make behavior changing bug fixes, as part of minor or patch releases. Whether or not a change is a bug fix or an API-breaking change is a judgement call. We’ll do our best, and we invite you to participate in development discussion on the issue tracker or mailing list.

So the change in behavior could be considered a bug fix, since the MultiIndex constructor was inconsistent with the Index constructor, and in that case no further action would be needed.

However, the policy also states

Whenever possible, a deprecation path will be provided rather than an outright breaking change.

and

We will not introduce new deprecations in patch releases.

So, as an alternative, we could maybe restore the old behavior for 1.3.5 and add a deprecation of this behavior in 1.4

The only code change in #38552 was removing convert_dates=True from values = maybe_infer_to_datetimelike(values, convert_dates=True)

I guess we could maybe pass a convert_dates parameter through to the Categorical constructor from the MultiIndex constructor. @jbrockmendel wdyt?

@jreback
Contributor

jreback commented Nov 4, 2021

-1 on any change here
we have very limited if any support for datetime.date

not adding more complexity

@jgmarcel
Author

jgmarcel commented Nov 4, 2021

Hi @simonjayhawkins,

Your thorough explanation is very much appreciated. I see now how the change in behavior was more of a bug fix than an API-breaking change.

If I may pick your brain here, what do you believe would be the best way to achieve the pre-1.3 behavior, i.e. having datetime.date objects cast to datetime64 in MultiIndex? Would it be to call pd.to_datetime(arg, errors='ignore') where arg takes every column of the DataFrame (since I do not know in advance what its dtypes are)? Would it be saner to do that conversion immediately before or immediately after calling the MultiIndex constructor? Any other solution?

Thank you.

@jbrockmendel
Member

I guess we could maybe pass a convert_dates parameter through to the Categorical constructor from the MultiIndex constructor. @jbrockmendel wdyt?

It's possible. Though we'd then have a breaking change for anyone relying on the 1.3 behavior.

Would it be to call pd.to_datetime(arg, errors='ignore') where arg takes every column of the DataFrame

I'd check Index(col).inferred_type == "date"

@jgmarcel
Author

jgmarcel commented Nov 4, 2021

I'd check Index(col).inferred_type == "date"

That would be a great addition, yes. Thank you for that! However, I would also have to check for an inferred type of mixed, for when my column of datetime.date objects contains null dates, right?

@simonjayhawkins
Member

removing this issue from the 1.3.5 milestone as I think the consensus is for no action.

@simonjayhawkins modified the milestones: 1.3.5, No action Nov 18, 2021
@simonjayhawkins removed the Regression label Nov 18, 2021
@simonjayhawkins
Member

@jbrockmendel if you can respond to #43091 (comment) we can probably close this issue. Thanks.

@jbrockmendel
Member

However, I would also have to check for an inferred type of mixed, for when my column of datetime.date objects contains null dates, right?

I'd go for lib.infer_dtype(col, skipna=True) == "date" instead of checking for "mixed"
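A minimal sketch of how this check could be used to get the pre-1.3 behavior back, assuming the conversion is done on the DataFrame columns before the MultiIndex is built. It uses the public `pandas.api.types.infer_dtype` rather than the internal `lib` module; the helper name `cast_date_columns` is hypothetical and not part of pandas.

```python
import datetime

import pandas as pd
from pandas.api.types import infer_dtype


def cast_date_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where object columns that hold only
    datetime.date values (nulls ignored) are cast to datetime64."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # skipna=True lets a date column with missing values still infer as "date".
        if col.dtype == object and infer_dtype(col, skipna=True) == "date":
            out[name] = pd.to_datetime(col)
    return out


df = pd.DataFrame({
    "date": [datetime.date(2021, 8, 1), None, datetime.date(2021, 8, 3)],
    "ticker": ["aapl", "goog", "yhoo"],
    "value": [5.63269, 4.45609, 2.74843],
})

# Convert first, then build the MultiIndex from the converted columns.
converted = cast_date_columns(df).set_index(["date", "ticker"])
level = converted.index.get_level_values(0)
print(type(level).__name__, level.dtype)
```

Converting before calling set_index keeps the logic independent of how the MultiIndex constructor treats date objects in any particular pandas version.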

@jgmarcel
Author

However, I would also have to check for an inferred type of mixed, for when my column of datetime.date objects contains null dates, right?

I'd go for lib.infer_dtype(col, skipna=True) == "date" instead of checking for "mixed"

Thank you very much! I was not aware of that function. Much appreciated.
