BUG: dt.tz_convert() on categorical[datetime] resets index #43080

Closed
2 of 3 tasks
tscheburaschka opened this issue Aug 17, 2021 · 3 comments · Fixed by #43226
Labels
Bug; Categorical (Categorical Data Type); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode); Timezones (Timezone data dtype)

@tscheburaschka

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import itertools
import pandas as pd

ts_list = pd.date_range(start='2021-01-01T02:00:00', periods=3, freq='1D').to_list()
df = pd.DataFrame(data=list(itertools.islice(itertools.cycle(ts_list), 9)),
                  index=[2, 4, 5, 6, 7, 8, 11, 14, 15],
                  columns=['pure_datetime'])
df['categorical_datetime'] = df['pure_datetime'].astype(pd.CategoricalDtype())

"""
>>> df
         pure_datetime categorical_datetime
2  2021-01-01 02:00:00  2021-01-01 02:00:00
4  2021-01-02 02:00:00  2021-01-02 02:00:00
5  2021-01-03 02:00:00  2021-01-03 02:00:00
6  2021-01-01 02:00:00  2021-01-01 02:00:00
7  2021-01-02 02:00:00  2021-01-02 02:00:00
8  2021-01-03 02:00:00  2021-01-03 02:00:00
11 2021-01-01 02:00:00  2021-01-01 02:00:00
14 2021-01-02 02:00:00  2021-01-02 02:00:00
15 2021-01-03 02:00:00  2021-01-03 02:00:00

>>> df.dtypes
pure_datetime           datetime64[ns]
categorical_datetime          category
dtype: object
"""

# Now apply some function from .dt namespace.
"""
>>> df['pure_datetime'].dt.tz_localize('Europe/Berlin')
2    2021-01-01 02:00:00+01:00
4    2021-01-02 02:00:00+01:00
5    2021-01-03 02:00:00+01:00
6    2021-01-01 02:00:00+01:00
7    2021-01-02 02:00:00+01:00
8    2021-01-03 02:00:00+01:00
11   2021-01-01 02:00:00+01:00
14   2021-01-02 02:00:00+01:00
15   2021-01-03 02:00:00+01:00
Name: pure_datetime, dtype: datetime64[ns, Europe/Berlin]
"""
# All is fine with pure datetime column

# BUT, when applied to categorical column, the index is reset:
"""
>>> df['categorical_datetime'].dt.tz_localize('Europe/Berlin')
0   2021-01-01 02:00:00+01:00
1   2021-01-02 02:00:00+01:00
2   2021-01-03 02:00:00+01:00
3   2021-01-01 02:00:00+01:00
4   2021-01-02 02:00:00+01:00
5   2021-01-03 02:00:00+01:00
6   2021-01-01 02:00:00+01:00
7   2021-01-02 02:00:00+01:00
8   2021-01-03 02:00:00+01:00
Name: categorical_datetime, dtype: datetime64[ns, Europe/Berlin]
"""

# Naively assigning the results to new columns yields misaligned values
df['localized_pure'] = df['pure_datetime'].dt.tz_localize('Europe/Berlin')
df['localized_categorical'] = df['categorical_datetime'].dt.tz_localize('Europe/Berlin')
"""
>>> df
         pure_datetime categorical_datetime            localized_pure     localized_categorical
2  2021-01-01 02:00:00  2021-01-01 02:00:00 2021-01-01 02:00:00+01:00 2021-01-03 02:00:00+01:00
4  2021-01-02 02:00:00  2021-01-02 02:00:00 2021-01-02 02:00:00+01:00 2021-01-02 02:00:00+01:00
5  2021-01-03 02:00:00  2021-01-03 02:00:00 2021-01-03 02:00:00+01:00 2021-01-03 02:00:00+01:00
6  2021-01-01 02:00:00  2021-01-01 02:00:00 2021-01-01 02:00:00+01:00 2021-01-01 02:00:00+01:00
7  2021-01-02 02:00:00  2021-01-02 02:00:00 2021-01-02 02:00:00+01:00 2021-01-02 02:00:00+01:00
8  2021-01-03 02:00:00  2021-01-03 02:00:00 2021-01-03 02:00:00+01:00 2021-01-03 02:00:00+01:00
11 2021-01-01 02:00:00  2021-01-01 02:00:00 2021-01-01 02:00:00+01:00                       NaT
14 2021-01-02 02:00:00  2021-01-02 02:00:00 2021-01-02 02:00:00+01:00                       NaT
15 2021-01-03 02:00:00  2021-01-03 02:00:00 2021-01-03 02:00:00+01:00                       NaT
"""

Problem description

When working on categorical columns, I would expect every method that can be called without an error or warning to yield the same result as when applied to the pure (category-expanded) column. Some methods from the .dt namespace (at least tz_localize() and tz_convert()) seem to return inconsistent results.
In particular, they reset the index of the input column (which is in fact a Series).

Expected Output

I expect the method .dt.tz_localize() to not reset the index when applied to a Series of datetime categoricals.
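
On affected versions (1.3.2 in this report), one possible workaround is to put the original index back on the result. This is a minimal sketch, assuming the returned values keep their positional order, as the output above suggests. The variable names are illustrative and not part of the original report.

```python
import itertools
import pandas as pd

# Rebuild the categorical datetime column from the report as a standalone Series.
ts_list = pd.date_range(start="2021-01-01T02:00:00", periods=3, freq="1D").to_list()
s = pd.Series(
    list(itertools.islice(itertools.cycle(ts_list), 9)),
    index=[2, 4, 5, 6, 7, 8, 11, 14, 15],
    name="categorical_datetime",
).astype("category")

localized = s.dt.tz_localize("Europe/Berlin")
# Assumption: the values are in the same positional order as `s`,
# so the original labels can simply be reattached.
# On versions where the index is already preserved this changes nothing.
localized.index = s.index

# Assignment now aligns on the original labels, so no NaT rows appear.
df = s.to_frame()
df["localized_categorical"] = localized
print(df)
```

Reattaching the index by position is only safe if the accessor does not reorder rows. If that is in doubt, casting the column back to a plain datetime dtype before calling the .dt method may be the more defensive option.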

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 0, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@tscheburaschka added the Bug and Needs Triage labels Aug 17, 2021
@deepanshu-Raj

@tscheburaschka Something I observed:

pd.Series(df['pure_datetime'], dtype=pd.CategoricalDtype())
as used here:
df['categorical_datetime'] = df['pure_datetime'].astype(pd.CategoricalDtype())
yields an output column (categorical_datetime) with the original index, i.e. [2, 4, 5, 6, 7, 8, 11, 14, 15],
so the resulting mapping fits the DataFrame.

In contrast,
pd.Series(df['pure_datetime'].values, dtype=pd.CategoricalDtype())
yields an output column (categorical_datetime) with the index [0, 1, 2, 3, 4, 5, 6, 7, 8].

Since df['categorical_datetime'].dt.tz_localize('Europe/Berlin') seems to use the second of the two mappings mentioned, and its index only runs up to 8, the bottom three rows (labels 11, 14, and 15) come out as NaT.
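
The difference between the two constructions can be reproduced in isolation. The following is a small sketch with a shortened, made-up index ([2, 4, 11]) rather than the data from the report. It only shows how passing `.values` drops the labels and how label-based assignment then produces missing values.

```python
import pandas as pd

pure = pd.Series(
    pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    index=[2, 4, 11],
)

# Built from the Series itself: the original labels are kept.
with_index = pd.Series(pure, dtype="category")
# Built from the bare values: a fresh 0..n-1 index is created.
without_index = pd.Series(pure.values, dtype="category")

print(with_index.index.tolist())     # [2, 4, 11]
print(without_index.index.tolist())  # [0, 1, 2]

# Column assignment aligns on labels, not positions.
# Only label 2 exists in both indexes, so labels 4 and 11 get missing values.
df = pd.DataFrame({"pure": pure})
df["misaligned"] = without_index
print(df["misaligned"].isna().tolist())  # [False, True, True]
```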

@mroeschke added the Categorical and Timezones labels and removed the Needs Triage label Aug 22, 2021
@tscheburaschka
Author

@deepanshu-Raj

and, since df['categorical_datetime'].dt.tz_localize('Europe/Berlin') considers the second mapping out of the two mentioned,

Did you get that from inspecting the pandas source code? Otherwise, I would expect df[<any column>].dt.<some func>() to always return a Series with the original index of df, such that assignments of the form
df[<new column>] = df[<existing column>].dt.<some func>()
align with the existing index and work as expected.
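
The expectation stated here can be written down as a small check. This is a sketch only: `dt_call_preserves_index` is a hypothetical helper, not a pandas function, and the sample index is made up. On versions affected by this issue the categorical case is expected to report False.

```python
import pandas as pd


def dt_call_preserves_index(series, method, *args, **kwargs):
    """Return True if series.dt.<method>(...) keeps the index of `series`."""
    result = getattr(series.dt, method)(*args, **kwargs)
    return result.index.equals(series.index)


pure = pd.Series(
    pd.date_range("2021-01-01 02:00", periods=3, freq="D"),
    index=[2, 4, 11],
)

pure_ok = dt_call_preserves_index(pure, "tz_localize", "Europe/Berlin")
categorical_ok = dt_call_preserves_index(
    pure.astype("category"), "tz_localize", "Europe/Berlin"
)
print("pure datetime keeps index:", pure_ok)
print("categorical datetime keeps index:", categorical_ok)
```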

@kurchi1205
Contributor

kurchi1205 commented Aug 26, 2021

@mroeschke
Can I work on this bug?
Referring to #28300: it can be solved by including the index when dealing with categorical data.

@jreback added the Reshaping label Aug 31, 2021
@jreback added this to the 1.4 milestone Aug 31, 2021