
series.dt.tz_localize() on Categorical operates on categories, not values #27952


Closed
adamhooper opened this issue Aug 16, 2019 · 3 comments · Fixed by #28300
Labels: Categorical (Categorical Data Type), Datetime (Datetime data dtype), Timezones (Timezone data dtype)

@adamhooper
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd

datetimes = pd.Series(['2019-01-01', '2019-01-01', '2019-01-02'], dtype='datetime64[ns]')
categorical = datetimes.astype('category')
categorical.dt.tz_localize(None)

Produces:

0   2019-01-01
1   2019-01-02
dtype: datetime64[ns]

Problem description

.dt.tz_localize() is operating on categorical.cat.categories, so the result has one row per category instead of one row per value. It should be operating on categorical.astype('datetime64[ns]').values.

Expected Output

According to the Categorical docs, "The returned Series (or DataFrame) is of the same type as if you used the .str. / .dt. on a Series of that type (and not of type category!).". So I think the expected value is:

>>> datetimes.dt.tz_localize(None)
0   2019-01-01
1   2019-01-01
2   2019-01-02
dtype: datetime64[ns]
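
Until this is fixed, one possible workaround is to cast the categorical back to a plain datetime Series before touching .dt. This is only a minimal sketch built on the example above; the variable names `plain` and `result` are illustrative and not part of the original report.

```python
import pandas as pd

datetimes = pd.Series(
    ["2019-01-01", "2019-01-01", "2019-01-02"], dtype="datetime64[ns]"
)
categorical = datetimes.astype("category")

# Workaround sketch: drop the category dtype first, so that .dt sees one
# datetime per row instead of one datetime per category.
plain = categorical.astype("datetime64[ns]")
result = plain.dt.tz_localize(None)

print(result)
```

The cast gives up the memory savings of the category dtype for this one operation, which may matter for very large Series.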

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.2.8-200.fc30.x86_64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.0.2
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.0
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.3.0
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

adamhooper added a commit to CJWorkbench/convert-date that referenced this issue Aug 16, 2019
Found a bug, pandas-dev/pandas#27952, which
made our module output wrong results when there are lots of duplicates.
We work around the wrong Pandas behavior by writing a special code path
that avoids the buggy code.

[finishes #167858839]
@mroeschke mroeschke added Categorical Categorical Data Type Datetime Datetime data dtype Timezones Timezone data dtype labels Aug 16, 2019
@mroeschke
Member

The line that needs fixing is here:

data = Series(orig.values.categories, name=orig.name, copy=False)

PRs welcome!
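
To illustrate why building the Series from the categories alone loses rows, here is a small sketch that contrasts it with expanding the categories through the per-row codes. It is not the patch that was merged; `from_categories` and `per_row` are hypothetical names, and the example assumes there are no missing values (a code of -1 would need separate handling).

```python
import pandas as pd

datetimes = pd.Series(
    ["2019-01-01", "2019-01-01", "2019-01-02"], dtype="datetime64[ns]"
)
categorical = datetimes.astype("category")

categories = categorical.cat.categories   # one entry per unique value
codes = categorical.cat.codes.to_numpy()  # one integer code per row

# A Series built from the categories only: duplicates and the original
# index are gone, which matches the 2-row output in the report.
from_categories = pd.Series(categories, name=categorical.name)

# Expanding the categories through the codes restores one value per row.
per_row = pd.Series(
    categories.take(codes), index=categorical.index, name=categorical.name
)

print(len(from_categories), len(per_row))
```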

@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 20, 2019
@jbrockmendel
Member

@mroeschke are you sure that's the problem? I think that might be a problem, but it looks like _delegate_method might be doing something weird too.

@mroeschke
Member

When I tried debugging quickly, that was the first spot where the behavior was incorrect. You're probably right that the behavior is also incorrect somewhere upstream or downstream.

@jreback jreback modified the milestones: Contributions Welcome, 1.0 Nov 2, 2019