bugs when groupby/aggregate on columns with datetime64[ns, timezone] #12898


Closed
sdementen opened this issue Apr 14, 2016 · 4 comments
Labels: Bug, Duplicate Report, Groupby, Timezones


sdementen (Contributor) commented Apr 14, 2016

When aggregating a column with dtype datetime64[ns, timezone], pandas does not handle the timezone information properly. It looks as if pandas drops the timezone info (i.e. does its calculations on the naive UTC values) and then localizes the result to the original timezone, i.e. it does something like
column_with_UTC_dates_in_naive_format.tz_localize(timezone)
instead of
column_with_UTC_dates_in_naive_format.tz_localize("UTC").tz_convert(timezone)

This buggy behavior is even more striking when running the sample with
idx = pandas.date_range("2016-03-27", "2016-03-29", freq="H", closed="left", tz="Europe/Brussels") (which covers the DST change), as the naive UTC date "2016-03-27 03:00" cannot be localized to the timezone because it is not a valid wall-clock time there. In this case, the dtype of the resulting column is no longer datetime64[ns, timezone] but plain int64.
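The localize-vs-convert distinction described above can be shown directly on a single timestamp, independent of the groupby machinery (a minimal sketch, not from the original report):

```python
import pandas as pd

# A naive timestamp that actually represents a UTC instant.
ts = pd.Timestamp("2016-01-01 10:00")

# Wrong: treats the naive value as Brussels wall-clock time.
wrong = ts.tz_localize("Europe/Brussels")

# Right: mark the value as UTC first, then convert to the target zone.
right = ts.tz_localize("UTC").tz_convert("Europe/Brussels")

print(wrong)  # 2016-01-01 10:00:00+01:00
print(right)  # 2016-01-01 11:00:00+01:00
```

The two results differ by exactly the UTC offset, which is the shift visible in the aggregated output below.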

Code Sample (copy-pastable)

import pandas

# create a DataFrame with a "time" column holding tz-aware hourly timestamps
idx = pandas.date_range("2016-01-01", "2016-01-03", freq="H", closed="left", tz="Europe/Brussels")
df = pandas.Series(idx, index=idx, name="time").to_frame()
# take the min and max of the "time" column per hour of the day
df_agg = df.groupby(idx.hour).aggregate(["min", "max"])
df_agg.info()
print(df_agg)

Expected Output

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2016-01-01 00:00:00+01:00 to 2016-01-02 23:00:00+01:00
Freq: H
Data columns (total 1 columns):
time    48 non-null datetime64[ns, Europe/Brussels]
dtypes: datetime64[ns, Europe/Brussels](1)
memory usage: 768.0 bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 2 columns):
(time, min)    24 non-null datetime64[ns, Europe/Brussels]
(time, max)    24 non-null datetime64[ns, Europe/Brussels]
dtypes: datetime64[ns, Europe/Brussels](2)
memory usage: 576.0 bytes
                        time                          
                         min                       max
0  2015-12-31 23:00:00+01:00 2016-01-01 23:00:00+01:00
1  2016-01-01 00:00:00+01:00 2016-01-02 00:00:00+01:00
2  2016-01-01 01:00:00+01:00 2016-01-02 01:00:00+01:00
3  2016-01-01 02:00:00+01:00 2016-01-02 02:00:00+01:00
4  2016-01-01 03:00:00+01:00 2016-01-02 03:00:00+01:00
5  2016-01-01 04:00:00+01:00 2016-01-02 04:00:00+01:00
6  2016-01-01 05:00:00+01:00 2016-01-02 05:00:00+01:00
7  2016-01-01 06:00:00+01:00 2016-01-02 06:00:00+01:00
8  2016-01-01 07:00:00+01:00 2016-01-02 07:00:00+01:00
9  2016-01-01 08:00:00+01:00 2016-01-02 08:00:00+01:00
10 2016-01-01 09:00:00+01:00 2016-01-02 09:00:00+01:00
11 2016-01-01 10:00:00+01:00 2016-01-02 10:00:00+01:00
12 2016-01-01 11:00:00+01:00 2016-01-02 11:00:00+01:00
13 2016-01-01 12:00:00+01:00 2016-01-02 12:00:00+01:00
14 2016-01-01 13:00:00+01:00 2016-01-02 13:00:00+01:00
15 2016-01-01 14:00:00+01:00 2016-01-02 14:00:00+01:00
16 2016-01-01 15:00:00+01:00 2016-01-02 15:00:00+01:00
17 2016-01-01 16:00:00+01:00 2016-01-02 16:00:00+01:00
18 2016-01-01 17:00:00+01:00 2016-01-02 17:00:00+01:00
19 2016-01-01 18:00:00+01:00 2016-01-02 18:00:00+01:00
20 2016-01-01 19:00:00+01:00 2016-01-02 19:00:00+01:00
21 2016-01-01 20:00:00+01:00 2016-01-02 20:00:00+01:00
22 2016-01-01 21:00:00+01:00 2016-01-02 21:00:00+01:00
23 2016-01-01 22:00:00+01:00 2016-01-02 22:00:00+01:00

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.6.7
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

jreback (Contributor) commented Apr 14, 2016

So your example is confusing, because the expected output is what you are getting, not what you EXPECT to get. Can you update? I would also use a much simpler example which illustrates the issue.

@jreback jreback added this to the Next Major Release milestone Apr 14, 2016
@jreback jreback added the Bug label Apr 14, 2016
jreback (Contributor) commented Apr 14, 2016

So this is manifesting as a different version of #10668.

vladu (Contributor) commented Sep 20, 2016

Here's a shorter demonstration of the same behavior:

>>> import pandas
>>> s = pandas.Series(pandas.date_range('2015-10-07 10:30:00', periods=1, tz='America/New_York'))
>>> s
0   2015-10-07 10:30:00-04:00
dtype: datetime64[ns, America/New_York]

>>> s.groupby(s.dt.year).max()
2015   2015-10-07 14:30:00-04:00
dtype: datetime64[ns, America/New_York]

Note how the time changes between the original series and the groupby+max transformation.
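The invariant this example violates can be checked directly: the max over a single-element group must return the original timestamp unchanged. A minimal sketch of that check (assuming a pandas version in which this bug has been fixed; on the affected versions the first assertion fails):

```python
import pandas as pd

# One tz-aware timestamp; min/max over its single-element group should
# return it unchanged -- the buggy versions shifted it by the UTC offset.
s = pd.Series(pd.date_range('2015-10-07 10:30:00', periods=1,
                            tz='America/New_York'))
result = s.groupby(s.dt.year).max()

assert result.iloc[0] == s.iloc[0]  # same instant, no shift
assert str(result.dtype) == 'datetime64[ns, America/New_York]'  # tz preserved
```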

@jreback jreback added the Duplicate Report Duplicate issue or pull request label Sep 20, 2016
jreback (Contributor) commented Sep 20, 2016

OK, as I said above, this is a dupe of the same behavior as in #10668.

@jreback jreback closed this as completed Sep 20, 2016