Skip to content

BUG: timezone lost in groupby-agg with cython functions #15426

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
munierSalem opened this issue Feb 16, 2017 · 10 comments
Closed

BUG: timezone lost in groupby-agg with cython functions #15426

munierSalem opened this issue Feb 16, 2017 · 10 comments
Labels
Bug Groupby Timezones Timezone data dtype
Milestone

Comments

@munierSalem
Copy link

munierSalem commented Feb 16, 2017

xref #10668 (for more examples)

Hello!

I'm running into some odd behavior trying to group rows of a pandas dataframe by ID and then selecting out max/min datetimes (w/ timezones). This is with python 2.7, pandas 0.18.1 and numpy 1.11.1 (I saw in earlier posts a similar problem was apparently fixed w/ pandas 0.15).

Specifically, if I try:
print orders.groupby('OrderID')['start_time'].agg(np.min).iloc[:5]

I get:

OrderID
O161101XVS100000044   2016-11-01 12:03:12.920000-04:00
O161101XVS100000047   2016-11-01 12:03:36.693000-04:00
O161101XVS100000098   2016-11-01 12:09:08.330000-04:00
O161101XVS100000122   2016-11-01 12:09:59.950000-04:00
O161101XVS100000152   2016-11-01 12:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]

Where the raw data had times closer to 8 am (US/Eastern). In other words, it reverted back to UTC times, even though it says it's eastern times, and has UTC-4 offset.

But if I instead try:
print orders.groupby('OrderID')['start_time'].agg(lambda x: np.min(x)).iloc[:5]

I now get:

OrderID
O161101XVS100000044   2016-11-01 08:03:12.920000-04:00
O161101XVS100000047   2016-11-01 08:03:36.693000-04:00
O161101XVS100000098   2016-11-01 08:09:08.330000-04:00
O161101XVS100000122   2016-11-01 08:09:59.950000-04:00
O161101XVS100000152   2016-11-01 08:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]

Which is the behavior I intended. This second method is vastly slower, and I would have assumed the two approaches would yield identical results ...

@TomAugspurger
Copy link
Contributor

Could you try it out on a more recent version of pandas, or add a copy-pastable example so that someone else can check? Might have been fixed already.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

further np.min almost always does the wrong thing, it doesn't respect timezones.

use .min()

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 16, 2017

Actually, here's a repro:

In [63]: ts = pd.Series(pd.date_range('2016', periods=12, freq='H').tz_localize("UTC").tz_convert("US/Eastern"))

In [64]: ts
Out[64]:
0    2015-12-31 19:00:00-05:00
1    2015-12-31 20:00:00-05:00
2    2015-12-31 21:00:00-05:00
3    2015-12-31 22:00:00-05:00
4    2015-12-31 23:00:00-05:00
                ...
7    2016-01-01 02:00:00-05:00
8    2016-01-01 03:00:00-05:00
9    2016-01-01 04:00:00-05:00
10   2016-01-01 05:00:00-05:00
11   2016-01-01 06:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [65]: ts.groupby(level=0).agg(np.min)
Out[65]:
0    2016-01-01 00:00:00-05:00
1    2016-01-01 01:00:00-05:00
2    2016-01-01 02:00:00-05:00
3    2016-01-01 03:00:00-05:00
4    2016-01-01 04:00:00-05:00
                ...
7    2016-01-01 07:00:00-05:00
8    2016-01-01 08:00:00-05:00
9    2016-01-01 09:00:00-05:00
10   2016-01-01 10:00:00-05:00
11   2016-01-01 11:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [66]: ts.groupby(level=0).min()
Out[66]:
0    2016-01-01 00:00:00-05:00
1    2016-01-01 01:00:00-05:00
2    2016-01-01 02:00:00-05:00
3    2016-01-01 03:00:00-05:00
4    2016-01-01 04:00:00-05:00
                ...
7    2016-01-01 07:00:00-05:00
8    2016-01-01 08:00:00-05:00
9    2016-01-01 09:00:00-05:00
10   2016-01-01 10:00:00-05:00
11   2016-01-01 11:00:00-05:00
dtype: datetime64[ns, US/Eastern]

further np.min almost always does the wrong thing, it doesn't respect timezones.

my thought too, but .min also shows the issue.

@TomAugspurger
Copy link
Contributor

I think the expected output there is identical to the input (since the index is already unique).

@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

In [19]: data = """O161101XVS100000044   2016-11-01 12:03:12.920000-04:00
    ...: O161101XVS100000047   2016-11-01 12:03:36.693000-04:00
    ...: O161101XVS100000098   2016-11-01 12:09:08.330000-04:00
    ...: O161101XVS100000122   2016-11-01 12:09:59.950000-04:00
    ...: O161101XVS100000152   2016-11-01 12:11:29.790000-04:00
    ...: """

In [20]: df = pd.read_csv(StringIO(data),header=None,sep='\s+')

In [21]: df.columns=['value','date','time']

In [22]: df['datetime'] = pd.to_datetime(df.date + ' ' + df.time).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

In [23]: df
Out[23]: 
                 value        date                   time                         datetime
0  O161101XVS100000044  2016-11-01  12:03:12.920000-04:00 2016-11-01 12:03:12.920000-04:00
1  O161101XVS100000047  2016-11-01  12:03:36.693000-04:00 2016-11-01 12:03:36.693000-04:00
2  O161101XVS100000098  2016-11-01  12:09:08.330000-04:00 2016-11-01 12:09:08.330000-04:00
3  O161101XVS100000122  2016-11-01  12:09:59.950000-04:00 2016-11-01 12:09:59.950000-04:00
4  O161101XVS100000152  2016-11-01  12:11:29.790000-04:00 2016-11-01 12:11:29.790000-04:00

In [24]: df.dtypes
Out[24]: 
value                           object
date                            object
time                            object
datetime    datetime64[ns, US/Eastern]
dtype: object

In [25]: df.groupby('value').datetime.min()
Out[25]: 
value
O161101XVS100000044   2016-11-01 16:03:12.920000-04:00
O161101XVS100000047   2016-11-01 16:03:36.693000-04:00
O161101XVS100000098   2016-11-01 16:09:08.330000-04:00
O161101XVS100000122   2016-11-01 16:09:59.950000-04:00
O161101XVS100000152   2016-11-01 16:11:29.790000-04:00
Name: datetime, dtype: datetime64[ns, US/Eastern]

@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

dupe of this: #10668

though I like this example.

@jreback jreback closed this as completed Feb 16, 2017
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Feb 16, 2017
@jreback jreback added this to the No action milestone Feb 16, 2017
@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

actually, let's leave this one open instead.

@jreback jreback reopened this Feb 16, 2017
@jreback jreback removed the Duplicate Report Duplicate issue or pull request label Feb 16, 2017
@jreback jreback modified the milestones: 0.20.0, No action Feb 16, 2017
@jreback jreback changed the title Odd timezone behavior using groupby/agg in pandas BUG: timezone lost in groupby-agg with cython functions Feb 16, 2017
@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

@munierSalem if you'd like to debug would be great!

The groupby tz support is a bit buggy. Basically since these are converted to i8 undert the hood to actually do the operations, need to:

  • if a datetime64tz (by-definition this will be a single tz, if its multiple this would be an object column)
    • convert to UTC, keep track of tz
    • convert to i8 (already there)
    • perform operation, comes back as i8
    • localize to UTC, convert to original tz

roughtly here:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L1896

@munierSalem
Copy link
Author

@jreback I can fix in my local repo, but I'll need to wait to do so from home to push back ... working behind a draconian corporate firewall :(

@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

sure np

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#15426

Author: Stephen Rauch <[email protected]>

Closes pandas-dev#15433 from stephenrauch/tz-lost-in-groupby-agg and squashes the following commits:

64a84ca [Stephen Rauch] BUG: GH15426 timezone lost in groupby-agg with cython functions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants