AmbiguousTimeError on groupby when including a DST change #14682

Closed
j-santander opened this issue Nov 17, 2016 · 2 comments
Labels
Bug Groupby Timezones Timezone data dtype

Comments

@j-santander

A small, complete example of the issue

#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([1477786980, 1477790580], columns=['ts'])
df['date'] = pd.to_datetime(df.ts, unit='s').dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df.set_index('date', inplace=True)

dfo = df.groupby(pd.TimeGrouper('5min'))  # raises AmbiguousTimeError on pandas 0.19.1

print(dfo.count())

Expected Output

                           ts
date                         
2016-10-30 02:20:00+02:00   1
2016-10-30 02:25:00+02:00   0
2016-10-30 02:30:00+02:00   0
2016-10-30 02:35:00+02:00   0
2016-10-30 02:40:00+02:00   0
2016-10-30 02:45:00+02:00   0
2016-10-30 02:50:00+02:00   0
2016-10-30 02:55:00+02:00   0
2016-10-30 02:00:00+01:00   0
2016-10-30 02:05:00+01:00   0
2016-10-30 02:10:00+01:00   0
2016-10-30 02:15:00+01:00   0
2016-10-30 02:20:00+01:00   1

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.6.1
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: 1.4.8
patsy: None
dateutil: 2.4.2
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: None
openpyxl: 2.2.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

The code above raises an AmbiguousTimeError when grouping by a datetime index that spans a DST change. The Unix timestamps in the example fall on either side of the most recent DST change in Europe (2016-10-30).

The stack trace is:

Traceback (most recent call last):
  File "./t.py", line 7, in <module>
    dfo = df.groupby(pd.TimeGrouper('5min'))
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3984, in groupby
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1501, in groupby
    return klass(obj, by, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 370, in __init__
    mutated=self.mutated)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 2382, in _get_grouper
    binner, grouper, obj = key._get_grouper(obj)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1062, in _get_grouper
    r._set_binner()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 237, in _set_binner
    self.binner, self.grouper = self._get_binner()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 245, in _get_binner
    binner, bins, binlabels = self._get_binner_for_time()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 660, in _get_binner_for_time
    return self.groupby._get_time_bins(self.ax)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1118, in _get_time_bins
    base=self.base)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1262, in _get_range_edges
    closed=closed, base=base)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1326, in _adjust_dates_anchored
    return (Timestamp(fresult).tz_localize(first_tzinfo),
  File "pandas/tslib.pyx", line 621, in pandas.tslib.Timestamp.tz_localize (pandas/tslib.c:13694)
  File "pandas/tslib.pyx", line 4308, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:74816)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2016-10-30 02:20:00'), try using the 'ambiguous' argument

Code works if the series does not include a DST change (e.g. one day earlier):

#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([1477700580, 1477704180], columns=['ts'])
df['date'] = pd.to_datetime(df.ts, unit='s').dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df.set_index('date', inplace=True)

dfo = df.groupby(pd.TimeGrouper('5min'))

print(dfo.count())

This prints:

                           ts
date                         
2016-10-29 02:20:00+02:00   1
2016-10-29 02:25:00+02:00   0
2016-10-29 02:30:00+02:00   0
2016-10-29 02:35:00+02:00   0
2016-10-29 02:40:00+02:00   0
2016-10-29 02:45:00+02:00   0
2016-10-29 02:50:00+02:00   0
2016-10-29 02:55:00+02:00   0
2016-10-29 03:00:00+02:00   0
2016-10-29 03:05:00+02:00   0
2016-10-29 03:10:00+02:00   0
2016-10-29 03:15:00+02:00   0
2016-10-29 03:20:00+02:00   1

@jreback
Contributor

jreback commented Nov 17, 2016

xref #10668 (though this looks separate).

Yeah, we probably need to specify ambiguous when creating the bins. A pull request would make the fix happen sooner.
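
As a rough illustration of the `ambiguous` argument mentioned here (written against recent pandas releases and not verified on 0.19.1), a scalar `Timestamp.tz_localize` call accepts a boolean stating whether the ambiguous wall time is on DST:

```python
from datetime import timedelta

import pandas as pd

# The naive value quoted in the error message.
naive = pd.Timestamp("2016-10-30 02:20:00")

# ambiguous=True: treat the wall time as DST (CEST, UTC+02:00).
summer = naive.tz_localize("Europe/Madrid", ambiguous=True)
# ambiguous=False: treat it as standard time (CET, UTC+01:00).
winter = naive.tz_localize("Europe/Madrid", ambiguous=False)

# The two interpretations are distinct instants, one hour apart.
gap = winter - summer
print(summer, winter, gap)
```

Which flag is correct for a bin edge depends on which side of the transition the data lies, so a single hard-coded value would not be a general fix for the grouper.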

@jreback jreback added this to the Next Major Release milestone Nov 17, 2016
@j-santander
Author

I've been trying to debug the above issue.

I tried adding the ambiguous keyword to the constructor of the Timestamps, but I wasn't sure how to set it, as 'infer' didn't seem to be a valid option.
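
For context, a sketch of where 'infer' does apply in recent pandas releases (an assumption about current behaviour, not checked against 0.19.1): `DatetimeIndex.tz_localize` can infer DST from the order of a sequence that runs through the repeated hour, which is information a single Timestamp cannot provide.

```python
import pandas as pd

# Naive wall times that pass through the repeated hour in chronological order.
naive = pd.DatetimeIndex([
    "2016-10-30 02:20:00",
    "2016-10-30 02:55:00",
    "2016-10-30 02:00:00",  # the clock has fallen back here
    "2016-10-30 02:20:00",
])

# 'infer' uses the backwards jump in wall time to decide which entries are DST.
localized = naive.tz_localize("Europe/Madrid", ambiguous="infer")

offsets_hours = [ts.utcoffset().total_seconds() / 3600 for ts in localized]
print(localized)
print(offsets_hours)
```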

The code raising the exception seems to have been modified in commit dcc68d7, where the _adjust_dates_anchored() function in the pandas.tseries.resample module drops the tz information at the beginning of the function and then adds it back in the return statement.
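
A small sketch of why that drop-and-restore round trip is lossy, together with one possible alternative that does the arithmetic in UTC. This is illustrative only: it is not presented as the change made in the eventual pull request, and it relies on the Madrid offsets being whole multiples of the 5-minute bin width.

```python
from datetime import timedelta

import pandas as pd

# Second timestamp from the example: 02:23 local time, after the clock fell back.
ts = pd.Timestamp(1477790580, unit="s", tz="Europe/Madrid")

# Dropping the tz keeps only the wall-clock time. The +01:00 offset is gone,
# so localizing the floored value again would have two valid answers.
naive_edge = ts.tz_localize(None).floor("5min")

# Flooring in UTC never passes through an ambiguous wall-clock time.
utc_edge = ts.tz_convert("UTC").floor("5min").tz_convert("Europe/Madrid")

print(naive_edge)
print(utc_edge)
```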

I've modified the code to not do that, but then I had to modify an assert in pandas/tseries/index.py that checks for equality of time zones: it turns out that Europe/Madrid on DST is considered different from Europe/Madrid not on DST.
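
The inequality described here can be reproduced with pytz alone. A minimal sketch, assuming pytz is the time zone backend (as reported in the versions listed above): localizing the same wall time on either side of the transition yields distinct tzinfo objects that share one zone name.

```python
from datetime import datetime, timedelta

import pytz

madrid = pytz.timezone("Europe/Madrid")
wall = datetime(2016, 10, 30, 2, 20)

summer = madrid.localize(wall, is_dst=True)   # CEST, UTC+02:00
winter = madrid.localize(wall, is_dst=False)  # CET, UTC+01:00

# pytz attaches a different tzinfo instance per offset, so a direct
# comparison of the tzinfo objects fails even though the zone is the same.
same_tzinfo = summer.tzinfo == winter.tzinfo
same_zone_name = summer.tzinfo.zone == winter.tzinfo.zone
print(same_tzinfo, same_zone_name)
```

Comparing zone names instead of tzinfo instances is one way to make such a check insensitive to DST. Whether the pull request took that route is not stated in this thread.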

I'll try to create a pull request with my changes so that you can comment.

@jreback jreback modified the milestones: 0.19.2, Next Major Release Nov 21, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Dec 14, 2016
closes pandas-dev#14682

Author: Julian Santander <[email protected]>

Closes pandas-dev#14683 from j-santander/master and squashes the following commits:

d90afaf [Julian Santander] Addressing additional code inspection comments
817ed97 [Julian Santander] Addressing code inspections comments
99a5367 [Julian Santander] Fix unittest error and lint warning
940fb22 [Julian Santander] Avoid AmbiguousTimeError on groupby

(cherry picked from commit 9f2e453)

2 participants