Resampling a Series with a timezone using kind='period' Crashes with ~6000 Values #5430


Closed
kevinastone opened this issue Nov 4, 2013 · 23 comments · Fixed by #5432
Labels: Bug, Groupby, Period, Timezones
@kevinastone (Contributor)

I wrote a test case that consistently crashes the entire process. It looks like it requires a Series localized to a timezone that observes DST, with data crossing the DST boundary. Finally, you have to use kind='period' for the resample() operation. Oddly, it's not just the actual boundary, because I can create a smaller dataset that resamples fine (included in the test case with the _works suffix).

With that combination, the code crashes the entire process with a glibc error.

Crashing Test Case

https://gist.github.com/kevinastone/7297033

>>> series.resample('D', how='sum', kind='period')
*** glibc detected *** ${VENV_DIR}/bin/python: double free or corruption (!prev): 0x00007fd6845cbb60 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fd693178b96]
${VENV_DIR}/local/lib/python2.7/site-packages/numpy/core/multiarray.so(+0x5f7a6)[0x7fd68f0f67a6]
${VENV_DIR}/local/lib/python2.7/site-packages/numpy/core/multiarray.so(+0xc4026)[0x7fd68f15b026]
${VENV_DIR}/local/lib/python2.7/site-packages/pandas/algos.so(+0x1184b2)[0x7fd689b004b2]
${VENV_DIR}/local/lib/python2.7/site-packages/pandas/algos.so(+0x11933c)[0x7fd689b0133c]
...
@jreback (Contributor) commented Nov 4, 2013

what version of pandas?

@kevinastone (Contributor, Author)

Sorry, 0.12.0

@jreback (Contributor) commented Nov 4, 2013

Can you try on master?

@jtratner (Contributor) commented Nov 4, 2013

Also, is 6K the minimum that causes this to occur?

@kevinastone (Contributor, Author)

Yeah, the error still persists.

I created a new virtualenv and cloned master at c435e72. After installing pandas from source, my pip freeze looks like the following:

Cython==0.19.2
argparse==1.2.1
distribute==0.6.24
numpy==1.8.0
pandas==0.12.0-1040-gc435e72
python-dateutil==2.2
pytz==2013.7
six==1.4.1
wsgiref==0.1.2

@kevinastone (Contributor, Author)

Well, the error happens with 5998 entries, so it's not a 6k boundary, but it didn't seem to happen with ~5400 entries. I just kept moving the date and re-running the test.

@jreback (Contributor) commented Nov 4, 2013

OK, thanks for the report; marking as a bug.

@jtratner (Contributor) commented Nov 4, 2013

Okay, that narrows it down.

@kevinastone (Contributor, Author)

I'm still tracing the execution, but I wouldn't be surprised if this failure to correct a non-monotonic index is leading to the crash.

https://github.com/pydata/pandas/blob/master/pandas/tseries/resample.py#L80

@jtratner (Contributor) commented Nov 4, 2013

So if you put a sort_index() call there it works?

@kevinastone (Contributor, Author)

The exception path seems to call sort_index() without arguments. I looked at the index, and it seems to be sorted properly.

If I change it from a built-in function (like sum) to a custom one, it works without crashing. So, something in the optimized path is unhappy.

# Works fine
In [30]: series.resample('D', how=lambda x: len(x), kind='period')
Out[30]: 
2013-10-04     218
2013-10-05     182
2013-10-06     108
2013-10-07     112
2013-10-08     262
2013-10-09     175
2013-10-10     185
2013-10-11      76
2013-10-12      36
2013-10-13     264
2013-10-14     200
2013-10-15     228
2013-10-16     208
2013-10-17     172
2013-10-18     114
2013-10-19     130
2013-10-20     249
2013-10-21     341
2013-10-22    1210
2013-10-23     505
2013-10-24     202
2013-10-25     223
2013-10-26      83
2013-10-27      81
2013-10-28     153
2013-10-29      60
2013-10-30      45
2013-10-31      50
2013-11-01      31
2013-11-02      62
2013-11-03      16
Freq: D, dtype: int64

@kevinastone (Contributor, Author)

Another data point: switching from sum to mean gives an error about a mismatched index size (the index size was 31, the output was 32):

pandas_bug.py in crash()
     47         # series.resample('D', how='sum', kind='period')
---> 48         series.resample('D', how='mean', kind='period')
     49         # series.resample('D', how=lambda x: len(x), kind='period')
     50 

pandas/core/generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
   2413                               fill_method=fill_method, convention=convention,
   2414                               limit=limit, base=base)
-> 2415         return sampler.resample(self)
   2416 
   2417     def first(self, offset):

pandas/tseries/resample.pyc in resample(self, obj)
     82 
     83         if isinstance(axis, DatetimeIndex):
---> 84             rs = self._resample_timestamps(obj)
     85         elif isinstance(axis, PeriodIndex):
     86             offset = to_offset(self.freq)

pandas/tseries/resample.pyc in _resample_timestamps(self, obj)
    227             # Irregular data, have to use groupby
    228             grouped = obj.groupby(grouper, axis=self.axis)
--> 229             result = grouped.aggregate(self._agg_method)
    230 
    231             if self.fill_method is not None:

pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
   1465         """
   1466         if isinstance(func_or_funcs, compat.string_types):
-> 1467             return getattr(self, func_or_funcs)(*args, **kwargs)
   1468 
   1469         if hasattr(func_or_funcs, '__iter__'):

pandas/core/groupby.pyc in mean(self)
    404         except Exception:  # pragma: no cover
    405             f = lambda x: x.mean(axis=self.axis)
--> 406             return self._python_agg_general(f)
    407 
    408     def median(self):

pandas/core/groupby.pyc in _python_agg_general(self, func, *args, **kwargs)
    540                 output[name] = self._try_cast(values[mask],result)
    541 
--> 542         return self._wrap_aggregated_output(output)
    543 
    544     def _wrap_applied_output(self, *args, **kwargs):

pandas/core/groupby.pyc in _wrap_aggregated_output(self, output, names)
   1529             return DataFrame(output, index=index, columns=names)
   1530         else:
-> 1531             return Series(output, index=index, name=self.name)
   1532 
   1533     def _wrap_applied_output(self, keys, values, not_indexed_same=False):

pandas/core/series.pyc in __init__(self, data, index, dtype, name, copy, fastpath)
    215                                        raise_cast_failure=True)
    216 
--> 217                 data = SingleBlockManager(data, index, fastpath=True)
    218 
    219         generic.NDFrame.__init__(self, data, fastpath=True)

pandas/core/internals.pyc in __init__(self, block, axis, do_integrity_check, fastpath)
   3297                 block = block[0]
   3298             if not isinstance(block, Block):
-> 3299                 block = make_block(block, axis, axis, ndim=1, fastpath=True)
   3300 
   3301         else:

pandas/core/internals.pyc in make_block(values, items, ref_items, klass, ndim, dtype, fastpath, placement)
   1805                 klass = ObjectBlock
   1806 
-> 1807     return klass(values, items, ref_items, ndim=ndim, fastpath=fastpath, placement=placement)
   1808 
   1809 # TODO: flexible with index=None and/or items=None

pandas/core/internals.pyc in __init__(self, values, items, ref_items, ndim, fastpath, placement)
     60         if len(items) != len(values):
     61             raise ValueError('Wrong number of items passed %d, indices imply %d'
---> 62                              % (len(items), len(values)))
     63 
     64         self.set_ref_locs(placement)

ValueError: Wrong number of items passed 31, indices imply 32
> pandas/core/internals.py(62)__init__()
     61             raise ValueError('Wrong number of items passed %d, indices imply %d'
---> 62                              % (len(items), len(values)))
     63 

@jtratner (Contributor) commented Nov 4, 2013

What happens if you call sort_index() on it? Does it crash?

@kevinastone (Contributor, Author)

Yeah, it still crashes. I think I was a bit misguided about the sort_index(); that seems to be the correct way to set it up, even though catching a TypeError is an odd way to test the input.

@jtratner (Contributor) commented Nov 4, 2013

What? I'm confused now.

@kevinastone (Contributor, Author)

It doesn't seem to be due to the sorting of the index. It looks like the grouper is creating an extra group for the day of the daylight-saving transition.

Even small sets that cross a day boundary fail, like this case with a range from 11/1 to 11/2, so it's not just daylight-saving related.

import pytz
import pandas as pd
import datetime

local_timezone = pytz.timezone('America/Los_Angeles')

start = datetime.datetime(year=2013, month=11, day=1, hour=0, minute=0, tzinfo=pytz.utc)
# 1 day later
end = datetime.datetime(year=2013, month=11, day=2, hour=0, minute=0, tzinfo=pytz.utc)

index = pd.date_range(start, end, freq='H')

series = pd.Series(index=index)
series = series.tz_convert(local_timezone)
series.resample('D', kind='period')
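The underlying timezone loss can be shown directly, independent of resample: converting a tz-aware index to daily periods and back to timestamps yields tz-naive stamps. A minimal sketch in current pandas syntax (variable names are mine, not from the pandas code):

```python
import pandas as pd

# A tz-aware hourly index like the one in the repro above:
# 2013-11-01 00:00 UTC .. 2013-11-02 00:00 UTC, viewed in LA time.
idx = pd.date_range("2013-11-01", "2013-11-02", freq="h",
                    tz="UTC").tz_convert("America/Los_Angeles")

# Converting to daily periods and back to timestamps drops the tz,
# so comparing these stamps against the tz-aware axis partitions
# the data on the wrong local-day boundaries.
labels = idx.to_period("D")
stamps = labels.to_timestamp()

print(idx.tz)     # America/Los_Angeles
print(stamps.tz)  # None -- the timezone information was lost
```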

@kevinastone (Contributor, Author)

Here's the problem:

https://github.com/pydata/pandas/blob/master/pandas/tseries/resample.py#L195

It's converting the periods back into timestamps, but it loses the timezone in the process, so it's partitioning incorrectly.

Also, it's hard-coded to a day ('D') frequency, when it really should use the input frequency (self.freq).

        end_stamps = (labels + 1).asfreq(self.freq, 's').to_timestamp()
        if axis.tzinfo:
            end_stamps = end_stamps.tz_localize(axis.tzinfo)
        bins = axis.searchsorted(end_stamps, side='left')
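The effect of re-localizing the end stamps can be exercised in isolation. This is a sketch in current pandas syntax, not the actual pandas internals; the axis and frequency are taken from the small 11/1–11/2 repro above:

```python
import pandas as pd

# Hourly tz-aware axis: 2013-11-01 00:00 UTC .. 2013-11-02 00:00 UTC,
# i.e. Oct 31 17:00 .. Nov 1 17:00 in America/Los_Angeles (25 points).
axis = pd.date_range("2013-11-01", "2013-11-02", freq="h",
                     tz="UTC").tz_convert("America/Los_Angeles")

# Daily period labels for the local days the axis covers.
labels = axis.to_period("D").unique()

# As in the patch: take the start of each *following* period...
end_stamps = (labels + 1).asfreq("h", "s").to_timestamp()
# ...and re-localize, since to_timestamp() returned naive stamps.
if axis.tz is not None:
    end_stamps = end_stamps.tz_localize(axis.tz)

# Left-searchsorted now splits on the correct local-day boundaries:
# 7 points fall on Oct 31 local, the remaining 18 on Nov 1.
bins = axis.searchsorted(end_stamps, side="left")
print(list(bins))  # [7, 25]
```

Without the tz_localize step, the naive end stamps would be compared against UTC-based positions and produce bin edges one timezone offset off.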

@jtratner (Contributor) commented Nov 4, 2013

Nice detective work there! Do you want to submit a pull request as well?

@cpcloud (Member) commented Nov 4, 2013

I think this might fix #4076 and #3609 too.

@jreback (Contributor) commented Nov 4, 2013

@kevinastone, can you try the test cases for #4076 and #3609 to see if your fix helps there too? (Put them in separate commits; if they don't work, they can easily be discarded, or you could do a separate PR.)

@kevinastone (Contributor, Author)

Negative on #4076; it's still adding an extra period:

In [8]: last_4_weeks_range = pandas.date_range(
   ...:     start=datetime.datetime(2001, 5, 4), periods=28)
   ...: last_4_weeks = pandas.DataFrame(
   ...:     [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
   ...:       'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
   ...:     [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
   ...:       'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
   ...:     index=last_4_weeks_range.append(last_4_weeks_range))
   ...: last_4_weeks.sort(inplace=True)

In [9]: last_4_weeks.resample('7D', how='sum')
Out[9]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  \
2001-05-04               280              560          700        1050   
2001-05-11               280              560          700        1050   
2001-05-18               280              560          700        1050   
2001-05-25               280              560          700        1050   
2001-06-01                 0                0            0           0   

            REST_KEY  
2001-05-04        21  
2001-05-11        21  
2001-05-18        21  
2001-05-25        21  
2001-06-01         0  

Affirmative on #3609:

In [28]: s = Series(range(100),index=date_range('20130101',freq='s',periods=100),dtype='float')

In [29]: s[10:30] = np.nan

In [30]: s.to_period().resample('T',kind='period')
Out[30]: 
2013-01-01 00:00    34.5
2013-01-01 00:01    79.5
Freq: T, dtype: float64

In [31]: s.resample('T',kind='period')
Out[31]: 
2013-01-01 00:00    34.5
2013-01-01 00:01    79.5
Freq: T, dtype: float64
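For reference, the #3609 check above can be reproduced in current pandas syntax (the how= argument used above has since been removed in favor of method chaining; this is a sketch, not the original test):

```python
import numpy as np
import pandas as pd

# Same data as the #3609 check: 100 seconds of float data with
# positions 10..29 set to NaN.
s = pd.Series(np.arange(100, dtype="float64"),
              index=pd.date_range("2013-01-01", freq="s", periods=100))
s.iloc[10:30] = np.nan

# Per-minute mean, skipping NaNs: two bins are expected.
out = s.resample("min").mean()
print(out)
# First minute: mean of 0..9 and 30..59 -> 1380 / 40 = 34.5
# Second minute: mean of 60..99 -> 79.5
```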

@jreback (Contributor) commented Nov 4, 2013

Great, so add that as a test to the PR (#3609).

@kevinastone (Contributor, Author)

Done
