Resampling a Series with a timezone using kind='period' Crashes with ~6000 Values #5430


Closed
kevinastone opened this issue Nov 4, 2013 · 23 comments · Fixed by #5432
Labels: Bug, Groupby, Period, Timezones
@kevinastone (Contributor)

I wrote a test case that consistently crashes the entire process. It looks like it requires a Series localized to a timezone that observes DST, with data crossing the DST boundary. Finally, you have to use kind='period' for the resample() operation. Oddly, it's not just the actual boundary, because I can create a smaller dataset that resamples fine (included in the test case with the _works suffix).

With that combination, the code crashes the entire process with a glibc error.

Crashing Test Case

https://gist.github.com/kevinastone/7297033

>>> series.resample('D', how='sum', kind='period')
*** glibc detected *** ${VENV_DIR}/bin/python: double free or corruption (!prev): 0x00007fd6845cbb60 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fd693178b96]
${VENV_DIR}/local/lib/python2.7/site-packages/numpy/core/multiarray.so(+0x5f7a6)[0x7fd68f0f67a6]
${VENV_DIR}/local/lib/python2.7/site-packages/numpy/core/multiarray.so(+0xc4026)[0x7fd68f15b026]
${VENV_DIR}/local/lib/python2.7/site-packages/pandas/algos.so(+0x1184b2)[0x7fd689b004b2]
${VENV_DIR}/local/lib/python2.7/site-packages/pandas/algos.so(+0x11933c)[0x7fd689b0133c]
...
@jreback (Contributor) commented Nov 4, 2013

what version of pandas?

@kevinastone (Contributor, Author)

Sorry, 0.12.0

@jreback (Contributor) commented Nov 4, 2013

Can you try on master?

@jtratner (Contributor) commented Nov 4, 2013

Also, is 6K the minimum that causes this to occur?

@kevinastone (Contributor, Author)

Yeah, the error still persists.

I created a new virtualenv and cloned master at c435e72. After installing pandas from source, my pip freeze looks like the following:

Cython==0.19.2
argparse==1.2.1
distribute==0.6.24
numpy==1.8.0
pandas==0.12.0-1040-gc435e72
python-dateutil==2.2
pytz==2013.7
six==1.4.1
wsgiref==0.1.2

@kevinastone (Contributor, Author)

Well, the error happens with 5998 entries, so it's not a 6k boundary, but it didn't seem to happen with ~5400 entries. I just kept moving the date and re-running the test.

@jreback (Contributor) commented Nov 4, 2013

OK, thanks for the report; marking as a bug.

@jtratner (Contributor) commented Nov 4, 2013

Okay, that narrows it down.

@kevinastone (Contributor, Author)

I'm still tracing the execution, but I wouldn't be surprised if this failure to correct a non-monotonic index is leading to the crash.

https://github.com/pydata/pandas/blob/master/pandas/tseries/resample.py#L80

@jtratner (Contributor) commented Nov 4, 2013

So if you put a sort_index() call there it works?

@kevinastone (Contributor, Author)

The exception path seems to call sort_index() without arguments. I looked at the index, and it seems to be sorted properly.

If I change it from a built-in function (like sum) to a custom one, it works without crashing. So, something in the optimized path is unhappy.

# Works fine
In [30]: series.resample('D', how=lambda x: len(x), kind='period')
Out[30]: 
2013-10-04     218
2013-10-05     182
2013-10-06     108
2013-10-07     112
2013-10-08     262
2013-10-09     175
2013-10-10     185
2013-10-11      76
2013-10-12      36
2013-10-13     264
2013-10-14     200
2013-10-15     228
2013-10-16     208
2013-10-17     172
2013-10-18     114
2013-10-19     130
2013-10-20     249
2013-10-21     341
2013-10-22    1210
2013-10-23     505
2013-10-24     202
2013-10-25     223
2013-10-26      83
2013-10-27      81
2013-10-28     153
2013-10-29      60
2013-10-30      45
2013-10-31      50
2013-11-01      31
2013-11-02      62
2013-11-03      16
Freq: D, dtype: int64

@kevinastone (Contributor, Author)

Another data point: switching from sum to mean gives an error about a mismatched index size (the index size was 31, the output was 32):

pandas_bug.py in crash()
     47         # series.resample('D', how='sum', kind='period')
---> 48         series.resample('D', how='mean', kind='period')
     49         # series.resample('D', how=lambda x: len(x), kind='period')
     50 

pandas/core/generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
   2413                               fill_method=fill_method, convention=convention,
   2414                               limit=limit, base=base)
-> 2415         return sampler.resample(self)
   2416 
   2417     def first(self, offset):

pandas/tseries/resample.pyc in resample(self, obj)
     82 
     83         if isinstance(axis, DatetimeIndex):
---> 84             rs = self._resample_timestamps(obj)
     85         elif isinstance(axis, PeriodIndex):
     86             offset = to_offset(self.freq)

pandas/tseries/resample.pyc in _resample_timestamps(self, obj)
    227             # Irregular data, have to use groupby
    228             grouped = obj.groupby(grouper, axis=self.axis)
--> 229             result = grouped.aggregate(self._agg_method)
    230 
    231             if self.fill_method is not None:

pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
   1465         """
   1466         if isinstance(func_or_funcs, compat.string_types):
-> 1467             return getattr(self, func_or_funcs)(*args, **kwargs)
   1468 
   1469         if hasattr(func_or_funcs, '__iter__'):

pandas/core/groupby.pyc in mean(self)
    404         except Exception:  # pragma: no cover
    405             f = lambda x: x.mean(axis=self.axis)
--> 406             return self._python_agg_general(f)
    407 
    408     def median(self):

pandas/core/groupby.pyc in _python_agg_general(self, func, *args, **kwargs)
    540                 output[name] = self._try_cast(values[mask],result)
    541 
--> 542         return self._wrap_aggregated_output(output)
    543 
    544     def _wrap_applied_output(self, *args, **kwargs):

pandas/core/groupby.pyc in _wrap_aggregated_output(self, output, names)
   1529             return DataFrame(output, index=index, columns=names)
   1530         else:
-> 1531             return Series(output, index=index, name=self.name)
   1532 
   1533     def _wrap_applied_output(self, keys, values, not_indexed_same=False):

pandas/core/series.pyc in __init__(self, data, index, dtype, name, copy, fastpath)
    215                                        raise_cast_failure=True)
    216 
--> 217                 data = SingleBlockManager(data, index, fastpath=True)
    218 
    219         generic.NDFrame.__init__(self, data, fastpath=True)

pandas/core/internals.pyc in __init__(self, block, axis, do_integrity_check, fastpath)
   3297                 block = block[0]
   3298             if not isinstance(block, Block):
-> 3299                 block = make_block(block, axis, axis, ndim=1, fastpath=True)
   3300 
   3301         else:

pandas/core/internals.pyc in make_block(values, items, ref_items, klass, ndim, dtype, fastpath, placement)
   1805                 klass = ObjectBlock
   1806 
-> 1807     return klass(values, items, ref_items, ndim=ndim, fastpath=fastpath, placement=placement)
   1808 
   1809 # TODO: flexible with index=None and/or items=None

pandas/core/internals.pyc in __init__(self, values, items, ref_items, ndim, fastpath, placement)
     60         if len(items) != len(values):
     61             raise ValueError('Wrong number of items passed %d, indices imply %d'
---> 62                              % (len(items), len(values)))
     63 
     64         self.set_ref_locs(placement)

ValueError: Wrong number of items passed 31, indices imply 32
> pandas/core/internals.py(62)__init__()
     61             raise ValueError('Wrong number of items passed %d, indices imply %d'
---> 62                              % (len(items), len(values)))
     63 

@jtratner (Contributor) commented Nov 4, 2013

What happens if you call sort_index() on it? Does it crash?

@kevinastone (Contributor, Author)

Yeah, it still crashes. I think I was a bit misguided about the sort_index(); that seems to be the correct way to set it up, even though catching a TypeError is an odd way to test the input.

@jtratner (Contributor) commented Nov 4, 2013

What? I'm confused now.

@kevinastone (Contributor, Author)

It doesn't seem to be due to the sorting of the index. It looks like the grouper is creating an extra group for the day of the daylight-saving transition.

Even small sets that cross a day boundary fail, like this case with a range from 11/1 to 11/2, so it's not just daylight-saving related.

import pytz
import pandas as pd
import datetime

local_timezone = pytz.timezone('America/Los_Angeles')

start = datetime.datetime(year=2013, month=11, day=1, hour=0, minute=0, tzinfo=pytz.utc)
# 1 day later
end = datetime.datetime(year=2013, month=11, day=2, hour=0, minute=0, tzinfo=pytz.utc)

index = pd.date_range(start, end, freq='H')

series = pd.Series(index=index)
series = series.tz_convert(local_timezone)
series.resample('D', kind='period')
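The underlying timezone loss can be shown directly, independent of resample: converting a tz-aware index to daily periods and back to timestamps yields tz-naive stamps. A minimal sketch in current pandas syntax (variable names are mine, not from the pandas code):

```python
import pandas as pd

# A tz-aware hourly index like the one in the repro above:
# 2013-11-01 00:00 UTC .. 2013-11-02 00:00 UTC, viewed in LA time.
idx = pd.date_range("2013-11-01", "2013-11-02", freq="h",
                    tz="UTC").tz_convert("America/Los_Angeles")

# Converting to daily periods and back to timestamps drops the tz,
# so comparing these stamps against the tz-aware axis partitions
# the data on the wrong local-day boundaries.
labels = idx.to_period("D")
stamps = labels.to_timestamp()

print(idx.tz)     # America/Los_Angeles
print(stamps.tz)  # None -- the timezone information was lost
```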

@kevinastone (Contributor, Author)

Here's the problem:

https://github.com/pydata/pandas/blob/master/pandas/tseries/resample.py#L195

It's converting the periods back into timestamps, but it loses the timezone in the process, so it's partitioning incorrectly.

Also, it's hard-coded to a day ('D') frequency, when it really should use the input frequency (self.freq).

        end_stamps = (labels + 1).asfreq(self.freq, 's').to_timestamp()
        if axis.tzinfo:
            end_stamps = end_stamps.tz_localize(axis.tzinfo)
        bins = axis.searchsorted(end_stamps, side='left')
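The effect of re-localizing the end stamps can be exercised in isolation. This is a sketch in current pandas syntax, not the actual pandas internals; the axis and frequency are taken from the small 11/1–11/2 repro above:

```python
import pandas as pd

# Hourly tz-aware axis: 2013-11-01 00:00 UTC .. 2013-11-02 00:00 UTC,
# i.e. Oct 31 17:00 .. Nov 1 17:00 in America/Los_Angeles (25 points).
axis = pd.date_range("2013-11-01", "2013-11-02", freq="h",
                     tz="UTC").tz_convert("America/Los_Angeles")

# Daily period labels for the local days the axis covers.
labels = axis.to_period("D").unique()

# As in the patch: take the start of each *following* period...
end_stamps = (labels + 1).asfreq("h", "s").to_timestamp()
# ...and re-localize, since to_timestamp() returned naive stamps.
if axis.tz is not None:
    end_stamps = end_stamps.tz_localize(axis.tz)

# Left-searchsorted now splits on the correct local-day boundaries:
# 7 points fall on Oct 31 local, the remaining 18 on Nov 1.
bins = axis.searchsorted(end_stamps, side="left")
print(list(bins))  # [7, 25]
```

Without the tz_localize step, the naive end stamps would be compared against UTC-based positions and produce bin edges one timezone offset off.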

@jtratner (Contributor) commented Nov 4, 2013

Nice detective work there! Do you want to submit a pull request as well?

@cpcloud (Member) commented Nov 4, 2013

I think this might fix #4076 and #3609 too.

@jreback (Contributor) commented Nov 4, 2013

@kevinastone, can you try the test cases for #4076 and #3609 to see if your fix helps there too? (Put them in separate commits; if they don't work, they can easily be discarded, or you could do a separate PR.)

@kevinastone (Contributor, Author)

Negative on #4076; it's still adding an extra period:

In [8]: last_4_weeks_range = pandas.date_range(
   ...:     start=datetime.datetime(2001, 5, 4), periods=28)
   ...: last_4_weeks = pandas.DataFrame(
   ...:     [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
   ...:       'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
   ...:     [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
   ...:       'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
   ...:     index=last_4_weeks_range.append(last_4_weeks_range))
   ...: last_4_weeks.sort(inplace=True)

In [9]: last_4_weeks.resample('7D', how='sum')
Out[9]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  \
2001-05-04               280              560          700        1050   
2001-05-11               280              560          700        1050   
2001-05-18               280              560          700        1050   
2001-05-25               280              560          700        1050   
2001-06-01                 0                0            0           0   

            REST_KEY  
2001-05-04        21  
2001-05-11        21  
2001-05-18        21  
2001-05-25        21  
2001-06-01         0  

Affirmative on #3609:

In [28]: s = Series(range(100),index=date_range('20130101',freq='s',periods=100),dtype='float')

In [29]: s[10:30] = np.nan

In [30]: s.to_period().resample('T',kind='period')
Out[30]: 
2013-01-01 00:00    34.5
2013-01-01 00:01    79.5
Freq: T, dtype: float64

In [31]: s.resample('T',kind='period')
Out[31]: 
2013-01-01 00:00    34.5
2013-01-01 00:01    79.5
Freq: T, dtype: float64
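For reference, the #3609 check above can be reproduced in current pandas syntax (the how= argument used above has since been removed in favor of method chaining; this is a sketch, not the original test):

```python
import numpy as np
import pandas as pd

# Same data as the #3609 check: 100 seconds of float data with
# positions 10..29 set to NaN.
s = pd.Series(np.arange(100, dtype="float64"),
              index=pd.date_range("2013-01-01", freq="s", periods=100))
s.iloc[10:30] = np.nan

# Per-minute mean, skipping NaNs: two bins are expected.
out = s.resample("min").mean()
print(out)
# First minute: mean of 0..9 and 30..59 -> 1380 / 40 = 34.5
# Second minute: mean of 60..99 -> 79.5
```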

@jreback (Contributor) commented Nov 4, 2013

Great, so add that as a test to the PR (#3609).

@kevinastone (Contributor, Author)

Done
