
Failing statsmodels tests on pandas master vs. 0.12.0 #5312


Closed
jseabold opened this issue Oct 24, 2013 · 18 comments · Fixed by #5327
Labels: Testing (pandas testing functions or related to the test suite)

Comments

@jseabold
Contributor

https://launchpadlibrarian.net/154849014/buildlog_ubuntu-trusty-i386.statsmodels_0.6.0~ppa18~revno-1486~ubuntu14.04.1_UPLOADING.txt.gz

======================================================================
ERROR: statsmodels.iolib.tests.test_foreign.test_genfromdta_datetime
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/dist-packages/numpy/testing/decorators.py", line 146, in skipper_func
    return f(*args, **kwargs)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/iolib/tests/test_foreign.py", line 139, in test_genfromdta_datetime
    pandas=True)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/iolib/foreign.py", line 1049, in genfromdta
    args=(fmtlist[i],))
  File "/usr/lib/pymodules/python2.7/pandas/core/series.py", line 1978, in apply
    return self._constructor(mapped, index=self.index).__finalize__(self)
  File "/usr/lib/pymodules/python2.7/pandas/core/series.py", line 217, in __init__
    data = SingleBlockManager(data, index, fastpath=True)
  File "/usr/lib/pymodules/python2.7/pandas/core/internals.py", line 3295, in __init__
    block = make_block(block, axis, axis, ndim=1, fastpath=True)
  File "/usr/lib/pymodules/python2.7/pandas/core/internals.py", line 1806, in make_block
    return klass(values, items, ref_items, ndim=ndim, fastpath=fastpath, placement=placement)
  File "/usr/lib/pymodules/python2.7/pandas/core/internals.py", line 1412, in __init__
    values = tslib.cast_to_nanoseconds(values)
  File "tslib.pyx", line 1453, in pandas.tslib.cast_to_nanoseconds (pandas/tslib.c:22283)
TypeError: Cannot change data-type for object array.

======================================================================
ERROR: statsmodels.tsa.tests.test_arima.test_arma_predict_indices
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/tsa/tests/test_arima.py", line 975, in test_arma_predict_indices
    _check_start(*((model,) + case))
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/tsa/tests/test_arima.py", line 921, in _check_start
    start = model._get_predict_start(given, dynamic)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/tsa/arima_model.py", line 562, in _get_predict_start
    start = super(ARMA, self)._get_predict_start(start)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/tsa/base/tsa_model.py", line 130, in _get_predict_start
    self._set_predict_start_date(start)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/tsa/base/tsa_model.py", line 105, in _set_predict_start_date
    start, self.data.freq)
  File "/build/buildd/statsmodels-0.6.0~ppa18~revno/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/tsa/base/datetools.py", line 78, in _date_from_idx
    return d1 + idx * _freq_to_pandas[freq]
  File "/usr/lib/pymodules/python2.7/pandas/tseries/offsets.py", line 193, in __radd__
    return self.__add__(other)
  File "/usr/lib/pymodules/python2.7/pandas/tseries/offsets.py", line 188, in __add__
    return self.apply(other)
  File "/usr/lib/pymodules/python2.7/pandas/tseries/offsets.py", line 1537, in apply
    return Timestamp(result)
  File "tslib.pyx", line 153, in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:5375)
  File "tslib.pyx", line 773, in pandas.tslib.convert_to_tsobject (pandas/tslib.c:13113)
  File "tslib.pyx", line 858, in pandas.tslib._check_dts_bounds (pandas/tslib.c:14219)
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2317-12-31 00:00:00

Did something change?

@jreback
Contributor

jreback commented Oct 24, 2013

when did this start failing?

@jseabold
Contributor Author

Somewhere between master and 0.12.0.

@jreback
Contributor

jreback commented Oct 25, 2013

What I mean is that I tested statsmodels 0.5.0 against master maybe a month ago and these were fine.

@jseabold
Contributor Author

None of this code has changed in statsmodels. I see these failures with 0.5.0 against pandas master.

@jseabold
Contributor Author

Something in numpy?

numpy: 1.9.0.dev-b5dab6d (/usr/local/lib/python2.7/dist-packages/numpy)

@jseabold
Contributor Author

Sorry, I've been bug hunting all week and don't have time to track this one down at the moment.

@jreback
Contributor

jreback commented Oct 25, 2013

1st issue:

I think your _stata_elapsed_date_to_datetime with a format of %ty converts a number
like 2 to datetime(2,1,1), which is invalid. Not sure what it is supposed to be, 2000?
You either need some validation there, or you can catch the exception.

The pandas converters are a little stricter about how they convert, which is why I think this is raising.

In statsmodels/iolib/foreign.py:genfromdta:

1047                    col = data.columns[col]
1048                    data[col] = data[col].apply(_stata_elapsed_date_to_datetime,
1049 ->                         args=(fmtlist[i],))
1050        elif convert_dates:
1051            #date_cols = np.where(map(lambda x : x in _date_formats,
1052            #                                                    fmtlist))[0]
1053            # make the dtype for the datetime types
1054            cols = np.where(map(lambda x : x in _date_formats, fmtlist))[0]
(Pdb) p fmtlist[i]
'%ty'
(Pdb) p data[col]
0    2010
1       2
Name: yearly_date, dtype: int32

@jseabold
Contributor Author

No, this was deliberate in the test suite, since this is a valid date in Stata epoch time. It used to round-trip fine from this:

from datetime import datetime
datetime(2, 1,1)
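
To make the disagreement concrete, here is a minimal sketch, assuming a %ty value is simply a calendar year mapped to January 1 of that year, as the datetime(2, 1, 1) above suggests. The helper name ty_to_datetime is hypothetical and is not the statsmodels function; it only shows that the standard library accepts year 2.

```python
from datetime import datetime, MINYEAR


def ty_to_datetime(year):
    """Hypothetical helper: treat a Stata %ty value as a calendar year."""
    return datetime(year, 1, 1)


# The two values shown in the pdb output for the yearly_date column.
values = [2010, 2]
converted = [ty_to_datetime(v) for v in values]

for value, dt in zip(values, converted):
    print(value, "->", dt.isoformat())

# The standard library allows any year from MINYEAR upward.
print("smallest year datetime accepts:", MINYEAR)
```

So the value itself is a legal datetime; whether a stricter container can hold it is a separate question.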

@jreback
Contributor

jreback commented Oct 25, 2013

The 2nd is an out-of-bounds date (which, again, was not actually checked before: in 0.12 it was possible for an out-of-bounds date to slip through). You can catch this error.

> /home/vagrant/statsmodels/statsmodels/tsa/base/tsa_model.py(105)_set_predict_start_date()
-> start, self.data.freq)
(Pdb) l
100                 return
101             if start > len(dates):
102                 raise ValueError("Start must be <= len(endog)")
103             if start == len(dates):
104                 self.data.predict_start = datetools._date_from_idx(dates[-1],
105  ->                                                     start, self.data.freq)
106             elif start < len(dates):
107                 self.data.predict_start = dates[start]
108             else:
109                 raise ValueError("Start must be <= len(dates)")
110  
(Pdb) p dates
<class 'pandas.tseries.index.DatetimeIndex'>
[1700-12-31 00:00:00, ..., 2008-12-31 00:00:00]
Length: 309, Freq: None, Timezone: None
(Pdb) d
> /home/vagrant/statsmodels/statsmodels/tsa/base/datetools.py(78)_date_from_idx()
-> return d1 + idx * _freq_to_pandas[freq]
(Pdb) d
> /usr/local/lib/python2.7/dist-packages/pandas-0.12.0_957_g8941429-py2.7-linux-i686.egg/pandas/tseries/offsets.py(193)__radd__()
-> return self.__add__(other)
(Pdb) d
> /usr/local/lib/python2.7/dist-packages/pandas-0.12.0_957_g8941429-py2.7-linux-i686.egg/pandas/tseries/offsets.py(188)__add__()
-> return self.apply(other)
(Pdb) d
> /usr/local/lib/python2.7/dist-packages/pandas-0.12.0_957_g8941429-py2.7-linux-i686.egg/pandas/tseries/offsets.py(1537)apply()
-> return Timestamp(result)
(Pdb) p result
datetime.datetime(2317, 12, 31, 0, 0)
(Pdb) Timestamp(result)
*** OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2317-12-31 00:00:00
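
Why 2317-12-31 is out of bounds can be sketched with the standard library alone, assuming the timestamp is stored as a signed 64-bit count of nanoseconds since 1970-01-01 (which is what "nanosecond timestamp" in the error message implies). The names NS_MIN, NS_MAX and fits_in_int64_ns are made up for this illustration, and the bounds are rounded to microseconds because datetime has no finer resolution.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Largest span a signed 64-bit nanosecond counter can represent,
# rounded down to whole microseconds for use with timedelta.
MAX_DELTA = timedelta(microseconds=(2**63 - 1) // 1000)

NS_MIN = EPOCH - MAX_DELTA  # approximate lower bound
NS_MAX = EPOCH + MAX_DELTA  # approximate upper bound


def fits_in_int64_ns(dt):
    """Hypothetical check: can dt be stored as int64 nanoseconds since EPOCH?"""
    return NS_MIN <= dt <= NS_MAX


last_date = datetime(2008, 12, 31)
# 309 year-end steps past the last observation, as in the failing test.
predicted = datetime(last_date.year + 309, 12, 31)

print("representable range:", NS_MIN, "to", NS_MAX)
print(last_date, fits_in_int64_ns(last_date))
print(predicted, fits_in_int64_ns(predicted))
```

The first date in the index (1700-12-31) and the last (2008-12-31) both fit; only the computed start date falls past the upper bound.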

@jreback
Contributor

jreback commented Oct 25, 2013

ok....1st is a bug....have a PR to fix, easy

2nd is more troublesome: you are adding an offset to a Timestamp, which then yields an out-of-bounds Timestamp. We normally raise on this. You could catch it and just use it as a datetime if you want.

(Pdb) Timestamp(result)
*** OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2317-12-31 00:00:00
(Pdb) p result
datetime.datetime(2317, 12, 31, 0, 0)

@jseabold
Contributor Author

I'm trying to figure out if this second test was always broken. AFAIK, this is neither the expected result nor what it used to do, but I'm not certain yet. I need to build everything in a virtualenv to test.

@jreback
Contributor

jreback commented Oct 25, 2013

ok....going to merge the first fix, but leave this issue open...lmk

@jseabold
Contributor Author

Yeah, this one used to work too.

>>> import pandas as pd
>>> from datetime import datetime
>>> from statsmodels.tsa.base.datetools import _freq_to_pandas
>>> pd.Timestamp(datetime(2008, 12, 31)) + 309*_freq_to_pandas['A']
datetime.datetime(2317, 12, 31, 0, 0)

@jseabold
Contributor Author

>>> pd.Timestamp(datetime(2008, 12, 31)) + 309*pd.offsets.YearEnd()
datetime.datetime(2317, 12, 31, 0, 0)

@jreback
Contributor

jreback commented Oct 25, 2013

@jtratner, cc @Cancan01

Please take a look here. The reason this breaks is that I put in a change to wrap the result of applying an offset in a Timestamp; I think @Cancan01 had a test breaking because it assumed the result was a Timestamp.

Easy to fix: have it return a Timestamp, and if the result is out of bounds, return the datetime instead.

any problems with that?

@jtratner
Contributor

Timestamp is a datetime but with additional methods, right? So, if you assume you'll get a datetime, everything you could do with that datetime you can do with the timestamp?

@jreback
Contributor

jreback commented Oct 25, 2013

@jtratner that is true. I am not sure what the guarantees were before, but @Cancan01's PR had a failing test that assumed it was getting a Timestamp, I think because of a repeated offset apply, e.g. something like

Timestamp + offset + other_offset, so you need them to really return Timestamps to make it consistent. That said, a Timestamp IS a datetime, except it CANNOT hold out-of-bounds data.

@jtratner
Contributor

EAFP - try to be nice and keep it within range and if not return a datetime and let the failure happen later - feels the same as automatically converting an integer column to float if you add nan or add 0.1 to it, or converting indexes on slicing, etc.
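
The fallback being discussed can be sketched as follows. This is only an illustration under stated assumptions, not the pandas change itself: StrictStamp is a hypothetical stand-in for Timestamp (a datetime subclass that refuses values outside the int64-nanosecond range), and wrap_result is a made-up name for the step that wraps an offset result.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
# Span of a signed 64-bit nanosecond counter, rounded down to microseconds.
MAX_DELTA = timedelta(microseconds=(2**63 - 1) // 1000)


class StrictStamp(datetime):
    """Hypothetical stand-in for Timestamp: a datetime with a bounded range."""

    @classmethod
    def from_datetime(cls, dt):
        if abs(dt - EPOCH) > MAX_DELTA:
            raise OverflowError(f"out of bounds nanosecond timestamp: {dt}")
        return cls(dt.year, dt.month, dt.day,
                   dt.hour, dt.minute, dt.second, dt.microsecond)


def wrap_result(result):
    """EAFP: try the strict type first, fall back to the plain datetime."""
    try:
        return StrictStamp.from_datetime(result)
    except OverflowError:
        return result


in_range = wrap_result(datetime(2008, 12, 31))
out_of_range = wrap_result(datetime(2317, 12, 31))

print(type(in_range).__name__, in_range)
print(type(out_of_range).__name__, out_of_range)
```

Because the stand-in subclasses datetime, callers that only expect a datetime keep working in both cases; only code that relies on the extra methods would notice the fallback.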
