pd.stats.api.ols inconsistent estimates #6874

edwinhu · 2014-04-12T00:20:27Z

I am running into an issue trying to run OLS using pandas 0.13.1.

Here is a simple example: I want to regress a variable on itself, in this case excess returns. The intercept should be 0, and the coefficient should be 1. pandas provides the wrong estimates, while statsmodels gives the correct estimates.

This is not due to the silly regression specification, as I have noticed the pandas.ols estimates are inconsistent for other specifications as well.

Has anyone else encountered this problem?

import pandas as pd
import statsmodels.formula.api as sm

In [1]: pd.ols(y=test.exret,x=test.exret).beta
Out[1]: 
x            0.003107
intercept    0.006438
dtype: float64

In [2]: sm.ols(formula="exret ~ exret", data=test).fit().params
Out[2]: 
Intercept   -3.469447e-18
exret        1.000000e+00
dtype: float64

In [3]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.11.0-19-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: 0.9.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None

jreback · 2014-04-12T00:26:35Z

seems ok to me

In [7]: x = Series(np.random.randn(100))

In [8]: pd.ols(y=x,x=x)
Out[8]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         100
Number of Degrees of Freedom:   2

R-squared:         1.0000
Adj R-squared:     1.0000

Rmse:              0.0000

F-stat (1, 98):        inf, p-value:     0.0000

Degrees of Freedom: model 1, resid 98

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     1.0000     0.0000 45335499035463352.00     0.0000     1.0000     1.0000
     intercept     0.0000     0.0000       0.65     0.5184    -0.0000     0.0000
---------------------------------End of Summary---------------------------------

In [9]: pd.ols(y=x,x=x).beta
Out[9]: 
x            1.000000e+00
intercept    1.277919e-17
dtype: float64

In [12]: sm.ols(formula="x ~ x", data=x).fit().params
Out[12]: 
Intercept    3.237783e-17
x            1.000000e+00
dtype: float64

edwinhu · 2014-04-12T01:03:31Z

I see the same thing using your example. However it seems the issue occurs when there are row labels.

Is this the expected behavior?

In [1]: a = pd.Series(np.random.randn(5),index=['a','a','a','a','a'])

In [1]: pd.ols(y=a,x=a)
Out [1]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         25
Number of Degrees of Freedom:   2

R-squared:        -0.0000
Adj R-squared:    -0.0435

Rmse:              0.5372

F-stat (1, 23):    -0.0000, p-value:     1.0000

Degrees of Freedom: model 1, resid 23

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.0000     0.2085       0.00     1.0000    -0.4087     0.4087
     intercept     0.0832     0.1088       0.76     0.4523    -0.1301     0.2965
---------------------------------End of Summary---------------------------------

In [2]: b = a.reset_index()

In [3]: pd.ols(y=b[0],x=b[0])
Out [3]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:         1.0000
Adj R-squared:     1.0000

Rmse:              0.0000

F-stat (1, 3):        inf, p-value:     0.0000

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     1.0000     0.0000 11982830741228190.00     0.0000     1.0000     1.0000
     intercept     0.0000     0.0000       0.64     0.5693    -0.0000     0.0000
---------------------------------End of Summary---------------------------------
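
The 25 observations reported above for a 5-row input are what a label-based join produces when all five rows share one label: each row on the left pairs with each row on the right. The sketch below illustrates that effect with `DataFrame.join` and `np.polyfit`. It is not a copy of what `pd.ols` does internally, which this thread does not show. The fixed values are made up for reproducibility.

```python
import numpy as np
import pandas as pd

# Five observations that all share the row label 'a' (illustrative fixed values).
a = pd.Series([0.1, -0.4, 0.7, 0.2, -0.3], index=["a"] * 5)

# Label-based join: every 'a' row on the left matches every 'a' row on the right.
joined = pd.DataFrame({"y": a}).join(pd.DataFrame({"x": a}))
n_obs = len(joined)  # 5 * 5 label matches

# Ordinary least squares on the joined pairs (degree-1 polynomial fit).
slope, intercept = np.polyfit(joined["x"].to_numpy(), joined["y"].to_numpy(), 1)

print(n_obs, slope, intercept, a.mean())
```

In the cross-joined data x and y are uncorrelated by construction, so the slope collapses to zero and the intercept becomes the sample mean, the same pattern as the estimates in the first summary above.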

jreback · 2014-04-12T01:11:55Z

having duplicate labels rarely makes sense
how would you expect it to align the data?

should probably raise an error with a duplicate index

I don't know what statsmodels does in this case

jreback · 2014-04-12T01:12:43Z

@jseabold does patsy/sm align on the index?

edwinhu · 2014-04-12T01:19:51Z

I noticed this issue when using groupby and ols with an indexed DataFrame.

The GroupBy splits have "duplicate" row labels, which is how I ran into
this when applying pd.ols to a GroupBy object.

It seems that sm correctly ignores the duplicate row labels.

jreback · 2014-04-12T01:29:44Z

would be helpful to show some code
groupby in general won't produce a duplicate indexed frame

jseabold · 2014-04-12T01:47:24Z

We check for alignment.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/data.py#L308

edwinhu · 2014-04-12T01:51:07Z

Sure. My data is organized by id and date. I have the dataframe indexed by id. It looks something like this (without the date column):

a = pd.Series(np.random.randn(5),index=['a','a','a','a','a'])
b = pd.Series(np.random.randn(5),index=['b','b','b','b','b'])

x = a.append(b)

grp = x.groupby(level=0)

grp.apply(lambda x: pd.ols(y=x,x=x).beta)

a  x            0.000000e+00
   intercept    9.327435e-02
b  x           -8.673617e-17
   intercept    3.037757e-01
dtype: float64

sm.ols(formula="x ~ x",data=x).fit().params

Intercept    6.245005e-17
x            1.000000e+00
dtype: float64
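
A possible workaround for the per-group case is to fit each group by position rather than by label, so the repeated labels never take part in any alignment. The sketch below uses `np.polyfit` for the fit. `fit_line` is a hypothetical helper and the data values are made up. It is an illustration, not a drop-in replacement for `pd.ols`.

```python
import numpy as np
import pandas as pd

def fit_line(y, x):
    """Hypothetical helper: OLS of y on x plus an intercept, pairing rows by position."""
    slope, intercept = np.polyfit(x.to_numpy(), y.to_numpy(), 1)
    return {"x": slope, "intercept": intercept}

a = pd.Series([0.1, -0.4, 0.7, 0.2, -0.3], index=["a"] * 5)
b = pd.Series([1.2, 0.5, -0.8, 0.3, 0.9], index=["b"] * 5)
x = pd.concat([a, b])  # pd.concat stands in for Series.append, which pandas 2.0 removed

# One fit per group; the group key becomes the row label of the result.
betas = pd.DataFrame({key: fit_line(g, g) for key, g in x.groupby(level=0)}).T
print(betas)
```

Because each group is converted to a plain array before fitting, regressing a group on itself gives a slope of 1 and an intercept of 0 up to floating-point error.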

jreback · 2015-03-08T16:47:48Z

referring to statsmodels as this functionality is not supported (not deprecated either as of yet).

rsdenijs · 2016-02-12T10:02:17Z

What is the status of this issue?
Duplicate indices silently mess up the number of observations in the resulting model. I guess a check for df.index.is_unique would solve this?
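
A minimal sketch of such a check, written as a user-side guard to call before fitting. `require_unique_index` is a hypothetical name, not a pandas function. It relies only on `Index.is_unique` and `Index.duplicated`.

```python
import pandas as pd

def require_unique_index(obj, name="input"):
    """Hypothetical guard: raise if `obj` carries duplicate row labels."""
    if not obj.index.is_unique:
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"{name} has duplicate index labels: {dupes}")
    return obj

dup = pd.Series([1.0, 2.0, 3.0], index=["a", "a", "b"])

# The guard rejects the duplicated labels instead of letting them through silently.
try:
    require_unique_index(dup, name="y")
    guard_error = None
except ValueError as exc:
    guard_error = str(exc)

# Dropping the labels (as with reset_index earlier in the thread) passes the guard.
clean = require_unique_index(dup.reset_index(drop=True), name="y")
```

Raising early turns a silently inflated observation count into an explicit error at the call site.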

jorisvandenbossche · 2016-02-12T11:28:44Z

@rsdenijs As @jreback pointed out in his last comment, this is not supported anymore in pandas (they will also be effectively deprecated in the coming release, see #11898). So the status of this issue is that we do not plan to take any action on this.

Can you use statsmodels for your use case? (for OLS everything should be in statsmodels, for the other functions in pandas there are still some things missing in statsmodels: statsmodels/statsmodels#2745)

jreback closed this as completed Mar 8, 2015

jreback added the Stats label Mar 8, 2015

jorisvandenbossche added this to the No action milestone Mar 8, 2015
