pd.stats.api.ols inconsistent estimates #6874

Closed
edwinhu opened this issue Apr 12, 2014 · 11 comments

@edwinhu

edwinhu commented Apr 12, 2014

I am running into an issue trying to run OLS using pandas 0.13.1.

Here is a simple example: I want to regress a variable on itself, in this case excess returns. The intercept should be 0, and the coefficient should be 1. pandas provides the wrong estimates, while statsmodels gives the correct estimates.

This is not due to the silly regression specification, as I have noticed the pandas.ols estimates are inconsistent for other specifications as well.

Has anyone else encountered this problem?

import pandas as pd
import statsmodels.formula.api as sm

In [1]: pd.ols(y=test.exret,x=test.exret).beta
Out[1]: 
x            0.003107
intercept    0.006438
dtype: float64

In [2]: sm.ols(formula="exret ~ exret", data=test).fit().params
Out[2]: 
Intercept   -3.469447e-18
exret        1.000000e+00
dtype: float64

In [3]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.11.0-19-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: 0.9.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None
@jreback
Contributor

jreback commented Apr 12, 2014

seems ok to me

In [7]: x = Series(np.random.randn(100))

In [8]: pd.ols(y=x,x=x)
Out[8]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         100
Number of Degrees of Freedom:   2

R-squared:         1.0000
Adj R-squared:     1.0000

Rmse:              0.0000

F-stat (1, 98):        inf, p-value:     0.0000

Degrees of Freedom: model 1, resid 98

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     1.0000     0.0000 45335499035463352.00     0.0000     1.0000     1.0000
     intercept     0.0000     0.0000       0.65     0.5184    -0.0000     0.0000
---------------------------------End of Summary---------------------------------

In [9]: pd.ols(y=x,x=x).beta
Out[9]: 
x            1.000000e+00
intercept    1.277919e-17
dtype: float64
In [12]: sm.ols(formula="x ~ x", data=x).fit().params
Out[12]: 
Intercept    3.237783e-17
x            1.000000e+00
dtype: float64

@edwinhu
Author

edwinhu commented Apr 12, 2014

I see the same thing using your example. However it seems the issue occurs when there are row labels.

Is this the expected behavior?

In [1]: a = pd.Series(np.random.randn(5),index=['a','a','a','a','a']); pd.ols(y=a,x=a)
Out [1]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         25
Number of Degrees of Freedom:   2

R-squared:        -0.0000
Adj R-squared:    -0.0435

Rmse:              0.5372

F-stat (1, 23):    -0.0000, p-value:     1.0000

Degrees of Freedom: model 1, resid 23

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.0000     0.2085       0.00     1.0000    -0.4087     0.4087
     intercept     0.0832     0.1088       0.76     0.4523    -0.1301     0.2965
---------------------------------End of Summary---------------------------------

In [2]: b = a.reset_index()

In [3]: pd.ols(y=b[0],x=b[0])
Out [3]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:         1.0000
Adj R-squared:     1.0000

Rmse:              0.0000

F-stat (1, 3):        inf, p-value:     0.0000

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     1.0000     0.0000 11982830741228190.00     0.0000     1.0000     1.0000
     intercept     0.0000     0.0000       0.64     0.5693    -0.0000     0.0000
---------------------------------End of Summary---------------------------------


@jreback
Contributor

jreback commented Apr 12, 2014

having duplicate labels rarely makes sense
how would you expect it to align the data?

should prob raise an error with a duplicate index

I don't know what statsmodels does in this case
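
One possible reading of the 25 observations reported above is that aligning two 5-row inputs on a repeated label pairs every row with every other row. The sketch below is only an illustration under that assumption: it uses a plain `merge` on a hypothetical `key` column plus NumPy least squares, not `pd.ols` itself, and whether `pd.ols` aligned its inputs in exactly this way is not confirmed in this thread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.standard_normal(5)

# Two 5-row tables sharing one repeated key, standing in for y and x.
# The column name "key" is made up for this illustration.
left = pd.DataFrame({"key": ["a"] * 5, "y": values})
right = pd.DataFrame({"key": ["a"] * 5, "x": values})

# Joining on a non-unique key pairs every left row with every right row.
paired = left.merge(right, on="key")
n_obs = len(paired)

# Ordinary least squares of y on [x, 1] over the paired rows.
design = np.column_stack([paired["x"].to_numpy(), np.ones(n_obs)])
coef, *_ = np.linalg.lstsq(design, paired["y"].to_numpy(), rcond=None)
slope, intercept = coef
```

Over the full set of pairs, x and y are uncorrelated by construction, so the fitted slope comes out at zero and the intercept at the sample mean, which has the same shape as the coefficients in the summary above.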

@jreback
Contributor

jreback commented Apr 12, 2014

@jseabold does patsy/sm align on the index?

@edwinhu
Author

edwinhu commented Apr 12, 2014

I noticed this issue when using groupby and ols with an indexed DataFrame.

GroupBy splits have "duplicate" row labels. I noticed this issue when applying pd.ols to a GroupBy object.

It seems that sm correctly ignores the duplicate row labels.

@jreback
Contributor

jreback commented Apr 12, 2014

would be helpful to show some code
groupby in general won't produce a duplicate indexed frame

@jseabold
Contributor

@edwinhu
Author

edwinhu commented Apr 12, 2014

Sure. My data is organized by id and date. I have the dataframe indexed by id. It looks something like this (without the date column):

a = pd.Series(np.random.randn(5),index=['a','a','a','a','a'])
b = pd.Series(np.random.randn(5),index=['b','b','b','b','b'])

x = a.append(b)

grp = x.groupby(level=0)

grp.apply(lambda x: pd.ols(y=x,x=x).beta)

a  x            0.000000e+00
   intercept    9.327435e-02
b  x           -8.673617e-17
   intercept    3.037757e-01
dtype: float64

sm.ols(formula="x ~ x",data=x).fit().params

Intercept    6.245005e-17
x            1.000000e+00
dtype: float64
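
A minimal sketch of a per-group fit that sidesteps the labels entirely, assuming the goal is one regression per id. The helper `fit_line` is hypothetical and uses `np.linalg.lstsq` on raw values rather than `pd.ols` or statsmodels; `pd.concat` stands in for `a.append(b)`.

```python
import numpy as np
import pandas as pd

def fit_line(y, x):
    """Least-squares slope and intercept from raw values, ignoring labels."""
    design = np.column_stack([np.asarray(x, dtype=float), np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return {"x": coef[0], "intercept": coef[1]}

rng = np.random.default_rng(0)
a = pd.Series(rng.standard_normal(5), index=["a"] * 5)
b = pd.Series(rng.standard_normal(5), index=["b"] * 5)
x = pd.concat([a, b])

# One fit per group. Values are paired by position, so the repeated
# labels never take part in any alignment.
betas = pd.DataFrame.from_dict(
    {key: fit_line(grp, grp) for key, grp in x.groupby(level=0)},
    orient="index",
)
```

Here `betas` has one row per group, with a slope of 1 and an intercept of 0 up to floating-point error, as expected when regressing a variable on itself.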

@jreback
Contributor

jreback commented Mar 8, 2015

referring to statsmodels as this functionality is not supported (not deprecated either as of yet).

@jreback jreback closed this as completed Mar 8, 2015
@jreback jreback added the Stats label Mar 8, 2015
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Mar 8, 2015
@rsdenijs

What is the status of this issue?
Duplicate indices silently mess up the number of observations in the resulting model. I guess a check for df.index.is_unique would solve this?
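
A rough sketch of what such a check could look like on the caller's side. `require_unique_index` is a hypothetical helper, not part of pandas; only `Index.is_unique`, `Index.duplicated`, and `reset_index` are existing pandas calls.

```python
import numpy as np
import pandas as pd

def require_unique_index(obj, name="input"):
    """Hypothetical guard: raise instead of silently aligning on duplicates."""
    if not obj.index.is_unique:
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"{name} has duplicate index labels: {dupes}")
    return obj

a = pd.Series(np.random.randn(5), index=["a"] * 5)

# The duplicated labels are rejected before any regression is attempted.
try:
    require_unique_index(a, name="y")
    rejected = False
except ValueError:
    rejected = True

# Dropping the labels gives a fresh 0..n-1 index that passes the check.
clean = require_unique_index(a.reset_index(drop=True), name="y")
```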

@jorisvandenbossche
Member

@rsdenijs As @jreback pointed out in his last comment, this functionality is no longer supported in pandas (it will also be effectively deprecated in the coming release, see #11898). So the status of this issue is that we do not plan to take any action on this.

Can you use statsmodels for your use case? (for OLS everything should be in statsmodels, for the other functions in pandas there are still some things missing in statsmodels: statsmodels/statsmodels#2745)
