MIGRATE: move stats code to statsmodels / deprecate in pandas #6077
cc @jseabold cc @josef-pkt
Any thoughts on requirements/wishes with respect to backwards compatibility, or an easy transition to a new interface, from the perspective of pandas users?

From what I remember when I looked at this years ago (I don't know how much has changed):

- OLS is just a moving/expanding wrapper; all we would need on top of what statsmodels already has is the slicing code.
- PLM is fixed-effects panel data. Adding the fixed-effect dummies is quite easy, but we have the within estimator in the panel PR, which demeans and avoids the explicit fixed-effect dummies. Panel HAC robust standard errors are available but are not the default and most likely use a different degrees-of-freedom correction (I never compared). And the interface to robust standard errors in statsmodels is still not convenient enough.
- VAR in pandas: I don't know anything about that. From a quick look it needs a check to see whether there are any differences from the statsmodels version.
- fama_macbeth: I never managed to figure it out in pandas (and I don't really know what it's supposed to do). Test coverage, last time I looked, might have been non-existent; from a quick look at the source, there are still only a few smoke tests. So fama_macbeth needs verifying unit tests before we would merge it into statsmodels.

The rest is mainly wrapping around existing, verified statsmodels code.
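To illustrate the "OLS is just a moving/expanding wrapper" point: the moving-window variant is an ordinary regression refit on successive window slices. A minimal sketch of that slicing loop, under the assumption of a single regressor with intercept and using `numpy.linalg.lstsq` in place of a statsmodels fit (the function name is illustrative, not any library's API):

```python
import numpy as np

def rolling_ols_beta(y, x, window):
    """Refit OLS (intercept + one regressor) on each trailing window.

    Returns the slope for each position, NaN until `window` points exist.
    The slicing loop is essentially all that the old pandas moving-OLS
    wrapper added on top of a plain regression.
    """
    n = len(y)
    betas = np.full(n, np.nan)
    for t in range(window - 1, n):
        ys = y[t - window + 1 : t + 1]
        xs = x[t - window + 1 : t + 1]
        X = np.column_stack([np.ones(window), xs])
        coef, *_ = np.linalg.lstsq(X, ys, rcond=None)
        betas[t] = coef[1]
    return betas

# On exactly linear data, every windowed fit recovers the true slope.
x = np.arange(20.0)
y = 3.0 * x + 1.0
betas = rolling_ols_beta(y, x, window=5)
```

An expanding version is the same loop with the window start pinned at 0.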
@josef-pkt not really sure what, if anything, is actually used anymore. I personally used to use some of the OLS stuff but switched to statsmodels a while back. I think it's probably easiest/best for you guys to take what you want / think should be in statsmodels if you don't have it and it makes sense. pandas can easily put up a deprecation warning, probably in 0.14, but not actually remove anything for a while after. I don't think it makes sense to support any of the listed items going forward in pandas. A possibility is to provide wrappers back to statsmodels (though your API is slightly different), but I'm not convinced we should do this. If all of this code is deprecated and removed, pandas could actually remove statsmodels as a dep entirely; pandas obviously would remain a dep of statsmodels. That would remove the circular dep in any event.
With respect to the deprecation comments in #5884 and here: statsmodels doesn't support any of the moving/expanding OLS parts out of the box, although it's just looping anyway. I don't think it would be a lot of work to transfer this, but it depends on whether there is really enough demand to be worth the effort. Except for the untested fama_macbeth, it's mainly a maintenance issue, since the pandas integration parts are so far almost exclusively maintained by Skipper. For all the statistics/econometrics there should already (or soon) be equivalent statsmodels functionality, except for maybe a few details.
statsmodels is emphatically not a dep of pandas. It never has been, and there never has been one. We do allow ourselves to use statsmodels as a dependency for testing and cross-validation only; that's all, AFAIK. There is no user-facing code in pandas that relies on statsmodels, and there shouldn't be.
FWIW, from people I've talked to and comments I've seen, I do think some of this code is used in production environments, so we definitely need a deprecation cycle. I'm not too concerned about API compatibility provided the deprecation cycle is long enough (time-based, maybe, rather than release-based, since pandas releases much faster than statsmodels). My thinking was always to keep a similar, separate rolling OLS function and to fold the panel stuff into the to-be-merged statsmodels/statsmodels#1133 with the new API there.
I misspoke about the dep: it's purely optional in pandas, of course. It probably makes sense to put up a deprecation warning in 0.14 (release looking at ~3 months out). @jseabold, if that looks too aggressive then we can defer the deprecation to 0.15; removal can be 1-2 releases after.
I hope you guys don't mind this note of interjection. I noticed the following:

> I never managed to figure out fama_macbeth in pandas (and I don't really know what it's supposed to do).

I use pandas and I use Fama-MacBeth, so I thought I could shed some light. Fama-MacBeth is still a well-used econometric method in academic finance (less so than it was 10 years ago, but still pretty common), most of all on the asset-pricing research side of things. It's a cross-sectional regression method meant to handle specifications where there is cross-correlation among the entities (usually stocks). The closest analog to Fama-MacBeth is a pooled regression with time fixed effects and standard errors adjusted for clustering on time as well (they are not the same econometrically, but they tend to give similar results and are meant to overcome similar problems with the data). Here is a typical case: a panel with all US stocks over time (say monthly 1963:07 - 2013:03):
I want to look at the cross-sectional relation between future returns and market cap and book-to-market (pretty close to what Fama and French did in 1992):

r_{i,t+1} = a + b1*log(mcap_{it}) + b2*log(BM_{it}) + e_{it}

If I estimate with Fama-MacBeth, there are two stages:
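To make the two stages concrete, here is a hedged numpy sketch of the estimator on long-format data (names and shapes are illustrative, not the pandas API): stage one runs one cross-sectional OLS per period; stage two averages the per-period coefficients and takes t-stats from the time series of those coefficients.

```python
import numpy as np

def fama_macbeth(y, x, periods):
    """Two-stage Fama-MacBeth on long-format data.

    y, x: 1-D arrays (one row per stock-month); periods: same-length
    array of period labels.  Returns (mean coefficients, t-stats) for
    [intercept, slope].
    """
    coefs = []
    for t in np.unique(periods):
        m = periods == t
        X = np.column_stack([np.ones(m.sum()), x[m]])
        # Stage 1: cross-sectional OLS for period t
        coefs.append(np.linalg.lstsq(X, y[m], rcond=None)[0])
    coefs = np.array(coefs)                           # shape (T, 2)
    mean = coefs.mean(axis=0)                         # Stage 2: average
    se = coefs.std(axis=0, ddof=1) / np.sqrt(len(coefs))
    return mean, mean / se

# Synthetic panel: 60 months x 25 stocks, true slope 0.5.
rng = np.random.default_rng(0)
periods = np.repeat(np.arange(60), 25)
x = rng.normal(size=periods.size)
y = 0.5 * x + rng.normal(scale=2.0, size=periods.size)
est, tstat = fama_macbeth(y, x, periods)
```

The t-stat in stage two comes directly from the month-to-month variation of the estimated coefficients, which is what makes the method robust to cross-correlation within a period.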
Here is a comparison with monthly pooled regressions with month fixed effects and standard errors clustered by month:
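The time-fixed-effects analog mentioned above can be sketched via the within transformation: demeaning y and x inside each period is numerically equivalent to including a full set of month dummies. This sketch (function name illustrative) returns only the slope and omits the clustered-standard-error part for brevity:

```python
import numpy as np

def pooled_time_fe_slope(y, x, periods):
    """Pooled OLS slope with time fixed effects, via the within
    transformation: demean y and x inside each period, then regress
    demeaned y on demeaned x (equivalent to a dummy per period)."""
    yd = y.astype(float)
    xd = x.astype(float)
    for t in np.unique(periods):
        m = periods == t
        yd[m] -= yd[m].mean()
        xd[m] -= xd[m].mean()
    return float(xd @ yd / (xd @ xd))

# Month effects cancel exactly under demeaning: slope 2.0 is recovered.
rng = np.random.default_rng(1)
periods = np.repeat(np.arange(48), 17)            # 48 months x 17 entities
x = rng.normal(size=periods.size)
y = 2.0 * x + np.repeat(rng.normal(size=48), 17)  # slope 2 + month effects
slope = pooled_time_fe_slope(y, x, periods)
```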
@kdiether Thanks, that's useful. Do you have an example script for your use of fama_macbeth? If having something close to Fama-MacBeth is enough, then we could just rely on standard panel methods, with fixed effects and similar automatic support for pandas; we already have cluster-robust, two-way cluster-robust, panel HAC and Driscoll-Kraay standard errors, mostly consistent with Stata. (Defaults for small-sample corrections and degrees of freedom are still a bit of an open question; they are similar but not all the same as Stata's.) (Edited, because I hit enter by accident.)
@josef-pkt Yep, the Petersen paper discusses how Fama-MacBeth differs from other panel data methods. Also, Cochrane's Asset Pricing book has a discussion of the econometrics of Fama-MacBeth. Fama's (1976) Foundations of Finance has a discussion of it too (the discussion of how the coefficients from Fama-MacBeth are equivalent to zero-cost portfolios is useful). Mitch's paper is probably the most useful reference in this context.
(I edited my previous comment)
> Do you have an example script for your use of fama_macbeth?

Sure, no problem. I just need to switch what I did above from CRSP/Compustat data to something using publicly available data. I get the same estimates using pandas's Fama-MacBeth and Mitch Petersen's Fama-MacBeth Stata ado module:

Pandas:
Stata:
Fama-MacBeth would have been used a lot by some of the guys at AQR, so my guess is the pandas implementation of it has very good numerical accuracy.
@josef-pkt In terms of whether you want it in statsmodels, it really depends on how many academic-finance types (or quant types with academic backgrounds) you have in the user base. Even my recent papers still use Fama-MacBeth. Personally, I would really like to see it still available.
Here is an example. It is an industry-momentum F-M regression using Ken French's 17-industry monthly portfolio returns (http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) from 2010-2013 (if you use a longer sample this will give a nice significant relation). The specification is the following:

r_{i,t+1} = a + b1*r_{i,t-12:t-2} + e_{it}

Data:
Code for the example:

```python
#!/usr/bin/env python
from datetime import datetime

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

def cumret(ret, beg, end):
    return pd.rolling_sum(np.log(1 + ret), window=beg - end + 1).shift(end)

if __name__ == '__main__':
    parse = lambda x: datetime(int(x[0:4]), int(x[4:6]), 1)
    ind = pd.read_csv('data/test.csv', index_col='caldt',
                      parse_dates='caldt', date_parser=parse)
    ind = ind.stack().reset_index()
    ind.columns = ['caldt', 'industry', 'ret']
    ind['ret'] = ind['ret'] / 100
    ind.set_index(['caldt', 'industry'], inplace=True)
    ind['rmom'] = ind['ret'].groupby(level='industry').apply(cumret, 12, 2)
    ind = ind[ind.rmom.notnull()]
    print ind.head(20)
    print pd.fama_macbeth(y=ind['ret'], x=ind.ix[:, ['rmom']])
```

The output:
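A note for readers on current pandas: `pd.rolling_sum`, `.ix`, and the `pandas.stats` Fama-MacBeth code were all later removed. The rolling part of `cumret` translates to the `.rolling()` method; a sketch of the same computation (the surrounding data wrangling is unchanged):

```python
import numpy as np
import pandas as pd

def cumret(ret, beg, end):
    # Log cumulative return from t-beg to t-end, as in the example above,
    # written with the modern .rolling() API instead of pd.rolling_sum.
    return np.log(1 + ret).rolling(window=beg - end + 1).sum().shift(end)

# Small demonstration series of simple returns.
ret = pd.Series([0.10, 0.00, -0.05, 0.02, 0.03, 0.01])
out = cumret(ret, 3, 1)
```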
Note, the reported number of estimated time-series coefficients (#betas) is 36 (in the first stage, 36 cross-sectional regressions are estimated). There are 48 months, but I lose 12 months for each industry computing the return from t-12 to t-2. For comparison, here are the F-M results if I estimate the same thing using data from 1926-2013:
@kdiether Sorry, I forgot to reply here. It's just a question of manpower and who is doing the work.
let's deprecate some of these in 0.18.0 |
related #5884