
Commit 174422e
DOC: stats docs
1 parent db69993 · commit 174422e

2 files changed (+132, -4 lines)

RELEASE.rst (+2)
@@ -250,6 +250,8 @@ Release notes
 * NumPy 1.4 through 1.6 compatibility fixes
 * Fixed bug in bias correction in `rolling_cov`, was affecting `rolling_corr`
   too
+* R-squared value was incorrect in the presence of fixed and time effects in
+  the `PanelOLS` classes
 
 Thanks
 ------

doc/source/stats.rst (+130, -4)
@@ -12,9 +12,10 @@
     randn = np.random.randn
     np.set_printoptions(precision=4, suppress=True)
     import matplotlib.pyplot as plt
+    plt.close('all')
 
 **********************************
-Built-in statistical functionality
+Moving window stats and regression
 **********************************
 
 .. currentmodule:: pandas
@@ -190,14 +191,139 @@ Panel).
 Standard OLS regression
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-Let's create some sample
+Let's pull in some sample data:
+
+.. ipython:: python
+
+    from pandas.io.data import DataReader
+    symbols = ['MSFT', 'GOOG', 'AAPL']
+    data = dict((sym, DataReader(sym, "yahoo"))
+                for sym in symbols)
+    panel = Panel(data).swapaxes('items', 'minor')
+    close_px = panel['close']
+
+    # convert closing prices to returns
+    rets = close_px / close_px.shift(1) - 1
+    rets.info()
+
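
``DataReader`` pulls the prices over the network. As a minimal offline sketch,
synthetic random-walk prices can stand in for the Yahoo data (note the
``panel['volume']`` examples further down still need the real download)::

    # Sketch: synthetic close prices in place of the DataReader call above
    import numpy as np
    from pandas import DataFrame

    np.random.seed(1234)
    close_px = DataFrame(100 + np.random.randn(1000, 3).cumsum(axis=0),
                         columns=['MSFT', 'GOOG', 'AAPL'])
    rets = close_px / close_px.shift(1) - 1
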
+Let's do a static regression of ``AAPL`` returns on ``GOOG`` returns:
+
+.. ipython:: python
+
+    model = ols(y=rets['AAPL'], x=rets.ix[:, ['GOOG']])
+    model
+    model.beta
+
+If we had passed a Series instead of a DataFrame with the single ``GOOG``
+column, the model would have assigned the generic name ``x`` to the sole
+right-hand side variable.
+
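
A quick sketch of that Series case, reusing ``rets`` from above
(``model_series`` is an illustrative name)::

    # Sketch: x passed as a Series; the lone regressor is labeled 'x'
    model_series = ols(y=rets['AAPL'], x=rets['GOOG'])
    model_series.beta['x']    # same slope as model.beta['GOOG'] above
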
+We can do a moving window regression to see how the relationship changes over
+time:
+
+.. ipython:: python
+    :suppress:
+
+    plt.close('all')
+
+.. ipython:: python
+
+    model = ols(y=rets['AAPL'], x=rets.ix[:, ['GOOG']],
+                window=250)
+
+    # just plot the coefficient for GOOG
+    @savefig moving_lm_ex.png width=5in
+    model.beta['GOOG'].plot()
+
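
With ``window=250``, ``model.beta`` is itself a time series of rolling
estimates (one row per window endpoint), which is what the plot above draws; a
quick look, as a sketch::

    # Sketch: the first few rolling coefficient estimates for GOOG
    model.beta['GOOG'][:5]
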
+It looks like there are some outliers rolling in and out of the window in the
+above regression, influencing the results. We could perform a simple
+`winsorization <http://en.wikipedia.org/wiki/Winsorising>`__ at the 3 STD level
+to trim the impact of outliers:
+
+.. ipython:: python
+    :suppress:
+
+    plt.close('all')
+
+.. ipython:: python
+
+    winz = rets.copy()
+    std_1year = rolling_std(rets, 250, min_periods=20)
+
+    # cap at 3 * 1 year standard deviation
+    cap_level = 3 * np.sign(winz) * std_1year
+    winz[np.abs(winz) > 3 * std_1year] = cap_level
+
+    winz_model = ols(y=winz['AAPL'], x=winz.ix[:, ['GOOG']],
+                     window=250)
+
+    model.beta['GOOG'].plot(label="With outliers")
+
+    @savefig moving_lm_winz.png width=5in
+    winz_model.beta['GOOG'].plot(label="Winsorized"); plt.legend(loc='best')
+
+So in this simple example we see that the impact of winsorization is actually
+quite significant. Note that the correlation after winsorization remains high:
 
 .. ipython:: python
 
+    winz.corrwith(rets)
+
+Multiple regressions can be run by passing a DataFrame for ``x`` with multiple
+columns.
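
For example, regressing ``AAPL`` returns on ``GOOG`` and ``MSFT`` together; a
sketch reusing the data above (``multi_model`` is an illustrative name)::

    # Sketch: two right-hand-side variables via a two-column DataFrame
    multi_model = ols(y=rets['AAPL'], x=rets.ix[:, ['GOOG', 'MSFT']])
    multi_model.beta    # one coefficient per column, plus the intercept
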

 Panel regression
 ~~~~~~~~~~~~~~~~
 
 We've implemented moving window panel regression on potentially unbalanced
-panel data (see the linked Wikipedia article above if this means nothing to
-you).
+panel data (see `this article <http://en.wikipedia.org/wiki/Panel_data>`__ if
+this means nothing to you). Suppose we wanted to model the relationship between
+the magnitude of the daily return and trading volume among a group of stocks,
+and wanted to pool all the data together to run one big regression. This is
+actually quite easy:
+
+.. ipython:: python
+
+    # make the units somewhat comparable
+    volume = panel['volume'] / 1e8
+    model = ols(y=volume, x={'return' : np.abs(rets)})
+    model
+
+In a panel model, we can insert dummy (0-1) variables for the "entities"
+involved (here, each of the stocks) to account for an entity-specific effect
+(intercept):
+
+.. ipython:: python
+
+    fe_model = ols(y=volume, x={'return' : np.abs(rets)},
+                   entity_effects=True)
+    fe_model
+
+Because we ran the regression with an intercept, one of the dummy variables
+must be dropped or the design matrix will not be full rank. If we do not use an
+intercept, all of the dummy variables will be included:
+
+.. ipython:: python
+
+    fe_model = ols(y=volume, x={'return' : np.abs(rets)},
+                   entity_effects=True, intercept=False)
+    fe_model
+
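
The rank deficiency is easy to check directly; a sketch with NumPy, using
three entities as in the stock example::

    # Sketch: an intercept column plus one 0-1 dummy per entity is
    # collinear, because the dummies sum to the intercept's column of ones
    import numpy as np

    D = np.eye(3)                           # one dummy column per entity
    X = np.column_stack([np.ones(3), D])    # intercept + all three dummies
    np.linalg.matrix_rank(X)                # rank 3, but X has 4 columns
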
+We can also include *time effects*, which demean the data cross-sectionally at
+each point in time (equivalent to including dummy variables for each
+date). More mathematical care must be taken to properly compute the standard
+errors in this case:
+
+.. ipython:: python
+
+    te_model = ols(y=volume, x={'return' : np.abs(rets)},
+                   time_effects=True, entity_effects=True)
+    te_model
+
+Here the intercept (the mean term) is dropped by default because, having
+subtracted off the group means, it will be 0 under the model assumptions.
+
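
In symbols, the model being fit is, as a sketch (with :math:`\gamma_i` the
entity intercepts and :math:`\delta_t` the time effects):

.. math::

    y_{it} = x_{it}^{\top} \beta + \gamma_i + \delta_t + \varepsilon_{it}

and including time effects amounts to demeaning each date's cross-section of
:math:`N_t` observations:

.. math::

    \tilde{y}_{it} = y_{it} - \frac{1}{N_t} \sum_{j=1}^{N_t} y_{jt}

after which a constant column would be identically 0, hence the dropped
intercept.
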
+Result fields and tests
+~~~~~~~~~~~~~~~~~~~~~~~
+
+We'll leave it to the user to explore the docstrings and source, especially as
+we'll be moving this code into statsmodels in the near future.
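
As a starting point for that exploration, the public attributes of a result
object can be listed interactively; a sketch (the exact attribute names vary
across versions)::

    # Sketch: enumerate the regression result's public attributes
    [name for name in dir(model) if not name.startswith('_')]
    help(type(model))    # docstrings for the result class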
