{{ header }}

Computational tools

Statistical functions

Percent change

Series and DataFrame have a method :meth:`~DataFrame.pct_change` that computes the percent change over a given number of periods. The ``fill_method`` argument controls how NA/null values are filled *before* the percent change is computed.

.. ipython:: python

   ser = pd.Series(np.random.randn(8))

   ser.pct_change()

.. ipython:: python

   df = pd.DataFrame(np.random.randn(10, 4))

   df.pct_change(periods=3)
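
As a rough check on what ``pct_change`` computes, the sketch below reproduces it by hand with ``shift``. It assumes the data contain no missing values, so ``fill_method`` plays no role. The seeded generator and the names ``prices``, ``manual`` and ``builtin`` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
prices = pd.Series(rng.uniform(1.0, 2.0, size=8))  # strictly positive, no NaN

periods = 3
# Percent change over `periods` steps: x[t] / x[t - periods] - 1
manual = prices / prices.shift(periods) - 1

builtin = prices.pct_change(periods=periods)

# Both are NaN in the first `periods` positions and agree elsewhere
same = np.allclose(manual, builtin, equal_nan=True)
print(same)
```

The first ``periods`` entries are NaN in both results because there is no earlier value to compare against.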

Covariance

:meth:`Series.cov` can be used to compute covariance between series (excluding missing values).

.. ipython:: python

   s1 = pd.Series(np.random.randn(1000))
   s2 = pd.Series(np.random.randn(1000))
   s1.cov(s2)
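
To illustrate what "excluding missing values" means here, the following sketch compares ``Series.cov`` with ``np.cov`` applied only to the rows where both series are observed. The data and the names ``x``, ``y`` and ``both`` are made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.standard_normal(50))
y = pd.Series(rng.standard_normal(50))
x.iloc[:5] = np.nan   # missing only in x
y.iloc[40:] = np.nan  # missing only in y

# Keep the rows where both values are present (pairwise complete observations)
both = x.notna() & y.notna()

# np.cov returns the 2x2 sample covariance matrix (ddof=1 by default)
expected = np.cov(x[both].to_numpy(), y[both].to_numpy())[0, 1]

result = x.cov(y)
print(np.isclose(result, expected))
```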

Analogously, :meth:`DataFrame.cov` computes pairwise covariances among the series in the DataFrame, also excluding NA/null values.

.. note::

    Assuming the missing data are missing at random, this results in an unbiased estimate of the covariance matrix. However, for many applications this estimate may not be acceptable, because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations with absolute values greater than one, and/or a non-invertible covariance matrix. See *Estimation of covariance matrices* for more details.

.. ipython:: python

   frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
   frame.cov()
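
The caveat in the note above can be made concrete with a deliberately contrived frame in which every pair of columns is observed together on a different set of rows. This is only a toy construction (the frame ``toy`` and its values are invented for the illustration), but it shows a pairwise-complete covariance matrix that is not positive semi-definite.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Each pair of columns is jointly observed on a different block of rows
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, nan, nan, nan, 1, 2, 3],
        "b": [1, 2, 3, 1, 2, 3, nan, nan, nan],
        "c": [nan, nan, nan, 1, 2, 3, -1, -2, -3],
    }
)

cov = toy.cov()

# Smallest eigenvalue of the symmetric matrix; a negative value means not PSD
min_eig = np.linalg.eigvalsh(cov.to_numpy()).min()

# Correlation implied by the entries of the covariance matrix
implied_ab = cov.loc["a", "b"] / np.sqrt(cov.loc["a", "a"] * cov.loc["b", "b"])

print(min_eig < 0, implied_ab > 1)
```

Here the variances of ``a`` and ``b`` are estimated from six observations each, while their covariance is estimated from only the three rows they share, which is why the implied correlation exceeds one.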

``DataFrame.cov`` also supports an optional ``min_periods`` keyword that specifies the minimum number of observations each column pair must have in order to produce a valid result.

.. ipython:: python

   frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
   frame.loc[frame.index[:5], "a"] = np.nan
   frame.loc[frame.index[5:10], "b"] = np.nan

   frame.cov()

   frame.cov(min_periods=12)
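
To see which entries ``min_periods`` affects, it can help to count the jointly observed rows for every column pair. The sketch below does this for a frame shaped like the one above; the names ``data``, ``present`` and ``pair_counts`` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = pd.DataFrame(rng.standard_normal((20, 3)), columns=["a", "b", "c"])
data.loc[data.index[:5], "a"] = np.nan    # "a" keeps 15 observations
data.loc[data.index[5:10], "b"] = np.nan  # "b" keeps 15 observations

# Number of rows on which each pair of columns is jointly observed
present = data.notna().astype(int)
pair_counts = present.T @ present

restricted = data.cov(min_periods=12)

print(pair_counts)
print(restricted.isna())
```

Only the ``a``/``b`` pair shares fewer than 12 rows (it shares 10), so only that entry and its mirror image come back as NaN.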


Correlation

Correlation may be computed using the :meth:`~DataFrame.corr` method. Using the method parameter, several methods for computing correlations are provided:

.. csv-table::
    :header: "Method name", "Description"
    :widths: 20, 80

    ``pearson (default)``, Standard correlation coefficient
    ``kendall``, Kendall Tau correlation coefficient
    ``spearman``, Spearman rank correlation coefficient

All of these are currently computed using pairwise complete observations. Wikipedia has articles covering the Pearson, Kendall and Spearman correlation coefficients.

.. note::

    Please see the :ref:`caveats <computation.covariance.caveats>` associated with this method of calculating correlation matrices in the :ref:`covariance section <computation.covariance>`.

.. ipython:: python

   frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
   frame.iloc[::2] = np.nan

   # Series with Series
   frame["a"].corr(frame["b"])
   frame["a"].corr(frame["b"], method="spearman")

   # Pairwise correlation of DataFrame columns
   frame.corr()
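
One way to relate the ``spearman`` option to the default is that the Spearman coefficient is the Pearson coefficient of the ranked data. The sketch below checks this on a frame without missing values; the data and the names ``spearman`` and ``pearson_of_ranks`` are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
data = pd.DataFrame(rng.standard_normal((200, 3)), columns=["a", "b", "c"])
# Make "c" depend on "a" in a nonlinear but mostly monotone way
data["c"] = np.exp(data["a"]) + 0.1 * data["c"]

spearman = data.corr(method="spearman")

# Rank each column, then apply the default Pearson correlation to the ranks
pearson_of_ranks = data.rank().corr()

print(np.allclose(spearman, pearson_of_ranks))
```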

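Non-numeric columns cannot take part in the correlation calculation. In pandas 2.0 and later, pass ``numeric_only=True`` to exclude them; the default ``numeric_only=False`` raises an error if such columns are present. Earlier versions excluded them automatically.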

Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword:

.. ipython:: python

   frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
   frame.loc[frame.index[:5], "a"] = np.nan
   frame.loc[frame.index[5:10], "b"] = np.nan

   frame.corr()

   frame.corr(min_periods=12)


The ``method`` argument can also be a callable for a generic correlation calculation. In this case, it should be a single function that produces a single value from two ``ndarray`` inputs. Suppose we wanted to compute the correlation based on histogram intersection:

.. ipython:: python

   # histogram intersection
   def histogram_intersection(a, b):
       return np.minimum(np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()


   frame.corr(method=histogram_intersection)
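
As a sanity check on the callable interface, the sketch below passes a hand-written Pearson coefficient and compares the result with the default. The function ``pearson_by_hand`` is a hypothetical helper written for this example, not part of pandas.

```python
import numpy as np
import pandas as pd


def pearson_by_hand(a, b):
    """Pearson correlation of two 1-D ndarrays, returned as a single float."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered**2).sum() * (b_centered**2).sum()
    )


rng = np.random.default_rng(4)
data = pd.DataFrame(rng.standard_normal((50, 3)), columns=["a", "b", "c"])

custom = data.corr(method=pearson_by_hand)
default = data.corr()

print(np.allclose(custom, default))
```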

A related method :meth:`~DataFrame.corrwith` is implemented on DataFrame to compute the correlation between like-labeled Series contained in different DataFrame objects.

.. ipython:: python

   index = ["a", "b", "c", "d", "e"]
   columns = ["one", "two", "three", "four"]
   df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)
   df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
   df1.corrwith(df2)
   df2.corrwith(df1, axis=1)
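
The column-wise case of ``corrwith`` can be pictured as calling ``Series.corr`` once per shared column label. The sketch below compares the two on seeded data; the names ``left``, ``right`` and ``manual`` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
index = ["a", "b", "c", "d", "e"]
columns = ["one", "two", "three", "four"]
left = pd.DataFrame(rng.standard_normal((5, 4)), index=index, columns=columns)
right = pd.DataFrame(rng.standard_normal((4, 4)), index=index[:4], columns=columns)

by_column = left.corrwith(right)

# Same result built one column at a time; Series.corr aligns on the index,
# so only the shared rows "a" to "d" are used
manual = pd.Series({col: left[col].corr(right[col]) for col in columns})

print(np.allclose(by_column, manual.reindex(by_column.index)))
```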

Data ranking

The :meth:`~Series.rank` method produces a data ranking with ties being assigned the mean of the ranks (by default) for the group:

.. ipython:: python

   s = pd.Series(np.random.randn(5), index=list("abcde"))
   s["d"] = s["b"]  # so there's a tie
   s.rank()

:meth:`~DataFrame.rank` is also a DataFrame method and can rank either the rows (``axis=0``) or the columns (``axis=1``). ``NaN`` values are excluded from the ranking.

.. ipython:: python

   df = pd.DataFrame(np.random.randn(10, 6))
   df[4] = df[2][:5]  # some ties
   df
   df.rank(axis=1)

``rank`` optionally takes a parameter ``ascending``, which is ``True`` by default; when ``False``, data is reverse-ranked, with larger values assigned a smaller rank.
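
A minimal sketch of the two orderings, using a small hand-picked Series without ties (the values are arbitrary):

```python
import pandas as pd

scores = pd.Series([10, 30, 20], index=list("abc"))

ascending_ranks = scores.rank()                  # smallest value gets rank 1
descending_ranks = scores.rank(ascending=False)  # largest value gets rank 1

print(ascending_ranks.tolist())   # [1.0, 3.0, 2.0]
print(descending_ranks.tolist())  # [3.0, 1.0, 2.0]
```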

rank supports different tie-breaking methods, specified with the method parameter:

* ``average`` : average rank of tied group
* ``min`` : lowest rank in the group
* ``max`` : highest rank in the group
* ``first`` : ranks assigned in the order they appear in the array
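
The four tie-breaking methods listed above can be compared side by side on a small Series with one tie. The values are chosen by hand for the illustration, with ``a`` and ``c`` tied for the largest value.

```python
import pandas as pd

scores = pd.Series([7, 3, 7, 1], index=list("abcd"))

# One column of ranks per tie-breaking method
ranks = pd.DataFrame(
    {
        method: scores.rank(method=method)
        for method in ["average", "min", "max", "first"]
    }
)

print(ranks)
```

The tied values would occupy ranks 3 and 4, so ``average`` gives both 3.5, ``min`` gives both 3, ``max`` gives both 4, and ``first`` gives 3 to ``a`` and 4 to ``c`` because ``a`` appears earlier.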

Windowing functions

See :ref:`the window operations user guide <window.overview>` for an overview of windowing functions.