diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst index e591825cec748..be749dfc1f594 100644 --- a/doc/source/advanced.rst +++ b/doc/source/advanced.rst @@ -24,9 +24,9 @@ See the :ref:`Indexing and Selecting Data ` for general indexing docum Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called ``chained assignment`` and should be avoided. See :ref:`Returning a View versus Copy - ` + `. -See the :ref:`cookbook` for some advanced strategies +See the :ref:`cookbook` for some advanced strategies. .. _advanced.hierarchical: @@ -46,7 +46,7 @@ described above and in prior sections. Later, when discussing :ref:`group by non-trivial applications to illustrate how it aids in structuring data for analysis. -See the :ref:`cookbook` for some advanced strategies +See the :ref:`cookbook` for some advanced strategies. Creating a MultiIndex (hierarchical index) object ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -59,7 +59,7 @@ can think of ``MultiIndex`` as an array of tuples where each tuple is unique. A ``MultiIndex.from_tuples``), or a crossed set of iterables (using ``MultiIndex.from_product``). The ``Index`` constructor will attempt to return a ``MultiIndex`` when it is passed a list of tuples. The following examples -demo different ways to initialize MultiIndexes. +demonstrate different ways to initialize MultiIndexes. .. ipython:: python @@ -196,7 +196,8 @@ highly performant. If you want to see the actual used levels. # for a specific level df[['foo','qux']].columns.get_level_values(0) -To reconstruct the ``MultiIndex`` with only the used levels +To reconstruct the ``MultiIndex`` with only the used levels, the +``remove_unused_levels`` method may be used. .. 
versionadded:: 0.20.0 @@ -216,7 +217,7 @@ tuples: s + s[:-2] s + s[::2] -``reindex`` can be called with another ``MultiIndex`` or even a list or array +``reindex`` can be called with another ``MultiIndex``, or even a list or array of tuples: .. ipython:: python @@ -230,7 +231,7 @@ Advanced indexing with hierarchical index ----------------------------------------- Syntactically integrating ``MultiIndex`` in advanced indexing with ``.loc`` is a -bit challenging, but we've made every effort to do so. for example the +bit challenging, but we've made every effort to do so. For example the following works as you would expect: .. ipython:: python @@ -286,7 +287,7 @@ As usual, **both sides** of the slicers are included as this is label indexing. df.loc[(slice('A1','A3'),.....), :] -   rather than this: +   You should **not** do this:   .. code-block:: python @@ -315,7 +316,7 @@ Basic multi-index slicing using slices, lists, and labels. dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :] -You can use a ``pd.IndexSlice`` to have a more natural syntax using ``:`` rather than using ``slice(None)`` +You can use :class:`pandas.IndexSlice` to facilitate a more natural syntax using ``:``, rather than using ``slice(None)``. .. ipython:: python @@ -344,7 +345,7 @@ slicers on a single axis. dfmi.loc(axis=0)[:, :, ['C1', 'C3']] -Furthermore you can *set* the values using these methods +Furthermore you can *set* the values using the following methods. .. ipython:: python @@ -379,7 +380,7 @@ selecting data at a particular level of a MultiIndex easier. df.loc[(slice(None),'one'),:] You can also select on the columns with :meth:`~pandas.MultiIndex.xs`, by -providing the axis argument +providing the axis argument. .. ipython:: python @@ -391,7 +392,7 @@ providing the axis argument # using the slicers df.loc[:,(slice(None),'one')] -:meth:`~pandas.MultiIndex.xs` also allows selection with multiple keys +:meth:`~pandas.MultiIndex.xs` also allows selection with multiple keys. .. 
ipython:: python @@ -403,13 +404,13 @@ providing the axis argument df.loc[:,('bar','one')] You can pass ``drop_level=False`` to :meth:`~pandas.MultiIndex.xs` to retain -the level that was selected +the level that was selected. .. ipython:: python df.xs('one', level='second', axis=1, drop_level=False) -versus the result with ``drop_level=True`` (the default value) +Compare the above with the result using ``drop_level=True`` (the default value). .. ipython:: python @@ -470,7 +471,7 @@ allowing you to permute the hierarchical index levels in one step: Sorting a :class:`~pandas.MultiIndex` ------------------------------------- -For MultiIndex-ed objects to be indexed & sliced effectively, they need +For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use ``sort_index``. .. ipython:: python @@ -623,7 +624,8 @@ Index Types ----------- We have discussed ``MultiIndex`` in the previous sections pretty extensively. ``DatetimeIndex`` and ``PeriodIndex`` -are shown :ref:`here `. ``TimedeltaIndex`` are :ref:`here `. +are shown :ref:`here `, and information about +``TimedeltaIndex`` is found :ref:`here `. In the following sub-sections we will highlight some other index types. @@ -647,7 +649,7 @@ and allows efficient indexing and storage of an index with a large number of dup df.dtypes df.B.cat.categories -Setting the index, will create a ``CategoricalIndex`` +Setting the index will create a ``CategoricalIndex``. .. ipython:: python @@ -655,36 +657,38 @@ Setting the index, will create a ``CategoricalIndex`` df2.index Indexing with ``__getitem__/.iloc/.loc`` works similarly to an ``Index`` with duplicates. -The indexers MUST be in the category or the operation will raise. +The indexers **must** be in the category or the operation will raise a ``KeyError``. .. ipython:: python df2.loc['a'] -These PRESERVE the ``CategoricalIndex`` +The ``CategoricalIndex`` is **preserved** after indexing: ..
ipython:: python df2.loc['a'].index -Sorting will order by the order of the categories +Sorting the index will sort by the order of the categories (recall that we +created the index with ``CategoricalDtype(list('cab'))``, so the sorted +order is ``cab``). .. ipython:: python df2.sort_index() -Groupby operations on the index will preserve the index nature as well +Groupby operations on the index will preserve the index nature as well. .. ipython:: python df2.groupby(level=0).sum() df2.groupby(level=0).sum().index -Reindexing operations, will return a resulting index based on the type of the passed -indexer, meaning that passing a list will return a plain-old-``Index``; indexing with +Reindexing operations will return a resulting index based on the type of the passed +indexer. Passing a list will return a plain-old ``Index``; indexing with a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories -of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with -values NOT in the categories, similarly to how you can reindex ANY pandas index. +of the **passed** ``Categorical`` dtype. This allows one to arbitrarily index these even with +values **not** in the categories, similarly to how you can reindex **any** pandas index. .. ipython :: python @@ -720,7 +724,8 @@ Int64Index and RangeIndex Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see :ref:`here `. -``Int64Index`` is a fundamental basic index in *pandas*. This is an Immutable array implementing an ordered, sliceable set. +``Int64Index`` is a fundamental basic index in pandas. +This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the ``Int64Index`` would provide the default index for all ``NDFrame`` objects. ``RangeIndex`` is a sub-class of ``Int64Index`` added in version 0.18.0, now providing the default index for all ``NDFrame`` objects.
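The hunks above describe sorting by category order on a ``CategoricalIndex`` and the ``RangeIndex`` default introduced in 0.18.0. A minimal sketch of both behaviors (illustrative only, not part of the patch; the name ``df2`` mirrors the doc snippets):

```python
import pandas as pd

# A CategoricalIndex whose categories are ordered ('c', 'a', 'b'),
# mirroring the doc's CategoricalDtype(list('cab')) example
df2 = pd.DataFrame({'A': [1, 2, 3]},
                   index=pd.CategoricalIndex(list('abc'),
                                             categories=list('cab')))

# sort_index() sorts by category order ('c' < 'a' < 'b'), not alphabetically
sorted_labels = df2.sort_index().index.tolist()
print(sorted_labels)  # ['c', 'a', 'b']

# New objects default to a RangeIndex
s = pd.Series([1, 2, 3])
print(type(s.index).__name__)  # RangeIndex
```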
@@ -742,7 +747,7 @@ same. sf = pd.Series(range(5), index=indexf) sf -Scalar selection for ``[],.loc`` will always be label based. An integer will match an equal float index (e.g. ``3`` is equivalent to ``3.0``) +Scalar selection for ``[],.loc`` will always be label based. An integer will match an equal float index (e.g. ``3`` is equivalent to ``3.0``). .. ipython:: python @@ -751,15 +756,17 @@ Scalar selection for ``[],.loc`` will always be label based. An integer will mat sf.loc[3] sf.loc[3.0] -The only positional indexing is via ``iloc`` +The only positional indexing is via ``iloc``. .. ipython:: python sf.iloc[3] -A scalar index that is not found will raise ``KeyError`` +A scalar index that is not found will raise a ``KeyError``. -Slicing is ALWAYS on the values of the index, for ``[],ix,loc`` and ALWAYS positional with ``iloc`` +Slicing is primarily on the values of the index when using ``[],ix,loc``, and +**always** positional when using ``iloc``. The exception is when the slice is +boolean, in which case it will always be positional. .. ipython:: python @@ -767,14 +774,14 @@ Slicing is ALWAYS on the values of the index, for ``[],ix,loc`` and ALWAYS posit sf.loc[2:4] sf.iloc[2:4] -In float indexes, slicing using floats is allowed +In float indexes, slicing using floats is allowed. .. ipython:: python sf[2.1:4.6] sf.loc[2.1:4.6] -In non-float indexes, slicing using floats will raise a ``TypeError`` +In non-float indexes, slicing using floats will raise a ``TypeError``. .. code-block:: ipython @@ -786,7 +793,7 @@ In non-float indexes, slicing using floats will raise a ``TypeError`` .. warning:: - Using a scalar float indexer for ``.iloc`` has been removed in 0.18.0, so the following will raise a ``TypeError`` + Using a scalar float indexer for ``.iloc`` has been removed in 0.18.0, so the following will raise a ``TypeError``: .. 
code-block:: ipython @@ -816,13 +823,13 @@ Selection operations then will always work on a value basis, for all selection o dfir.loc[0:1001,'A'] dfir.loc[1000.4] -You could then easily pick out the first 1 second (1000 ms) of data then. +You could retrieve the first 1 second (1000 ms) of data as such: .. ipython:: python dfir[0:1000] -Of course if you need integer based selection, then use ``iloc`` +If you need integer based selection, you should use ``iloc``: .. ipython:: python @@ -975,6 +982,7 @@ consider the following Series: s Suppose we wished to slice from ``c`` to ``e``, using integers this would be +accomplished as such: .. ipython:: python diff --git a/doc/source/basics.rst b/doc/source/basics.rst index f9995472866ed..da82f56d315e6 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -436,7 +436,7 @@ General DataFrame Combine ~~~~~~~~~~~~~~~~~~~~~~~~~ The :meth:`~DataFrame.combine_first` method above calls the more general -DataFrame method :meth:`~DataFrame.combine`. This method takes another DataFrame +:meth:`DataFrame.combine`. This method takes another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of Series (i.e., columns whose names are the same). @@ -540,8 +540,8 @@ will exclude NAs on Series input by default: np.mean(df['one']) np.mean(df['one'].values) -``Series`` also has a method :meth:`~Series.nunique` which will return the -number of unique non-NA values: +:meth:`Series.nunique` will return the number of unique non-NA values in a +Series: .. ipython:: python @@ -852,7 +852,8 @@ Aggregation API The aggregation API allows one to express possibly multiple aggregation operations in a single concise way. This API is similar across pandas objects, see :ref:`groupby API `, the :ref:`window functions API `, and the :ref:`resample API `. -The entry point for aggregation is the method :meth:`~DataFrame.aggregate`, or the alias :meth:`~DataFrame.agg`. 
+The entry point for aggregation is :meth:`DataFrame.aggregate`, or the alias +:meth:`DataFrame.agg`. We will use a similar starting frame from above: @@ -1913,8 +1914,8 @@ dtype of the column will be chosen to accommodate all of the data types # string data forces an ``object`` dtype pd.Series([1, 2, 3, 6., 'foo']) -The method :meth:`~DataFrame.get_dtype_counts` will return the number of columns of -each type in a ``DataFrame``: +The number of columns of each type in a ``DataFrame`` can be found by calling +:meth:`~DataFrame.get_dtype_counts`. .. ipython:: python diff --git a/doc/source/computation.rst b/doc/source/computation.rst index a6bc9431d3bcc..0994d35999191 100644 --- a/doc/source/computation.rst +++ b/doc/source/computation.rst @@ -26,9 +26,10 @@ Statistical Functions Percent Change ~~~~~~~~~~~~~~ -``Series``, ``DataFrame``, and ``Panel`` all have a method ``pct_change`` to compute the -percent change over a given number of periods (using ``fill_method`` to fill -NA/null values *before* computing the percent change). +``Series``, ``DataFrame``, and ``Panel`` all have a method +:meth:`~DataFrame.pct_change` to compute the percent change over a given number +of periods (using ``fill_method`` to fill NA/null values *before* computing +the percent change). .. ipython:: python @@ -47,8 +48,8 @@ NA/null values *before* computing the percent change). Covariance ~~~~~~~~~~ -The ``Series`` object has a method ``cov`` to compute covariance between series -(excluding NA/null values). +:meth:`Series.cov` can be used to compute covariance between series +(excluding missing values). .. ipython:: python @@ -56,8 +57,9 @@ The ``Series`` object has a method ``cov`` to compute covariance between series s2 = pd.Series(np.random.randn(1000)) s1.cov(s2) -Analogously, ``DataFrame`` has a method ``cov`` to compute pairwise covariances -among the series in the DataFrame, also excluding NA/null values. 
+Analogously, :meth:`DataFrame.cov` can be used to compute +pairwise covariances among the series in the DataFrame, also excluding +NA/null values. .. _computation.covariance.caveats: @@ -97,7 +99,9 @@ in order to have a valid result. Correlation ~~~~~~~~~~~ -Several methods for computing correlations are provided: +Correlation may be computed using the :meth:`~DataFrame.corr` method. +Using the ``method`` parameter, several methods for computing correlations are +provided: .. csv-table:: :header: "Method name", "Description" @@ -110,6 +114,11 @@ Several methods for computing correlations are provided: .. \rho = \cov(x, y) / \sigma_x \sigma_y All of these are currently computed using pairwise complete observations. +Wikipedia has articles covering the above correlation coefficients: + +* `Pearson correlation coefficient `_ +* `Kendall rank correlation coefficient `_ +* `Spearman's rank correlation coefficient `_ .. note:: @@ -145,9 +154,9 @@ Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword: frame.corr(min_periods=12) -A related method ``corrwith`` is implemented on DataFrame to compute the -correlation between like-labeled Series contained in different DataFrame -objects. +A related method :meth:`~DataFrame.corrwith` is implemented on DataFrame to +compute the correlation between like-labeled Series contained in different +DataFrame objects. .. ipython:: python @@ -163,8 +172,8 @@ objects. Data ranking ~~~~~~~~~~~~ -The ``rank`` method produces a data ranking with ties being assigned the mean -of the ranks (by default) for the group: +The :meth:`~Series.rank` method produces a data ranking with ties being +assigned the mean of the ranks (by default) for the group: .. ipython:: python @@ -172,8 +181,9 @@ of the ranks (by default) for the group: s['d'] = s['b'] # so there's a tie s.rank() -``rank`` is also a DataFrame method and can rank either the rows (``axis=0``) -or the columns (``axis=1``). ``NaN`` values are excluded from the ranking.
+:meth:`~DataFrame.rank` is also a DataFrame method and can rank either the rows +(``axis=0``) or the columns (``axis=1``). ``NaN`` values are excluded from the +ranking. .. ipython:: python @@ -205,7 +215,7 @@ Window Functions Prior to version 0.18.0, ``pd.rolling_*``, ``pd.expanding_*``, and ``pd.ewm*`` were module level functions and are now deprecated. These are replaced by using the :class:`~pandas.core.window.Rolling`, :class:`~pandas.core.window.Expanding` and :class:`~pandas.core.window.EWM`. objects and a corresponding method call. - The deprecation warning will show the new syntax, see an example :ref:`here ` + The deprecation warning will show the new syntax, see an example :ref:`here `. For working with data, a number of windows functions are provided for computing common *window* or *rolling* statistics. Among these are count, sum, @@ -219,7 +229,7 @@ see the :ref:`groupby docs `. .. note:: - The API for window statistics is quite similar to the way one works with ``GroupBy`` objects, see the documentation :ref:`here ` + The API for window statistics is quite similar to the way one works with ``GroupBy`` objects, see the documentation :ref:`here `. We work with ``rolling``, ``expanding`` and ``exponentially weighted`` data through the corresponding objects, :class:`~pandas.core.window.Rolling`, :class:`~pandas.core.window.Expanding` and :class:`~pandas.core.window.EWM`. @@ -289,7 +299,7 @@ sugar for applying the moving window operator to all of the DataFrame's columns: Method Summary ~~~~~~~~~~~~~~ -We provide a number of the common statistical functions: +We provide a number of common statistical functions: .. currentmodule:: pandas.core.window @@ -564,7 +574,7 @@ Computing rolling pairwise covariances and correlations .. warning:: Prior to version 0.20.0 if ``pairwise=True`` was passed, a ``Panel`` would be returned. 
- This will now return a 2-level MultiIndexed DataFrame, see the whatsnew :ref:`here ` + This will now return a 2-level MultiIndexed DataFrame, see the whatsnew :ref:`here `. In financial data analysis and other fields it's common to compute covariance and correlation matrices for a collection of time series. Often one is also @@ -623,7 +633,8 @@ perform multiple computations on the data. These operations are similar to the : r = dfa.rolling(window=60,min_periods=1) r -We can aggregate by passing a function to the entire DataFrame, or select a Series (or multiple Series) via standard getitem. +We can aggregate by passing a function to the entire DataFrame, or select a +Series (or multiple Series) via standard ``__getitem__``. .. ipython:: python @@ -741,14 +752,14 @@ all accept are: - ``min_periods``: threshold of non-null data points to require. Defaults to minimum needed to compute statistic. No ``NaNs`` will be output once ``min_periods`` non-null data points have been seen. -- ``center``: boolean, whether to set the labels at the center (default is False) +- ``center``: boolean, whether to set the labels at the center (default is False). .. _stats.moments.expanding.note: .. note:: The output of the ``.rolling`` and ``.expanding`` methods do not return a ``NaN`` if there are at least ``min_periods`` non-null values in the current - window. For example, + window. For example: .. ipython:: python @@ -818,7 +829,8 @@ In general, a weighted moving average is calculated as y_t = \frac{\sum_{i=0}^t w_i x_{t-i}}{\sum_{i=0}^t w_i}, -where :math:`x_t` is the input and :math:`y_t` is the result. +where :math:`x_t` is the input, :math:`y_t` is the result and the :math:`w_i` +are the weights. The EW functions support two variants of exponential weights. The default, ``adjust=True``, uses the weights :math:`w_i = (1 - \alpha)^i` @@ -931,7 +943,7 @@ average of ``3, NaN, 5`` would be calculated as .. 
math:: - \frac{(1-\alpha)^2 \cdot 3 + 1 \cdot 5}{(1-\alpha)^2 + 1} + \frac{(1-\alpha)^2 \cdot 3 + 1 \cdot 5}{(1-\alpha)^2 + 1}. Whereas if ``ignore_na=True``, the weighted average would be calculated as @@ -953,4 +965,4 @@ are scaled by debiasing factors (For :math:`w_i = 1`, this reduces to the usual :math:`N / (N - 1)` factor, with :math:`N = t + 1`.) See `Weighted Sample Variance `__ -for further details. +on Wikipedia for further details. diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst index 3c2fd4d959d63..b9223c6ad9f7a 100644 --- a/doc/source/indexing.rst +++ b/doc/source/indexing.rst @@ -18,16 +18,14 @@ Indexing and Selecting Data The axis labeling information in pandas objects serves many purposes: - Identifies data (i.e. provides *metadata*) using known indicators, - important for analysis, visualization, and interactive console display - - Enables automatic and explicit data alignment - - Allows intuitive getting and setting of subsets of the data set + important for analysis, visualization, and interactive console display. + - Enables automatic and explicit data alignment. + - Allows intuitive getting and setting of subsets of the data set. In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in -this area. Expect more work to be invested in higher-dimensional data -structures (including ``Panel``) in the future, especially in label-based -advanced indexing. +this area. .. note:: @@ -43,9 +41,9 @@ advanced indexing. .. warning:: Whether a copy or a reference is returned for a setting operation, may - depend on the context. This is sometimes called ``chained assignment`` and - should be avoided. See :ref:`Returning a View versus Copy - ` + depend on the context. This is sometimes called ``chained assignment`` and + should be avoided. 
See :ref:`Returning a View versus Copy + `. .. warning:: @@ -53,7 +51,7 @@ advanced indexing. See the :ref:`MultiIndex / Advanced Indexing ` for ``MultiIndex`` and more advanced indexing documentation. -See the :ref:`cookbook` for some advanced strategies +See the :ref:`cookbook` for some advanced strategies. .. _indexing.choice: @@ -66,21 +64,21 @@ of multi-axis indexing. - ``.loc`` is primarily label based, but may also be used with a boolean array. ``.loc`` will raise ``KeyError`` when the items are not found. Allowed inputs are: - - A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is interpreted as a + - A single label, e.g. ``5`` or ``'a'`` (Note that ``5`` is interpreted as a *label* of the index. This use is **not** an integer position along the - index) - - A list or array of labels ``['a', 'b', 'c']`` - - A slice object with labels ``'a':'f'`` (note that contrary to usual python + index.). + - A list or array of labels ``['a', 'b', 'c']``. + - A slice object with labels ``'a':'f'`` (Note that contrary to usual python slices, **both** the start and the stop are included, when present in the - index! - also see :ref:`Slicing with labels - `) + index! See :ref:`Slicing with labels + `.). - A boolean array - A ``callable`` function with one argument (the calling Series, DataFrame or Panel) and - that returns valid output for indexing (one of the above) + that returns valid output for indexing (one of the above). .. versionadded:: 0.18.1 - See more at :ref:`Selection by Label ` + See more at :ref:`Selection by Label `. - ``.iloc`` is primarily integer position based (from ``0`` to ``length-1`` of the axis), but may also be used with a boolean @@ -89,27 +87,26 @@ of multi-axis indexing. out-of-bounds indexing. (this conforms with python/numpy *slice* semantics). Allowed inputs are: - - An integer e.g. ``5`` - - A list or array of integers ``[4, 3, 0]`` - - A slice object with ints ``1:7`` - - A boolean array + - An integer e.g. ``5``. 
+ - A list or array of integers ``[4, 3, 0]``. + - A slice object with ints ``1:7``. + - A boolean array. - A ``callable`` function with one argument (the calling Series, DataFrame or Panel) and - that returns valid output for indexing (one of the above) + that returns valid output for indexing (one of the above). .. versionadded:: 0.18.1 - See more at :ref:`Selection by Position ` - - See more at :ref:`Advanced Indexing ` and :ref:`Advanced + See more at :ref:`Selection by Position `, + :ref:`Advanced Indexing ` and :ref:`Advanced Hierarchical `. - ``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. See more at :ref:`Selection By Callable `. Getting values from an object with multi-axes selection uses the following -notation (using ``.loc`` as an example, but applies to ``.iloc`` as +notation (using ``.loc`` as an example, but the following applies to ``.iloc`` as well). Any of the axes accessors may be the null slice ``:``. Axes left out of -the specification are assumed to be ``:``. (e.g. ``p.loc['a']`` is equiv to -``p.loc['a', :, :]``) +the specification are assumed to be ``:``, e.g. ``p.loc['a']`` is equivalent to +``p.loc['a', :, :]``. .. csv-table:: :header: "Object Type", "Indexers" @@ -128,7 +125,8 @@ Basics As mentioned when introducing the data structures in the :ref:`last section `, the primary function of indexing with ``[]`` (a.k.a. ``__getitem__`` for those familiar with implementing class behavior in Python) is selecting out -lower-dimensional slices. Thus, +lower-dimensional slices. The following table shows return type values when +indexing pandas objects with ``[]``: .. csv-table:: :header: "Object Type", "Selection", "Return Value Type" @@ -188,7 +186,7 @@ columns. df.loc[:,['B', 'A']] = df[['A', 'B']] df[['A', 'B']] - The correct way is to use raw values + The correct way to swap column values is by using raw values: .. 
ipython:: python @@ -310,7 +308,7 @@ Selection By Label Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called ``chained assignment`` and should be avoided. - See :ref:`Returning a View versus Copy ` + See :ref:`Returning a View versus Copy `. .. warning:: @@ -336,23 +334,23 @@ Selection By Label .. warning:: Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list with missing labels. In the future - this will raise a ``KeyError``. See :ref:`list-like Using loc with missing keys in a list is Deprecated ` + this will raise a ``KeyError``. See :ref:`list-like Using loc with missing keys in a list is Deprecated `. pandas provides a suite of methods in order to have **purely label based indexing**. This is a strict inclusion based protocol. -All of the labels for which you ask, must be in the index or a ``KeyError`` will be raised! +Every label asked for must be in the index, or a ``KeyError`` will be raised. When slicing, both the start bound **AND** the stop bound are *included*, if present in the index. Integers are valid labels, but they refer to the label **and not the position**. The ``.loc`` attribute is the primary access method. The following are valid inputs: -- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is interpreted as a *label* of the index. This use is **not** an integer position along the index) -- A list or array of labels ``['a', 'b', 'c']`` -- A slice object with labels ``'a':'f'`` (note that contrary to usual python +- A single label, e.g. ``5`` or ``'a'`` (Note that ``5`` is interpreted as a *label* of the index. This use is **not** an integer position along the index.). +- A list or array of labels ``['a', 'b', 'c']``. +- A slice object with labels ``'a':'f'`` (Note that contrary to usual python slices, **both** the start and the stop are included, when present in the - index! 
- also See :ref:`Slicing with labels - `) -- A boolean array -- A ``callable``, see :ref:`Selection By Callable ` + index! See :ref:`Slicing with labels + `.). +- A boolean array. +- A ``callable``, see :ref:`Selection By Callable `. .. ipython:: python @@ -368,7 +366,7 @@ Note that setting works as well: s1.loc['c':] = 0 s1 -With a DataFrame +With a DataFrame: .. ipython:: python @@ -378,26 +376,26 @@ With a DataFrame df1 df1.loc[['a', 'b', 'd'], :] -Accessing via label slices +Accessing via label slices: .. ipython:: python df1.loc['d':, 'A':'C'] -For getting a cross section using a label (equiv to ``df.xs('a')``) +For getting a cross section using a label (equivalent to ``df.xs('a')``): .. ipython:: python df1.loc['a'] -For getting values with a boolean array +For getting values with a boolean array: .. ipython:: python df1.loc['a'] > 0 df1.loc[:, df1.loc['a'] > 0] -For getting a value explicitly (equiv to deprecated ``df.get_value('a','A')``) +For getting a value explicitly (equivalent to deprecated ``df.get_value('a','A')``): .. ipython:: python @@ -441,17 +439,17 @@ Selection By Position Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called ``chained assignment`` and should be avoided. - See :ref:`Returning a View versus Copy ` + See :ref:`Returning a View versus Copy `. Pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics follow closely python and numpy slicing. These are ``0-based`` indexing. When slicing, the start bounds is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label will raise an ``IndexError``. The ``.iloc`` attribute is the primary access method. The following are valid inputs: -- An integer e.g. ``5`` -- A list or array of integers ``[4, 3, 0]`` -- A slice object with ints ``1:7`` -- A boolean array -- A ``callable``, see :ref:`Selection By Callable ` +- An integer e.g. ``5``. 
+- A list or array of integers ``[4, 3, 0]``.
+- A slice object with ints ``1:7``.
+- A boolean array.
+- A ``callable``, see :ref:`Selection By Callable <indexing.callable>`.
 
 .. ipython:: python
 
@@ -467,7 +465,7 @@ Note that setting works as well:
    s1.iloc[:3] = 0
    s1
 
-With a DataFrame
+With a DataFrame:
 
 .. ipython:: python
 
@@ -476,14 +474,14 @@ With a DataFrame
                       columns=list(range(0,8,2)))
    df1
 
-Select via integer slicing
+Select via integer slicing:
 
 .. ipython:: python
 
    df1.iloc[:3]
    df1.iloc[1:5, 2:4]
 
-Select via integer list
+Select via integer list:
 
 .. ipython:: python
 
@@ -502,7 +500,7 @@ Select via integer list
    # this is also equivalent to ``df1.iat[1,1]``
    df1.iloc[1, 1]
 
-For getting a cross section using an integer position (equiv to ``df.xs(1)``)
+For getting a cross section using an integer position (equiv to ``df.xs(1)``):
 
 .. ipython:: python
 
@@ -523,7 +521,7 @@ Out of range slice indexes are handled gracefully just as in Python/Numpy.
    s.iloc[8:10]
 
 Note that using slices that go out of bounds can result in
-an empty axis (e.g. an empty DataFrame being returned)
+an empty axis (e.g. an empty DataFrame being returned).
 
 .. ipython:: python
 
@@ -535,7 +533,7 @@ an empty axis (e.g. an empty DataFrame being returned)
 
 A single indexer that is out of bounds will raise an ``IndexError``. A list
 of indexers where any element is out of bounds will raise an
-``IndexError``
+``IndexError``.
 
 .. code-block:: python
 
@@ -601,7 +599,7 @@ bit of user confusion over the years.
 
 The recommended methods of indexing are:
 
-- ``.loc`` if you want to *label* index
+- ``.loc`` if you want to *label* index.
 - ``.iloc`` if you want to *positionally* index.
 
 .. ipython:: python
 
@@ -612,7 +610,7 @@ The recommended methods of indexing are:
 
    dfd
 
-Previous Behavior, where you wish to get the 0th and the 2nd elements from the index in the 'A' column.
+Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the 'A' column.
 
 .. code-block:: ipython
 
@@ -635,7 +633,7 @@ This can also be expressed using ``.iloc``, by explicitly getting locations on t
 
    dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
 
-For getting *multiple* indexers, using ``.get_indexer``
+For getting *multiple* indexers, using ``.get_indexer``:
 
 .. ipython:: python
 
@@ -824,7 +822,7 @@ Setting With Enlargement
 The ``.loc/[]`` operations can perform enlargement when setting a non-existent
 key for that axis.
 
-In the ``Series`` case this is effectively an appending operation
+In the ``Series`` case this is effectively an appending operation.
 
 .. ipython:: python
 
@@ -833,7 +831,7 @@ In the ``Series`` case this is effectively an appending operation
    se = pd.Series([1,2,3])
    se[5] = 5.
    se
 
-A ``DataFrame`` can be enlarged on either axis via ``.loc``
+A ``DataFrame`` can be enlarged on either axis via ``.loc``.
 
 .. ipython:: python
 
@@ -889,7 +887,11 @@ Boolean indexing
 .. _indexing.boolean:
 
 Another common operation is the use of boolean vectors to filter the data.
-The operators are: ``|`` for ``or``, ``&`` for ``and``, and ``~`` for ``not``. These **must** be grouped by using parentheses.
+The operators are: ``|`` for ``or``, ``&`` for ``and``, and ``~`` for ``not``.
+These **must** be grouped by using parentheses, since by default Python will
+evaluate an expression such as ``df.A > 2 & df.B < 3`` as
+``df.A > (2 & df.B) < 3``, while the desired evaluation order is
+``(df.A > 2) & (df.B < 3)``.
 
 Using a boolean vector to index a Series works exactly as in a numpy ndarray:
 
@@ -929,7 +931,7 @@ more complex criteria:
    # Multiple criteria
    df2[criterion & (df2['b'] == 'x')]
 
-Note, with the choice methods :ref:`Selection by Label <indexing.label>`, :ref:`Selection by Position <indexing.integer>`,
+With the choice methods :ref:`Selection by Label <indexing.label>`, :ref:`Selection by Position <indexing.integer>`,
 and :ref:`Advanced Indexing <advanced>` you may select along more than one axis using boolean vectors combined with other indexing expressions.
 
 .. ipython:: python
 
@@ -941,9 +943,9 @@ and :ref:`Advanced Indexing <advanced>` you may select along more than one axis
 Indexing with isin
 ------------------
 
-Consider the ``isin`` method of Series, which returns a boolean vector that is
-true wherever the Series elements exist in the passed list. This allows you to
-select rows where one or more columns have values you want:
+Consider the :meth:`~Series.isin` method of ``Series``, which returns a boolean
+vector that is true wherever the ``Series`` elements exist in the passed list.
+This allows you to select rows where one or more columns have values you want:
 
 .. ipython:: python
 
@@ -973,7 +975,7 @@ in the membership check:
    s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
    s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
 
-DataFrame also has an ``isin`` method. When calling ``isin``, pass a set of
+DataFrame also has an :meth:`~DataFrame.isin` method. When calling ``isin``, pass a set of
 values as either an array or dict. If values is an array, ``isin`` returns
 a DataFrame of booleans that is the same shape as the original DataFrame, with True
 wherever the element is in the sequence of values.
 
@@ -1018,13 +1020,13 @@ Selecting values from a Series with a boolean vector generally returns a
 subset of the data. To guarantee that selection output has the same shape as
 the original data, you can use the ``where`` method in ``Series`` and ``DataFrame``.
 
-To return only the selected rows
+To return only the selected rows:
 
 .. ipython:: python
 
    s[s > 0]
 
-To return a Series of the same shape as the original
+To return a Series of the same shape as the original:
 
 .. ipython:: python
 
@@ -1032,7 +1034,7 @@ To return a Series of the same shape as the original
    s.where(s > 0)
 
 Selecting values from a DataFrame with a boolean criterion now also preserves
 input data shape. ``where`` is used under the hood as the implementation.
-Equivalent is ``df.where(df < 0)``
+The code below is equivalent to ``df.where(df < 0)``.
 
 .. ipython:: python
    :suppress:
 
@@ -1087,12 +1089,12 @@ without creating a copy:
 
 Furthermore, ``where`` aligns the input boolean condition (ndarray or DataFrame),
 such that partial selection with setting is possible. This is analogous to
-partial setting via ``.loc`` (but on the contents rather than the axis labels)
+partial setting via ``.loc`` (but on the contents rather than the axis labels).
 
 .. ipython:: python
 
    df2 = df.copy()
-   df2[ df2[1:4] > 0 ] = 3
+   df2[df2[1:4] > 0] = 3
    df2
 
 Where can also accept ``axis`` and ``level`` parameters to align the input when
@@ -1103,7 +1105,7 @@ performing the ``where``.
 
    df2 = df.copy()
    df2.where(df2>0,df2['A'],axis='index')
 
-This is equivalent (but faster than) the following.
+This is equivalent to (but faster than) the following.
 
 .. ipython:: python
 
@@ -1123,9 +1125,11 @@ as condition and ``other`` argument.
                        'C': [7, 8, 9]})
    df3.where(lambda x: x > 4, lambda x: x + 10)
 
-**mask**
 
-``mask`` is the inverse boolean operation of ``where``.
+Mask
+~~~~
+
+:meth:`~pandas.DataFrame.mask` is the inverse boolean operation of ``where``.
 
 .. ipython:: python
 
    s.mask(s >= 0)
    df.mask(df >= 0)
 
@@ -1134,8 +1138,8 @@ as condition and ``other`` argument.
 .. _indexing.query:
 
-The :meth:`~pandas.DataFrame.query` Method (Experimental)
----------------------------------------------------------
+The :meth:`~pandas.DataFrame.query` Method
+------------------------------------------
 
 :class:`~pandas.DataFrame` objects have a :meth:`~pandas.DataFrame.query`
 method that allows selection using an expression.
 
@@ -1263,7 +1267,7 @@ having to specify which frame you're interested in querying
 
 :meth:`~pandas.DataFrame.query` Python versus pandas Syntax Comparison
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Full numpy-like syntax
+Full numpy-like syntax:
 
 .. ipython:: python
 
@@ -1273,19 +1277,19 @@ Full numpy-like syntax
    df.query('(a < b) & (b < c)')
    df[(df.a < df.b) & (df.b < df.c)]
 
 Slightly nicer by removing the parentheses (by binding making comparison
-operators bind tighter than ``&``/``|``)
+operators bind tighter than ``&`` and ``|``).
 
 .. ipython:: python
 
    df.query('a < b & b < c')
 
-Use English instead of symbols
+Use English instead of symbols:
 
 .. ipython:: python
 
    df.query('a < b and b < c')
 
-Pretty close to how you might write it on paper
+Pretty close to how you might write it on paper:
 
 .. ipython:: python
 
@@ -1356,7 +1360,7 @@ Special use of the ``==`` operator with ``list`` objects
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Comparing a ``list`` of values to a column using ``==``/``!=`` works similarly
-to ``in``/``not in``
+to ``in``/``not in``.
 
 .. ipython:: python
 
@@ -1391,7 +1395,7 @@ You can negate boolean expressions with the word ``not`` or the ``~`` operator.
    df.query('not bools')
    df.query('not bools') == df[~df.bools]
 
-Of course, expressions can be arbitrarily complex too
+Of course, expressions can be arbitrarily complex too:
 
 .. ipython:: python
 
@@ -1420,7 +1424,7 @@ Performance of :meth:`~pandas.DataFrame.query`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 ``DataFrame.query()`` using ``numexpr`` is slightly faster than Python for
-large frames
+large frames.
 
 .. image:: _static/query-perf.png
 
@@ -1428,7 +1432,7 @@ large frames
 
    You will only see the performance benefits of using the ``numexpr`` engine
    with ``DataFrame.query()`` if your frame has more than approximately 200,000
-   rows
+   rows.
 
 .. image:: _static/query-perf-small.png
 
@@ -1482,7 +1486,7 @@ Also, you can pass a list of columns to identify duplications.
    df2.drop_duplicates(['a', 'b'])
 
 To drop duplicates by index value, use ``Index.duplicated`` then perform slicing.
-Same options are available in ``keep`` parameter.
+The same set of options is available for the ``keep`` parameter.
 
 .. ipython:: python
 
@@ -1514,7 +1518,7 @@ The :meth:`~pandas.DataFrame.lookup` Method
 
 Sometimes you want to extract a set of values given a sequence of row labels
 and column labels, and the ``lookup`` method allows for this and returns a
-numpy array. For instance,
+numpy array. For instance:
 
 .. ipython:: python
 
@@ -1599,7 +1603,7 @@ Set operations on Index objects
 
 .. _indexing.set_ops:
 
-The two main operations are ``union (|)``, ``intersection (&)``
+The two main operations are ``union (|)`` and ``intersection (&)``.
 These can be directly called as instance methods or used via overloaded
 operators. Difference is provided via the ``.difference()`` method.
 
@@ -1612,7 +1616,7 @@ operators. Difference is provided via the ``.difference()`` method.
 
    a.difference(b)
 
 Also available is the ``symmetric_difference (^)`` operation, which returns elements
-that appear in either ``idx1`` or ``idx2`` but not both. This is
+that appear in either ``idx1`` or ``idx2``, but not in both. This is
 equivalent to the Index created by ``idx1.difference(idx2).union(idx2.difference(idx1))``,
 with duplicates dropped.
 
@@ -1662,9 +1666,9 @@ Set an index
 
 .. _indexing.set_index:
 
-DataFrame has a ``set_index`` method which takes a column name (for a regular
-``Index``) or a list of column names (for a ``MultiIndex``), to create a new,
-indexed DataFrame:
+DataFrame has a :meth:`~DataFrame.set_index` method which takes a column name
+(for a regular ``Index``) or a list of column names (for a ``MultiIndex``).
+To create a new, re-indexed DataFrame:
 
 .. ipython:: python
    :suppress:
 
@@ -1703,9 +1707,10 @@ the index in-place (without creating a new object):
 
 Reset the index
 ~~~~~~~~~~~~~~~
 
-As a convenience, there is a new function on DataFrame called ``reset_index``
-which transfers the index values into the DataFrame's columns and sets a simple
-integer index. This is the inverse operation to ``set_index``
+As a convenience, there is a new function on DataFrame called
+:meth:`~DataFrame.reset_index` which transfers the index values into the
+DataFrame's columns and sets a simple integer index.
+This is the inverse operation of :meth:`~DataFrame.set_index`.
 
 .. ipython:: python
 
@@ -1726,11 +1731,6 @@ You can use the ``level`` keyword to remove only a portion of the index:
 ``reset_index`` takes an optional parameter ``drop`` which if true simply
 discards the index, instead of putting index values in the DataFrame's columns.
 
-.. note::
-
-   The ``reset_index`` method used to be called ``delevel`` which is now
-   deprecated.
-
 Adding an ad hoc index
 ~~~~~~~~~~~~~~~~~~~~~~
 
@@ -1769,7 +1769,7 @@ Compare these two access methods:
 
    dfmi.loc[:,('one','second')]
 
 These both yield the same results, so which should you use? It is instructive to understand the order
-of operations on these and why method 2 (``.loc``) is much preferred over method 1 (chained ``[]``)
+of operations on these and why method 2 (``.loc``) is much preferred over method 1 (chained ``[]``).
 
 ``dfmi['one']`` selects the first level of the columns and returns a DataFrame that is singly-indexed.
 Then another python operation ``dfmi_with_one['second']`` selects the series indexed by ``'second'`` happens.
 
@@ -1807,7 +1807,7 @@ But this code is handled differently:
 
 See that ``__getitem__`` in there? Outside of simple cases, it's very hard to
 predict whether it will return a view or a copy (it depends on the memory layout
-of the array, about which *pandas* makes no guarantees), and therefore whether
+of the array, about which pandas makes no guarantees), and therefore whether
 the ``__setitem__`` will modify ``dfmi`` or a temporary object that gets thrown
 out immediately afterward. **That's** what ``SettingWithCopy`` is warning you
 about!
 
@@ -1882,9 +1882,9 @@ A chained assignment can also crop up in setting in a mixed dtype frame.
 
 .. note::
 
-   These setting rules apply to all of ``.loc/.iloc``
+   These setting rules apply to all of ``.loc/.iloc``.
 
-This is the correct access method
+This is the correct access method:
 
 .. ipython:: python
 
@@ -1892,7 +1892,7 @@ This is the correct access method
    dfc.loc[0,'A'] = 11
   dfc
 
-This *can* work at times, but is not guaranteed, and so should be avoided
+This *can* work at times, but it is not guaranteed to, and therefore should be avoided:
 
 .. ipython:: python
    :okwarning:
 
@@ -1901,7 +1901,7 @@ This *can* work at times, but is not guaranteed, and so should be avoided
    dfc = dfc.copy()
   dfc['A'][0] = 111
   dfc
 
-This will **not** work at all, and so should be avoided
+This will **not** work at all, and so should be avoided:
 
 ::
 
diff --git a/doc/source/options.rst b/doc/source/options.rst
index 505a5ade68de0..5641b2628fe40 100644
--- a/doc/source/options.rst
+++ b/doc/source/options.rst
@@ -37,7 +37,7 @@ namespace:
 
 - :func:`~pandas.option_context` - execute a codeblock with a set of options that
   revert to prior settings after execution.
 
-**Note:** developers can check out pandas/core/config.py for more info.
+**Note:** Developers can check out `pandas/core/config.py <https://github.com/pandas-dev/pandas/blob/master/pandas/core/config.py>`_ for more information.
 
 All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
 and so passing in a substring will work - as long as it is unambiguous:
 
@@ -78,8 +78,9 @@ with no argument ``describe_option`` will print out the descriptions for all ava
 Getting and Setting Options
 ---------------------------
 
-As described above, ``get_option()`` and ``set_option()`` are available from the
-pandas namespace. To change an option, call ``set_option('option regex', new_value)``
+As described above, :func:`~pandas.get_option` and :func:`~pandas.set_option`
+are available from the pandas namespace. To change an option, call
+``set_option('option regex', new_value)``.
 
 .. ipython:: python
 
@@ -87,7 +88,7 @@ pandas namespace. To change an option, call ``set_option('option regex', new_va
    pd.set_option('mode.sim_interactive', True)
   pd.get_option('mode.sim_interactive')
 
-**Note:** that the option 'mode.sim_interactive' is mostly used for debugging purposes.
+**Note:** The option 'mode.sim_interactive' is mostly used for debugging purposes.
 
 All options also have a default value, and you can use ``reset_option`` to
 do just that:
 
@@ -221,7 +222,7 @@ can specify the option ``df.info(null_counts=True)`` to override on showing a pa
 
 .. ipython:: python
 
-   df =pd.DataFrame(np.random.choice([0,1,np.nan], size=(10,10)))
+   df = pd.DataFrame(np.random.choice([0,1,np.nan], size=(10,10)))
   df
   pd.set_option('max_info_rows', 11)
   df.info()
@@ -229,8 +230,8 @@ can specify the option ``df.info(null_counts=True)`` to override on showing a pa
   df.info()
  pd.reset_option('max_info_rows')
 
-``display.precision`` sets the output display precision in terms of decimal places. This is only a
-suggestion.
+``display.precision`` sets the output display precision in terms of decimal places.
+This is only a suggestion.
 
 .. ipython:: python
 
@@ -241,7 +242,7 @@ suggestion.
   df
 
 ``display.chop_threshold`` sets at what level pandas rounds to zero when
-it displays a Series of DataFrame. Note, this does not effect the
+it displays a Series or DataFrame. This setting does not change the
 precision at which the number is stored.
 
 .. ipython:: python
 
@@ -254,7 +255,7 @@ precision at which the number is stored.
   pd.reset_option('chop_threshold')
 
 ``display.colheader_justify`` controls the justification of the headers.
-Options are 'right', and 'left'.
+The options are 'right' and 'left'.
 
 .. ipython:: python
 
diff --git a/doc/source/text.rst b/doc/source/text.rst
index 2a86d92978043..2b6459b581c1e 100644
--- a/doc/source/text.rst
+++ b/doc/source/text.rst
@@ -99,7 +99,7 @@ Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
 
   s2.str.split('_').str.get(1)
   s2.str.split('_').str[1]
 
-Easy to expand this to return a DataFrame using ``expand``.
+It is easy to expand this to return a DataFrame using ``expand``.
 
 .. ipython:: python
 
@@ -268,7 +268,7 @@ It returns a Series if ``expand=False``.
   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
 
 Calling on an ``Index`` with a regex with exactly one capture group
-returns a ``DataFrame`` with one column if ``expand=True``,
+returns a ``DataFrame`` with one column if ``expand=True``.
 
 .. ipython:: python
 
@@ -373,7 +373,7 @@ You can check whether elements contain a pattern:
 
   pattern = r'[0-9][a-z]'
   pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
 
-or match a pattern:
+Or whether elements match a pattern:
 
 .. ipython:: python