DOC: create shared includes for content shared by comparison docs (#38887)

afeld · web-flow · commit 28b938d4ed76 · 2021-01-03T12:26:01.000-05:00
diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -8,7 +8,7 @@ For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(softwar
 this page is meant to demonstrate how different SAS operations would be
 performed in pandas.
 
-.. include:: comparison_boilerplate.rst
+.. include:: includes/introduction.rst
 
 .. note::
 
@@ -93,16 +93,7 @@ specifying the column names.
        ;
    run;
 
-A pandas ``DataFrame`` can be constructed in many different ways,
-but for a small number of values, it is often convenient to specify it as
-a Python dictionary, where the keys are the column names
-and the values are the data.
-
-.. ipython:: python
-
-   df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})
-   df
-
+.. include:: includes/construct_dataframe.rst
 
 Reading external data
 ~~~~~~~~~~~~~~~~~~~~~
@@ -217,12 +208,7 @@ or more columns.
           DATA step begins and can also be used in PROC statements */
    run;
 
-DataFrames can be filtered in multiple ways; the most intuitive of which is using
-:ref:`boolean indexing <indexing.boolean>`
-
-.. ipython:: python
-
-   tips[tips["total_bill"] > 10].head()
+.. include:: includes/filtering.rst
 
 If/then logic
 ~~~~~~~~~~~~~
@@ -239,18 +225,7 @@ In SAS, if/then logic can be used to create new columns.
        else bucket = 'high';
    run;
 
-The same operation in pandas can be accomplished using
-the ``where`` method from ``numpy``.
-
-.. ipython:: python
-
-   tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
-   tips.head()
-
-.. ipython:: python
-   :suppress:
-
-   tips = tips.drop("bucket", axis=1)
+.. include:: includes/if_then.rst
 
 Date functionality
 ~~~~~~~~~~~~~~~~~~
@@ -278,28 +253,7 @@ functions pandas supports other Time Series features
 not available in Base SAS (such as resampling and custom offsets) -
 see the :ref:`timeseries documentation<timeseries>` for more details.
 
-.. ipython:: python
-
-   tips["date1"] = pd.Timestamp("2013-01-15")
-   tips["date2"] = pd.Timestamp("2015-02-15")
-   tips["date1_year"] = tips["date1"].dt.year
-   tips["date2_month"] = tips["date2"].dt.month
-   tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()
-   tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
-       "date1"
-   ].dt.to_period("M")
-
-   tips[
-       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
-   ].head()
-
-.. ipython:: python
-   :suppress:
-
-   tips = tips.drop(
-       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"],
-       axis=1,
-   )
+.. include:: includes/time_date.rst
 
 Selection of columns
 ~~~~~~~~~~~~~~~~~~~~
@@ -349,14 +303,7 @@ Sorting in SAS is accomplished via ``PROC SORT``
        by sex total_bill;
    run;
 
-pandas objects have a :meth:`~DataFrame.sort_values` method, which
-takes a list of columns to sort by.
-
-.. ipython:: python
-
-   tips = tips.sort_values(["sex", "total_bill"])
-   tips.head()
-
+.. include:: includes/sorting.rst
 
 String processing
 -----------------
@@ -377,14 +324,7 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin
    put(LENGTHC(time));
    run;
 
-Python determines the length of a character string with the ``len`` function.
-``len`` includes trailing blanks.  Use ``len`` and ``rstrip`` to exclude
-trailing blanks.
-
-.. ipython:: python
-
-   tips["time"].str.len().head()
-   tips["time"].str.rstrip().str.len().head()
+.. include:: includes/length.rst
 
 
 Find
diff --git a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst
@@ -14,7 +14,7 @@ terminology and link to documentation for Excel, but much will be the same/simil
 `Apple Numbers <https://www.apple.com/mac/numbers/compatibility/functions.html>`_, and other
 Excel-compatible spreadsheet software.
 
-.. include:: comparison_boilerplate.rst
+.. include:: includes/introduction.rst
 
 Data structures
 ---------------
diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst
@@ -8,7 +8,7 @@ Since many potential pandas users have some familiarity with
 `SQL <https://en.wikipedia.org/wiki/SQL>`_, this page is meant to provide some examples of how
 various SQL operations would be performed using pandas.
 
-.. include:: comparison_boilerplate.rst
+.. include:: includes/introduction.rst
 
 Most of the examples will utilize the ``tips`` dataset found within pandas tests.  We'll read
 the data into a DataFrame called ``tips`` and assume we have a database table of the same name and
@@ -65,24 +65,9 @@ Filtering in SQL is done via a WHERE clause.
 
     SELECT *
     FROM tips
-    WHERE time = 'Dinner'
-    LIMIT 5;
-
-DataFrames can be filtered in multiple ways; the most intuitive of which is using
-:ref:`boolean indexing <indexing.boolean>`
-
-.. ipython:: python
-
-    tips[tips["time"] == "Dinner"].head(5)
-
-The above statement is simply passing a ``Series`` of True/False objects to the DataFrame,
-returning all rows with True.
-
-.. ipython:: python
+    WHERE time = 'Dinner';
 
-    is_dinner = tips["time"] == "Dinner"
-    is_dinner.value_counts()
-    tips[is_dinner].head(5)
+.. include:: includes/filtering.rst
 
 Just like SQL's OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and &
 (AND).
diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -8,7 +8,7 @@ For potential users coming from `Stata <https://en.wikipedia.org/wiki/Stata>`__
 this page is meant to demonstrate how different Stata operations would be
 performed in pandas.
 
-.. include:: comparison_boilerplate.rst
+.. include:: includes/introduction.rst
 
 .. note::
 
@@ -89,16 +89,7 @@ specifying the column names.
    5 6
    end
 
-A pandas ``DataFrame`` can be constructed in many different ways,
-but for a small number of values, it is often convenient to specify it as
-a Python dictionary, where the keys are the column names
-and the values are the data.
-
-.. ipython:: python
-
-   df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})
-   df
-
+.. include:: includes/construct_dataframe.rst
 
 Reading external data
 ~~~~~~~~~~~~~~~~~~~~~
@@ -210,12 +201,7 @@ Filtering in Stata is done with an ``if`` clause on one or more columns.
 
    list if total_bill > 10
 
-DataFrames can be filtered in multiple ways; the most intuitive of which is using
-:ref:`boolean indexing <indexing.boolean>`.
-
-.. ipython:: python
-
-   tips[tips["total_bill"] > 10].head()
+.. include:: includes/filtering.rst
 
 If/then logic
 ~~~~~~~~~~~~~
@@ -227,18 +213,7 @@ In Stata, an ``if`` clause can also be used to create new columns.
    generate bucket = "low" if total_bill < 10
    replace bucket = "high" if total_bill >= 10
 
-The same operation in pandas can be accomplished using
-the ``where`` method from ``numpy``.
-
-.. ipython:: python
-
-   tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
-   tips.head()
-
-.. ipython:: python
-   :suppress:
-
-   tips = tips.drop("bucket", axis=1)
+.. include:: includes/if_then.rst
 
 Date functionality
 ~~~~~~~~~~~~~~~~~~
@@ -266,28 +241,7 @@ functions, pandas supports other Time Series features
 not available in Stata (such as time zone handling and custom offsets) --
 see the :ref:`timeseries documentation<timeseries>` for more details.
 
-.. ipython:: python
-
-   tips["date1"] = pd.Timestamp("2013-01-15")
-   tips["date2"] = pd.Timestamp("2015-02-15")
-   tips["date1_year"] = tips["date1"].dt.year
-   tips["date2_month"] = tips["date2"].dt.month
-   tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()
-   tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
-       "date1"
-   ].dt.to_period("M")
-
-   tips[
-       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
-   ].head()
-
-.. ipython:: python
-   :suppress:
-
-   tips = tips.drop(
-       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"],
-       axis=1,
-   )
+.. include:: includes/time_date.rst
 
 Selection of columns
 ~~~~~~~~~~~~~~~~~~~~
@@ -327,14 +281,7 @@ Sorting in Stata is accomplished via ``sort``
 
    sort sex total_bill
 
-pandas objects have a :meth:`DataFrame.sort_values` method, which
-takes a list of columns to sort by.
-
-.. ipython:: python
-
-   tips = tips.sort_values(["sex", "total_bill"])
-   tips.head()
-
+.. include:: includes/sorting.rst
 
 String processing
 -----------------
@@ -350,14 +297,7 @@ Stata determines the length of a character string with the :func:`strlen` and
    generate strlen_time = strlen(time)
    generate ustrlen_time = ustrlen(time)
 
-Python determines the length of a character string with the ``len`` function.
-In Python 3, all strings are Unicode strings. ``len`` includes trailing blanks.
-Use ``len`` and ``rstrip`` to exclude trailing blanks.
-
-.. ipython:: python
-
-   tips["time"].str.len().head()
-   tips["time"].str.rstrip().str.len().head()
+.. include:: includes/length.rst
 
 
 Finding position of substring
diff --git a/doc/source/getting_started/comparison/includes/construct_dataframe.rst b/doc/source/getting_started/comparison/includes/construct_dataframe.rst
@@ -0,0 +1,9 @@
+A pandas ``DataFrame`` can be constructed in many different ways,
+but for a small number of values, it is often convenient to specify it as
+a Python dictionary, where the keys are the column names
+and the values are the data.
+
+.. ipython:: python
+
+   df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})
+   df
diff --git a/doc/source/getting_started/comparison/includes/filtering.rst b/doc/source/getting_started/comparison/includes/filtering.rst
@@ -0,0 +1,16 @@
+DataFrames can be filtered in multiple ways; the most intuitive of which is using
+:ref:`boolean indexing <indexing.boolean>`
+
+.. ipython:: python
+
+   tips[tips["total_bill"] > 10]
+
+The above statement is simply passing a ``Series`` of ``True``/``False`` objects to the DataFrame,
+returning all rows with ``True``.
+
+.. ipython:: python
+
+    is_dinner = tips["time"] == "Dinner"
+    is_dinner
+    is_dinner.value_counts()
+    tips[is_dinner]
diff --git a/doc/source/getting_started/comparison/includes/if_then.rst b/doc/source/getting_started/comparison/includes/if_then.rst
@@ -0,0 +1,12 @@
+The same operation in pandas can be accomplished using
+the ``where`` method from ``numpy``.
+
+.. ipython:: python
+
+   tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
+   tips.head()
+
+.. ipython:: python
+   :suppress:
+
+   tips = tips.drop("bucket", axis=1)
diff --git a/doc/source/getting_started/comparison/includes/introduction.rst b/doc/source/getting_started/comparison/includes/introduction.rst
diff --git a/doc/source/getting_started/comparison/includes/length.rst b/doc/source/getting_started/comparison/includes/length.rst
@@ -0,0 +1,8 @@
+Python determines the length of a character string with the ``len`` function.
+In Python 3, all strings are Unicode strings. ``len`` includes trailing blanks.
+Use ``len`` and ``rstrip`` to exclude trailing blanks.
+
+.. ipython:: python
+
+   tips["time"].str.len().head()
+   tips["time"].str.rstrip().str.len().head()
diff --git a/doc/source/getting_started/comparison/includes/sorting.rst b/doc/source/getting_started/comparison/includes/sorting.rst
@@ -0,0 +1,7 @@
+pandas objects have a :meth:`DataFrame.sort_values` method, which
+takes a list of columns to sort by.
+
+.. ipython:: python
+
+   tips = tips.sort_values(["sex", "total_bill"])
+   tips.head()
diff --git a/doc/source/getting_started/comparison/includes/time_date.rst b/doc/source/getting_started/comparison/includes/time_date.rst
@@ -0,0 +1,22 @@
+.. ipython:: python
+
+   tips["date1"] = pd.Timestamp("2013-01-15")
+   tips["date2"] = pd.Timestamp("2015-02-15")
+   tips["date1_year"] = tips["date1"].dt.year
+   tips["date2_month"] = tips["date2"].dt.month
+   tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()
+   tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
+       "date1"
+   ].dt.to_period("M")
+
+   tips[
+       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
+   ].head()
+
+.. ipython:: python
+   :suppress:
+
+   tips = tips.drop(
+       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"],
+       axis=1,
+   )
diff --git a/setup.cfg b/setup.cfg
@@ -49,7 +49,10 @@ ignore = E203,  # space before : (needed for how black formats slicing)
          E711,  # comparison to none should be 'if cond is none:'
 
 exclude =
-    doc/source/development/contributing_docstring.rst
+    doc/source/development/contributing_docstring.rst,
+    # work around issue of undefined variable warnings
+    # https://github.com/pandas-dev/pandas/pull/38837#issuecomment-752884156
+    doc/source/getting_started/comparison/includes/*.rst
 
 [tool:pytest]
 # sync minversion with setup.cfg & install.rst