pandas-dev · jreback · Jan 4, 2021
diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -4,23 +4,13 @@
 
 Comparison with SAS
 ********************
+
 For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__
 this page is meant to demonstrate how different SAS operations would be
 performed in pandas.
 
 .. include:: includes/introduction.rst
 
-.. note::
-
-   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
-   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
-   This is often used in interactive work (e.g. `Jupyter notebook
-   <https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be:
-
-   .. code-block:: sas
-
-      proc print data=df(obs=5);
-      run;
 
 Data structures
 ---------------
@@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly.
        "pandas/master/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
-   tips.head()
+   tips
 
 
 Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
@@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats
 such as Excel, HDF5, and SQL databases.  These are all read via a ``pd.read_*``
 function.  See the :ref:`IO documentation<io>` for more details.
 
+Limiting output
+~~~~~~~~~~~~~~~
+
+.. include:: includes/limit.rst
+
+The equivalent in SAS would be:
+
+.. code-block:: sas
+
+   proc print data=df(obs=5);
+   run;
+
+
 Exporting data
 ~~~~~~~~~~~~~~
 
@@ -173,20 +176,8 @@ be used on new or existing columns.
        new_bill = total_bill / 2;
    run;
 
-pandas provides similar vectorized operations by
-specifying the individual ``Series`` in the ``DataFrame``.
-New columns can be assigned in the same way.
+.. include:: includes/column_operations.rst
 
-.. ipython:: python
-
-   tips["total_bill"] = tips["total_bill"] - 2
-   tips["new_bill"] = tips["total_bill"] / 2.0
-   tips.head()
-
-.. ipython:: python
-   :suppress:
-
-   tips = tips.drop("new_bill", axis=1)
 
 Filtering
 ~~~~~~~~~
@@ -278,18 +269,7 @@ drop, and rename columns.
        rename total_bill=total_bill_2;
    run;
 
-The same operations are expressed in pandas below.
-
-.. ipython:: python
-
-   # keep
-   tips[["sex", "total_bill", "tip"]].head()
-
-   # drop
-   tips.drop("sex", axis=1).head()
-
-   # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+.. include:: includes/column_selection.rst
 
 
 Sorting by values
@@ -442,6 +422,8 @@ input frames.
 Missing data
 ------------
 
+Both pandas and SAS have a representation for missing data.
+
 .. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.

diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst
@@ -21,7 +21,7 @@ structure.
         "/pandas/master/pandas/tests/io/data/csv/tips.csv"
     )
     tips = pd.read_csv(url)
-    tips.head()
+    tips
 
 SELECT
 ------
@@ -31,14 +31,13 @@ to select all columns):
 .. code-block:: sql
 
     SELECT total_bill, tip, smoker, time
-    FROM tips
-    LIMIT 5;
+    FROM tips;
 
 With pandas, column selection is done by passing a list of column names to your DataFrame:
 
 .. ipython:: python
 
-    tips[["total_bill", "tip", "smoker", "time"]].head(5)
+    tips[["total_bill", "tip", "smoker", "time"]]
 
 Calling the DataFrame without the list of column names would display all columns (akin to SQL's
 ``*``).
@@ -48,14 +47,13 @@ In SQL, you can add a calculated column:
 .. code-block:: sql
 
     SELECT *, tip/total_bill as tip_rate
-    FROM tips
-    LIMIT 5;
+    FROM tips;
 
 With pandas, you can use the :meth:`DataFrame.assign` method of a DataFrame to append a new column:
 
 .. ipython:: python
 
-    tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)
+    tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
 
 WHERE
 -----
@@ -368,6 +366,20 @@ In pandas, you can use :meth:`~pandas.concat` in conjunction with
 
     pd.concat([df1, df2]).drop_duplicates()
 
+
+LIMIT
+-----
+
+.. code-block:: sql
+
+    SELECT * FROM tips
+    LIMIT 10;
+
+.. ipython:: python
+
+    tips.head(10)
+
+
 pandas equivalents for some SQL analytic and aggregate functions
 ----------------------------------------------------------------
 

diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -10,16 +10,6 @@ performed in pandas.
 
 .. include:: includes/introduction.rst
 
-.. note::
-
-   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
-   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
-   This is often used in interactive work (e.g. `Jupyter notebook
-   <https://jupyter.org/>`_ or terminal) -- the equivalent in Stata would be:
-
-   .. code-block:: stata
-
-      list in 1/5
 
 Data structures
 ---------------
@@ -116,7 +106,7 @@ the data set if presented with a url.
        "/pandas/master/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
-   tips.head()
+   tips
 
 Like ``import delimited``, :func:`read_csv` can take a number of parameters to specify
 how the data should be parsed.  For example, if the data were instead tab delimited,
@@ -141,6 +131,18 @@ such as Excel, SAS, HDF5, Parquet, and SQL databases.  These are all read via a
 function.  See the :ref:`IO documentation<io>` for more details.
 
 
+Limiting output
+~~~~~~~~~~~~~~~
+
+.. include:: includes/limit.rst
+
+The equivalent in Stata would be:
+
+.. code-block:: stata
+
+   list in 1/5
+
+
 Exporting data
 ~~~~~~~~~~~~~~
 
@@ -179,18 +181,8 @@ the column from the data set.
    generate new_bill = total_bill / 2
    drop new_bill
 
-pandas provides similar vectorized operations by
-specifying the individual ``Series`` in the ``DataFrame``.
-New columns can be assigned in the same way. The :meth:`DataFrame.drop` method
-drops a column from the ``DataFrame``.
+.. include:: includes/column_operations.rst
 
-.. ipython:: python
-
-   tips["total_bill"] = tips["total_bill"] - 2
-   tips["new_bill"] = tips["total_bill"] / 2
-   tips.head()
-
-   tips = tips.drop("new_bill", axis=1)
 
 Filtering
 ~~~~~~~~~
@@ -256,20 +248,7 @@ Stata provides keywords to select, drop, and rename columns.
 
    rename total_bill total_bill_2
 
-The same operations are expressed in pandas below. Note that in contrast to Stata, these
-operations do not happen in place. To make these changes persist, assign the operation back
-to a variable.
-
-.. ipython:: python
-
-   # keep
-   tips[["sex", "total_bill", "tip"]].head()
-
-   # drop
-   tips.drop("sex", axis=1).head()
-
-   # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+.. include:: includes/column_selection.rst
 
 
 Sorting by values
@@ -428,12 +407,14 @@ or the intersection of the two by using the values created in the
    restore
    merge 1:n key using df2.dta
 
-.. include:: includes/merge_setup.rst
+.. include:: includes/merge.rst
 
 
 Missing data
 ------------
 
+Both pandas and Stata have a representation for missing data.
+
 .. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.

diff --git a/doc/source/getting_started/comparison/includes/column_operations.rst b/doc/source/getting_started/comparison/includes/column_operations.rst
@@ -0,0 +1,11 @@
+pandas provides similar vectorized operations by specifying the individual ``Series`` in the
+``DataFrame``. New columns can be assigned in the same way. The :meth:`DataFrame.drop` method drops
+a column from the ``DataFrame``.
+
+.. ipython:: python
+
+   tips["total_bill"] = tips["total_bill"] - 2
+   tips["new_bill"] = tips["total_bill"] / 2
+   tips
+
+   tips = tips.drop("new_bill", axis=1)
diff --git a/doc/source/getting_started/comparison/includes/column_selection.rst b/doc/source/getting_started/comparison/includes/column_selection.rst
@@ -0,0 +1,23 @@
+The same operations are expressed in pandas below. Note that these operations do not happen in
+place. To make these changes persist, assign the operation back to a variable.
+
+Keep certain columns
+''''''''''''''''''''
+
+.. ipython:: python
+
+   tips[["sex", "total_bill", "tip"]]
+
+Drop a column
+'''''''''''''
+
+.. ipython:: python
+
+   tips.drop("sex", axis=1)
+
+Rename a column
+'''''''''''''''
+
+.. ipython:: python
+
+   tips.rename(columns={"total_bill": "total_bill_2"})
diff --git a/doc/source/getting_started/comparison/includes/extract_substring.rst b/doc/source/getting_started/comparison/includes/extract_substring.rst
@@ -4,4 +4,4 @@ indexes are zero-based.
 
 .. ipython:: python
 
-   tips["sex"].str[0:1].head()
+   tips["sex"].str[0:1]
diff --git a/doc/source/getting_started/comparison/includes/find_substring.rst b/doc/source/getting_started/comparison/includes/find_substring.rst
@@ -5,4 +5,4 @@ zero-based.
 
 .. ipython:: python
 
-   tips["sex"].str.find("ale").head()
+   tips["sex"].str.find("ale")
diff --git a/doc/source/getting_started/comparison/includes/groupby.rst b/doc/source/getting_started/comparison/includes/groupby.rst
@@ -4,4 +4,4 @@ pandas provides a flexible ``groupby`` mechanism that allows similar aggregation
 .. ipython:: python
 
    tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+   tips_summed
diff --git a/doc/source/getting_started/comparison/includes/if_then.rst b/doc/source/getting_started/comparison/includes/if_then.rst
@@ -4,7 +4,7 @@ the ``where`` method from ``numpy``.
 .. ipython:: python
 
    tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
-   tips.head()
+   tips
 
 .. ipython:: python
    :suppress:

diff --git a/doc/source/getting_started/comparison/includes/length.rst b/doc/source/getting_started/comparison/includes/length.rst
@@ -4,5 +4,5 @@ Use ``len`` and ``rstrip`` to exclude trailing blanks.
 
 .. ipython:: python
 
-   tips["time"].str.len().head()
-   tips["time"].str.rstrip().str.len().head()
+   tips["time"].str.len()
+   tips["time"].str.rstrip().str.len()
diff --git a/doc/source/getting_started/comparison/includes/limit.rst b/doc/source/getting_started/comparison/includes/limit.rst
@@ -0,0 +1,7 @@
+By default, pandas will truncate output of large ``DataFrame``\s to show the first and last rows.
+This can be overridden by :ref:`changing the pandas options <options>`, or using
+:meth:`DataFrame.head` or :meth:`DataFrame.tail`.
+
+.. ipython:: python
+
+   tips.head(5)
diff --git a/doc/source/getting_started/comparison/includes/missing.rst b/doc/source/getting_started/comparison/includes/missing.rst
@@ -1,24 +1,31 @@
-This doesn't work in pandas.  Instead, the :func:`pd.isna` or :func:`pd.notna` functions
-should be used for comparisons.
+In pandas, :meth:`Series.isna` and :meth:`Series.notna` can be used to filter the rows.
 
 .. ipython:: python
 
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
+   outer_join[outer_join["value_x"].isna()]
+   outer_join[outer_join["value_x"].notna()]
 
-pandas also provides a variety of methods to work with missing data -- some of
-which would be challenging to express in Stata. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
+pandas provides :ref:`a variety of methods to work with missing data <missing_data>`. Here are some examples:
+
+Drop rows with missing values
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. ipython:: python
 
-   # Drop rows with any missing value
    outer_join.dropna()
 
-   # Fill forwards
+Forward fill from previous rows
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. ipython:: python
+
    outer_join.fillna(method="ffill")
 
-   # Impute missing values with the mean
+Replace missing values with a specified value
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the mean:
+
+.. ipython:: python
+
    outer_join["value_x"].fillna(outer_join["value_x"].mean())
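
The reworked ``missing.rst`` include above can be tried outside the docs build with a minimal sketch like the one below. The small ``outer_join`` frame here is a hypothetical stand-in (its keys and values are invented for illustration, not the frame the docs build from their merge example), and ``DataFrame.ffill()`` is used for the forward fill because recent pandas releases favor it over ``fillna(method="ffill")``.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the docs' ``outer_join`` frame (values invented).
outer_join = pd.DataFrame(
    {
        "key": ["A", "B", "C", "D"],
        "value_x": [1.0, np.nan, 3.0, np.nan],
        "value_y": [np.nan, 2.0, 4.0, 6.0],
    }
)

# Filter rows on missingness with the Series methods instead of pd.isna/pd.notna.
missing_x = outer_join[outer_join["value_x"].isna()]
present_x = outer_join[outer_join["value_x"].notna()]

# Drop every row that has at least one missing value.
complete_rows = outer_join.dropna()

# Forward fill from previous rows; a leading NaN has nothing to copy and stays NaN.
forward_filled = outer_join.ffill()

# Replace missing values in one column with that column's mean (NaN is skipped).
mean_filled = outer_join["value_x"].fillna(outer_join["value_x"].mean())

print(missing_x, present_x, complete_rows, forward_filled, mean_filled, sep="\n\n")
```

None of these calls modify ``outer_join`` in place; each returns a new object, so assign the result back if the change should persist.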