From ce26fa08fd6fd8969e04dda56b5aa0e20dc1b597 Mon Sep 17 00:00:00 2001
From: Aidan Feldman
Date: Sun, 3 Jan 2021 23:01:13 -0500
Subject: [PATCH 1/5] DOC: rewrite missing value includes to be more standalone

Now they start with a sentence that sounds like a new paragraph.
---
 .../comparison/comparison_with_sas.rst        |  2 ++
 .../comparison/comparison_with_stata.rst      |  2 ++
 .../comparison/includes/missing.rst           | 31 ++++++++++++-------
 .../comparison/includes/missing_intro.rst     |  6 ++--
 4 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
index b97efe31b8b29..e59691b7c0750 100644
--- a/doc/source/getting_started/comparison/comparison_with_sas.rst
+++ b/doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -442,6 +442,8 @@ input frames.
 Missing data
 ------------

+Both pandas and SAS have a representation for missing data.
+
 .. include:: includes/missing_intro.rst

 One difference is that missing data cannot be compared to its sentinel value.
diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
index ca536e7273870..c7c3e1ca0e9b5 100644
--- a/doc/source/getting_started/comparison/comparison_with_stata.rst
+++ b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -434,6 +434,8 @@ or the intersection of the two by using the values created in the
 Missing data
 ------------

+Both pandas and Stata have a representation for missing data.
+
 .. include:: includes/missing_intro.rst

 One difference is that missing data cannot be compared to its sentinel value.
diff --git a/doc/source/getting_started/comparison/includes/missing.rst b/doc/source/getting_started/comparison/includes/missing.rst
index 8e6ba95e98036..341c7d5498d82 100644
--- a/doc/source/getting_started/comparison/includes/missing.rst
+++ b/doc/source/getting_started/comparison/includes/missing.rst
@@ -1,24 +1,31 @@
-This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions
-should be used for comparisons.
+In pandas, :meth:`Series.isna` and :meth:`Series.notna` can be used to filter the rows.

 .. ipython:: python

-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
+   outer_join[outer_join["value_x"].isna()]
+   outer_join[outer_join["value_x"].notna()]

-pandas also provides a variety of methods to work with missing data -- some of
-which would be challenging to express in Stata. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation` for more.
+pandas provides :ref:`a variety of methods to work with missing data `. Here are some examples:
+
+Drop rows with missing values
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. ipython:: python

-   # Drop rows with any missing value
    outer_join.dropna()

-   # Fill forwards
+Forward fill from previous rows
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. ipython:: python
+
    outer_join.fillna(method="ffill")

-   # Impute missing values with the mean
+Replace missing values with a specified value
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the mean:
+
+.. ipython:: python
+
    outer_join["value_x"].fillna(outer_join["value_x"].mean())
diff --git a/doc/source/getting_started/comparison/includes/missing_intro.rst b/doc/source/getting_started/comparison/includes/missing_intro.rst
index ed97f639f3f3d..366aa43d1264c 100644
--- a/doc/source/getting_started/comparison/includes/missing_intro.rst
+++ b/doc/source/getting_started/comparison/includes/missing_intro.rst
@@ -1,6 +1,6 @@
-Both have a representation for missing data — pandas' is the special float value ``NaN`` (not a
-number). Many of the semantics are the same; for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
+pandas represents missing data with the special float value ``NaN`` (not a number). Many of the
+semantics are the same; for example missing data propagates through numeric operations, and is
+ignored by default for aggregations.

 .. ipython:: python


From 93fff50d41b9c24eb3fbc97396eb49434ee156a7 Mon Sep 17 00:00:00 2001
From: Aidan Feldman
Date: Sun, 3 Jan 2021 23:13:12 -0500
Subject: [PATCH 2/5] DOC: split out more includes from comparison docs

---
 .../comparison/comparison_with_sas.rst        | 27 ++-----------------
 .../comparison/comparison_with_stata.rst      | 27 ++-----------------
 .../comparison/includes/column_operations.rst | 11 ++++++++
 .../comparison/includes/column_selection.rst  | 13 +++++++++
 4 files changed, 28 insertions(+), 50 deletions(-)
 create mode 100644 doc/source/getting_started/comparison/includes/column_operations.rst
 create mode 100644 doc/source/getting_started/comparison/includes/column_selection.rst

diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
index e59691b7c0750..cc7fc7549a5c5 100644
--- a/doc/source/getting_started/comparison/comparison_with_sas.rst
+++ b/doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -173,20 +173,8 @@ be used on new or existing columns.
        new_bill = total_bill / 2;
    run;

-pandas provides similar vectorized operations by
-specifying the individual ``Series`` in the ``DataFrame``.
-New columns can be assigned in the same way.
+.. include:: includes/column_operations.rst

-.. ipython:: python
-
-   tips["total_bill"] = tips["total_bill"] - 2
-   tips["new_bill"] = tips["total_bill"] / 2.0
-   tips.head()
-
-.. ipython:: python
-   :suppress:
-
-   tips = tips.drop("new_bill", axis=1)

 Filtering
 ~~~~~~~~~
@@ -278,18 +266,7 @@ drop, and rename columns.
        rename total_bill=total_bill_2;
    run;

-The same operations are expressed in pandas below.
-
-.. ipython:: python
-
-   # keep
-   tips[["sex", "total_bill", "tip"]].head()
-
-   # drop
-   tips.drop("sex", axis=1).head()
-
-   # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+.. include:: includes/column_selection.rst


 Sorting by values
diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
index c7c3e1ca0e9b5..93a5a10e80f74 100644
--- a/doc/source/getting_started/comparison/comparison_with_stata.rst
+++ b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -179,18 +179,8 @@ the column from the data set.
    generate new_bill = total_bill / 2
    drop new_bill

-pandas provides similar vectorized operations by
-specifying the individual ``Series`` in the ``DataFrame``.
-New columns can be assigned in the same way. The :meth:`DataFrame.drop` method
-drops a column from the ``DataFrame``.
+.. include:: includes/column_operations.rst

-.. ipython:: python
-
-   tips["total_bill"] = tips["total_bill"] - 2
-   tips["new_bill"] = tips["total_bill"] / 2
-   tips.head()
-
-   tips = tips.drop("new_bill", axis=1)

 Filtering
 ~~~~~~~~~
@@ -256,20 +246,7 @@ Stata provides keywords to select, drop, and rename columns.

    rename total_bill total_bill_2

-The same operations are expressed in pandas below. Note that in contrast to Stata, these
-operations do not happen in place. To make these changes persist, assign the operation back
-to a variable.
-
-.. ipython:: python
-
-   # keep
-   tips[["sex", "total_bill", "tip"]].head()
-
-   # drop
-   tips.drop("sex", axis=1).head()
-
-   # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+.. include:: includes/column_selection.rst


 Sorting by values
diff --git a/doc/source/getting_started/comparison/includes/column_operations.rst b/doc/source/getting_started/comparison/includes/column_operations.rst
new file mode 100644
index 0000000000000..d596d3494a357
--- /dev/null
+++ b/doc/source/getting_started/comparison/includes/column_operations.rst
@@ -0,0 +1,11 @@
+pandas provides similar vectorized operations by specifying the individual ``Series`` in the
+``DataFrame``. New columns can be assigned in the same way. The :meth:`DataFrame.drop` method drops
+a column from the ``DataFrame``.
+
+.. ipython:: python
+
+   tips["total_bill"] = tips["total_bill"] - 2
+   tips["new_bill"] = tips["total_bill"] / 2
+   tips.head()
+
+   tips = tips.drop("new_bill", axis=1)
diff --git a/doc/source/getting_started/comparison/includes/column_selection.rst b/doc/source/getting_started/comparison/includes/column_selection.rst
new file mode 100644
index 0000000000000..c81fb3e85b003
--- /dev/null
+++ b/doc/source/getting_started/comparison/includes/column_selection.rst
@@ -0,0 +1,13 @@
+The same operations are expressed in pandas below. Note that these operations do not happen in
+place. To make these changes persist, assign the operation back to a variable.
+
+.. ipython:: python
+
+   # keep
+   tips[["sex", "total_bill", "tip"]].head()
+
+   # drop
+   tips.drop("sex", axis=1).head()
+
+   # rename
+   tips.rename(columns={"total_bill": "total_bill_2"}).head()

From ff640aad378a993dac85ab21b4abdae2f16c142e Mon Sep 17 00:00:00 2001
From: Aidan Feldman
Date: Sun, 3 Jan 2021 20:51:36 -0500
Subject: [PATCH 3/5] DOC: remove use of head() in the comparison docs

This helps to clarify the examples by removing code that isn't relevant.
Added a dedicated section to the SAS, SQL, and Stata pages.
---
 .../comparison/comparison_with_sas.rst        | 27 ++++++++++---------
 .../comparison/comparison_with_sql.rst        | 26 +++++++++++++-----
 .../comparison/comparison_with_stata.rst      | 24 +++++++++--------
 .../comparison/includes/column_operations.rst |  2 +-
 .../comparison/includes/column_selection.rst  |  6 ++---
 .../comparison/includes/extract_substring.rst |  2 +-
 .../comparison/includes/find_substring.rst    |  2 +-
 .../comparison/includes/groupby.rst           |  2 +-
 .../comparison/includes/if_then.rst           |  2 +-
 .../comparison/includes/length.rst            |  4 +--
 .../comparison/includes/limit.rst             |  7 +++++
 .../comparison/includes/sorting.rst           |  2 +-
 .../comparison/includes/time_date.rst         |  2 +-
 .../comparison/includes/transform.rst         |  2 +-
 14 files changed, 67 insertions(+), 43 deletions(-)
 create mode 100644 doc/source/getting_started/comparison/includes/limit.rst

diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
index cc7fc7549a5c5..2b316cccb7fc9 100644
--- a/doc/source/getting_started/comparison/comparison_with_sas.rst
+++ b/doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -4,23 +4,13 @@

 Comparison with SAS
 ********************
+
 For potential users coming from `SAS `__
 this page is meant to demonstrate how different SAS operations would be
 performed in pandas.

 .. include:: includes/introduction.rst

-.. note::
-
-   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
-   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
-   This is often used in interactive work (e.g. `Jupyter notebook
-   `_ or terminal) - the equivalent in SAS would be:
-
-   .. code-block:: sas
-
-      proc print data=df(obs=5);
-      run;

 Data structures
 ---------------
@@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly.
        "pandas/master/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
-   tips.head()
+   tips


 Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
@@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats
 such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*``
 function. See the :ref:`IO documentation` for more details.

+Limiting output
+~~~~~~~~~~~~~~~
+
+.. include:: includes/limit.rst
+
+The equivalent in SAS would be:
+
+.. code-block:: sas
+
+   proc print data=df(obs=5);
+   run;
+
+
 Exporting data
 ~~~~~~~~~~~~~~

diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst
index 52799442d6118..685aea6334556 100644
--- a/doc/source/getting_started/comparison/comparison_with_sql.rst
+++ b/doc/source/getting_started/comparison/comparison_with_sql.rst
@@ -21,7 +21,7 @@ structure.
         "/pandas/master/pandas/tests/io/data/csv/tips.csv"
     )
     tips = pd.read_csv(url)
-    tips.head()
+    tips

 SELECT
 ------
@@ -31,14 +31,13 @@ to select all columns):
 .. code-block:: sql

     SELECT total_bill, tip, smoker, time
-    FROM tips
-    LIMIT 5;
+    FROM tips;

 With pandas, column selection is done by passing a list of column names to your DataFrame:

 .. ipython:: python

-    tips[["total_bill", "tip", "smoker", "time"]].head(5)
+    tips[["total_bill", "tip", "smoker", "time"]]

 Calling the DataFrame without the list of column names would display all columns
 (akin to SQL's ``*``).
@@ -48,14 +47,13 @@ In SQL, you can add a calculated column:
 .. code-block:: sql

     SELECT *, tip/total_bill as tip_rate
-    FROM tips
-    LIMIT 5;
+    FROM tips;

 With pandas, you can use the :meth:`DataFrame.assign` method of a DataFrame to append a new column:

 .. ipython:: python

-    tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)
+    tips.assign(tip_rate=tips["tip"] / tips["total_bill"])

 WHERE
 -----
@@ -368,6 +366,20 @@ In pandas, you can use :meth:`~pandas.concat` in conjunction with

     pd.concat([df1, df2]).drop_duplicates()

+
+LIMIT
+-----
+
+.. code-block:: sql
+
+    SELECT * FROM tips
+    LIMIT 10;
+
+.. ipython:: python
+
+    tips.head(10)
+
+
 pandas equivalents for some SQL analytic and aggregate functions
 ----------------------------------------------------------------

diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
index 93a5a10e80f74..4520c6acf2ecc 100644
--- a/doc/source/getting_started/comparison/comparison_with_stata.rst
+++ b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -10,16 +10,6 @@ performed in pandas.

 .. include:: includes/introduction.rst

-.. note::
-
-   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
-   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
-   This is often used in interactive work (e.g. `Jupyter notebook
-   `_ or terminal) -- the equivalent in Stata would be:
-
-   .. code-block:: stata
-
-      list in 1/5

 Data structures
 ---------------
@@ -116,7 +106,7 @@ the data set if presented with a url.
        "/pandas/master/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
-   tips.head()
+   tips

 Like ``import delimited``, :func:`read_csv` can take a number of parameters to specify
 how the data should be parsed. For example, if the data were instead tab delimited,
@@ -141,6 +131,18 @@ such as Excel, SAS, HDF5, Parquet, and SQL databases. These are all read via a
 function. See the :ref:`IO documentation` for more details.


+Limiting output
+~~~~~~~~~~~~~~~
+
+.. include:: includes/limit.rst
+
+The equivalent in Stata would be:
+
+.. code-block:: stata
+
+   list in 1/5
+
+
 Exporting data
 ~~~~~~~~~~~~~~

diff --git a/doc/source/getting_started/comparison/includes/column_operations.rst b/doc/source/getting_started/comparison/includes/column_operations.rst
index d596d3494a357..bc5db8e6b8038 100644
--- a/doc/source/getting_started/comparison/includes/column_operations.rst
+++ b/doc/source/getting_started/comparison/includes/column_operations.rst
@@ -6,6 +6,6 @@ a column from the ``DataFrame``.

    tips["total_bill"] = tips["total_bill"] - 2
    tips["new_bill"] = tips["total_bill"] / 2
-   tips.head()
+   tips

    tips = tips.drop("new_bill", axis=1)
diff --git a/doc/source/getting_started/comparison/includes/column_selection.rst b/doc/source/getting_started/comparison/includes/column_selection.rst
index c81fb3e85b003..d623a92b8af6c 100644
--- a/doc/source/getting_started/comparison/includes/column_selection.rst
+++ b/doc/source/getting_started/comparison/includes/column_selection.rst
@@ -4,10 +4,10 @@ place. To make these changes persist, assign the operation back to a variable.
 .. ipython:: python

    # keep
-   tips[["sex", "total_bill", "tip"]].head()
+   tips[["sex", "total_bill", "tip"]]

    # drop
-   tips.drop("sex", axis=1).head()
+   tips.drop("sex", axis=1)

    # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+   tips.rename(columns={"total_bill": "total_bill_2"})
diff --git a/doc/source/getting_started/comparison/includes/extract_substring.rst b/doc/source/getting_started/comparison/includes/extract_substring.rst
index 78eee286ad467..1ba0dfac2317a 100644
--- a/doc/source/getting_started/comparison/includes/extract_substring.rst
+++ b/doc/source/getting_started/comparison/includes/extract_substring.rst
@@ -4,4 +4,4 @@ indexes are zero-based.

 .. ipython:: python

-   tips["sex"].str[0:1].head()
+   tips["sex"].str[0:1]
diff --git a/doc/source/getting_started/comparison/includes/find_substring.rst b/doc/source/getting_started/comparison/includes/find_substring.rst
index ee940b64f5cae..42543d05a0014 100644
--- a/doc/source/getting_started/comparison/includes/find_substring.rst
+++ b/doc/source/getting_started/comparison/includes/find_substring.rst
@@ -5,4 +5,4 @@ zero-based.

 .. ipython:: python

-   tips["sex"].str.find("ale").head()
+   tips["sex"].str.find("ale")
diff --git a/doc/source/getting_started/comparison/includes/groupby.rst b/doc/source/getting_started/comparison/includes/groupby.rst
index caa9f6ec9c9b8..93d5d51e3fb00 100644
--- a/doc/source/getting_started/comparison/includes/groupby.rst
+++ b/doc/source/getting_started/comparison/includes/groupby.rst
@@ -4,4 +4,4 @@ pandas provides a flexible ``groupby`` mechanism that allows similar aggregation
 .. ipython:: python

    tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+   tips_summed
diff --git a/doc/source/getting_started/comparison/includes/if_then.rst b/doc/source/getting_started/comparison/includes/if_then.rst
index d7977366cfc33..f94e7588827f5 100644
--- a/doc/source/getting_started/comparison/includes/if_then.rst
+++ b/doc/source/getting_started/comparison/includes/if_then.rst
@@ -4,7 +4,7 @@ the ``where`` method from ``numpy``.
 .. ipython:: python

    tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
-   tips.head()
+   tips

 .. ipython:: python
    :suppress:
diff --git a/doc/source/getting_started/comparison/includes/length.rst b/doc/source/getting_started/comparison/includes/length.rst
index 5a0c803e9eff2..9141fd4ea582a 100644
--- a/doc/source/getting_started/comparison/includes/length.rst
+++ b/doc/source/getting_started/comparison/includes/length.rst
@@ -4,5 +4,5 @@ Use ``len`` and ``rstrip`` to exclude trailing blanks.

 .. ipython:: python

-   tips["time"].str.len().head()
-   tips["time"].str.rstrip().str.len().head()
+   tips["time"].str.len()
+   tips["time"].str.rstrip().str.len()
diff --git a/doc/source/getting_started/comparison/includes/limit.rst b/doc/source/getting_started/comparison/includes/limit.rst
new file mode 100644
index 0000000000000..4efeb4e43d07c
--- /dev/null
+++ b/doc/source/getting_started/comparison/includes/limit.rst
@@ -0,0 +1,7 @@
+By default, pandas will truncate output of large ``DataFrame``\s to show the first and last rows.
+This can be overridden by :ref:`changing the pandas options `, or using
+:meth:`DataFrame.head` or :meth:`DataFrame.tail`.
+
+.. ipython:: python
+
+   tips.head(5)
diff --git a/doc/source/getting_started/comparison/includes/sorting.rst b/doc/source/getting_started/comparison/includes/sorting.rst
index 0840c9dd554b7..4e2e40a18adbd 100644
--- a/doc/source/getting_started/comparison/includes/sorting.rst
+++ b/doc/source/getting_started/comparison/includes/sorting.rst
@@ -3,4 +3,4 @@ pandas has a :meth:`DataFrame.sort_values` method, which takes a list of columns
 .. ipython:: python

    tips = tips.sort_values(["sex", "total_bill"])
-   tips.head()
+   tips
diff --git a/doc/source/getting_started/comparison/includes/time_date.rst b/doc/source/getting_started/comparison/includes/time_date.rst
index 12a00b36dc97d..fb9ee2e216cd7 100644
--- a/doc/source/getting_started/comparison/includes/time_date.rst
+++ b/doc/source/getting_started/comparison/includes/time_date.rst
@@ -11,7 +11,7 @@

    tips[
        ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
-   ].head()
+   ]

 .. ipython:: python
    :suppress:
diff --git a/doc/source/getting_started/comparison/includes/transform.rst b/doc/source/getting_started/comparison/includes/transform.rst
index 0aa5b5b298cf7..b7599471432ad 100644
--- a/doc/source/getting_started/comparison/includes/transform.rst
+++ b/doc/source/getting_started/comparison/includes/transform.rst
@@ -5,4 +5,4 @@ succinctly expressed in one operation.

    gb = tips.groupby("smoker")["total_bill"]
    tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+   tips

From 8bd61713250a63605310be79e3ee574267e64dc9 Mon Sep 17 00:00:00 2001
From: Aidan Feldman
Date: Sun, 3 Jan 2021 23:39:03 -0500
Subject: [PATCH 4/5] DOC: fix faulty include in Stata comparison page

---
 doc/source/getting_started/comparison/comparison_with_stata.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
index 4520c6acf2ecc..43cb775b5461d 100644
--- a/doc/source/getting_started/comparison/comparison_with_stata.rst
+++ b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -407,7 +407,7 @@ or the intersection of the two by using the values created in the
    restore
    merge 1:n key using df2.dta

-.. include:: includes/merge_setup.rst
+.. include:: includes/merge.rst


 Missing data

From 199a04cf65bc8a1f210f3379738d7ae59a4e5d38 Mon Sep 17 00:00:00 2001
From: Aidan Feldman
Date: Mon, 4 Jan 2021 00:08:02 -0500
Subject: [PATCH 5/5] DOC: add headings to the column selection include

Easier to see what's happening in each line of code, and make navigation easier.
---
 .../comparison/includes/column_selection.rst | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/doc/source/getting_started/comparison/includes/column_selection.rst b/doc/source/getting_started/comparison/includes/column_selection.rst
index d623a92b8af6c..b925af1294f54 100644
--- a/doc/source/getting_started/comparison/includes/column_selection.rst
+++ b/doc/source/getting_started/comparison/includes/column_selection.rst
@@ -1,13 +1,23 @@
 The same operations are expressed in pandas below. Note that these operations do not happen in
 place. To make these changes persist, assign the operation back to a variable.

+Keep certain columns
+''''''''''''''''''''
+
 .. ipython:: python

-   # keep
    tips[["sex", "total_bill", "tip"]]

-   # drop
+Drop a column
+'''''''''''''
+
+.. ipython:: python
+
    tips.drop("sex", axis=1)

-   # rename
+Rename a column
+'''''''''''''''
+
+.. ipython:: python
+
    tips.rename(columns={"total_bill": "total_bill_2"})
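
Review note: the missing-data snippets reworked in patch 1 can be tried outside the docs build with a minimal sketch like the one below. The small ``outer_join`` frame here is a hypothetical stand-in (the docs build it from an earlier merge example), and ``DataFrame.ffill()`` is used as an assumed equivalent of ``fillna(method="ffill")``, since newer pandas releases deprecate the ``method`` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ``outer_join`` frame that the docs create
# in the merge section; the values are made up for illustration.
outer_join = pd.DataFrame(
    {
        "key": ["A", "B", "C", "D"],
        "value_x": [1.0, np.nan, 3.0, np.nan],
        "value_y": [np.nan, 2.0, 4.0, 6.0],
    }
)

# Filter rows on missing / non-missing values (Series.isna / Series.notna).
missing_rows = outer_join[outer_join["value_x"].isna()]
present_rows = outer_join[outer_join["value_x"].notna()]

# Drop rows that contain any missing value.
complete_rows = outer_join.dropna()

# Forward fill from previous rows (same intent as fillna(method="ffill")).
forward_filled = outer_join.ffill()

# Replace missing values with a specified value, here the column mean.
mean_filled = outer_join["value_x"].fillna(outer_join["value_x"].mean())
```

None of these calls modify ``outer_join`` in place; each returns a new object, which matches the note in ``column_selection.rst`` about assigning results back to a variable.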