diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst index b97efe31b8b29..2b316cccb7fc9 100644 --- a/doc/source/getting_started/comparison/comparison_with_sas.rst +++ b/doc/source/getting_started/comparison/comparison_with_sas.rst @@ -4,23 +4,13 @@ Comparison with SAS ******************** + For potential users coming from `SAS `__ this page is meant to demonstrate how different SAS operations would be performed in pandas. .. include:: includes/introduction.rst -.. note:: - - Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling - ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``. - This is often used in interactive work (e.g. `Jupyter notebook - `_ or terminal) - the equivalent in SAS would be: - - .. code-block:: sas - - proc print data=df(obs=5); - run; Data structures --------------- @@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly. "pandas/master/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) - tips.head() + tips Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify @@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*`` function. See the :ref:`IO documentation` for more details. +Limiting output +~~~~~~~~~~~~~~~ + +.. include:: includes/limit.rst + +The equivalent in SAS would be: + +.. code-block:: sas + + proc print data=df(obs=5); + run; + + Exporting data ~~~~~~~~~~~~~~ @@ -173,20 +176,8 @@ be used on new or existing columns. new_bill = total_bill / 2; run; -pandas provides similar vectorized operations by -specifying the individual ``Series`` in the ``DataFrame``. -New columns can be assigned in the same way. +.. include:: includes/column_operations.rst -.. 
ipython:: python - - tips["total_bill"] = tips["total_bill"] - 2 - tips["new_bill"] = tips["total_bill"] / 2.0 - tips.head() - -.. ipython:: python - :suppress: - - tips = tips.drop("new_bill", axis=1) Filtering ~~~~~~~~~ @@ -278,18 +269,7 @@ drop, and rename columns. rename total_bill=total_bill_2; run; -The same operations are expressed in pandas below. - -.. ipython:: python - - # keep - tips[["sex", "total_bill", "tip"]].head() - - # drop - tips.drop("sex", axis=1).head() - - # rename - tips.rename(columns={"total_bill": "total_bill_2"}).head() +.. include:: includes/column_selection.rst Sorting by values @@ -442,6 +422,8 @@ input frames. Missing data ------------ +Both pandas and SAS have a representation for missing data. + .. include:: includes/missing_intro.rst One difference is that missing data cannot be compared to its sentinel value. diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst index 52799442d6118..685aea6334556 100644 --- a/doc/source/getting_started/comparison/comparison_with_sql.rst +++ b/doc/source/getting_started/comparison/comparison_with_sql.rst @@ -21,7 +21,7 @@ structure. "/pandas/master/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) - tips.head() + tips SELECT ------ @@ -31,14 +31,13 @@ to select all columns): .. code-block:: sql SELECT total_bill, tip, smoker, time - FROM tips - LIMIT 5; + FROM tips; With pandas, column selection is done by passing a list of column names to your DataFrame: .. ipython:: python - tips[["total_bill", "tip", "smoker", "time"]].head(5) + tips[["total_bill", "tip", "smoker", "time"]] Calling the DataFrame without the list of column names would display all columns (akin to SQL's ``*``). @@ -48,14 +47,13 @@ In SQL, you can add a calculated column: .. 
code-block:: sql SELECT *, tip/total_bill as tip_rate - FROM tips - LIMIT 5; + FROM tips; With pandas, you can use the :meth:`DataFrame.assign` method of a DataFrame to append a new column: .. ipython:: python - tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5) + tips.assign(tip_rate=tips["tip"] / tips["total_bill"]) WHERE ----- @@ -368,6 +366,20 @@ In pandas, you can use :meth:`~pandas.concat` in conjunction with pd.concat([df1, df2]).drop_duplicates() + +LIMIT +----- + +.. code-block:: sql + + SELECT * FROM tips + LIMIT 10; + +.. ipython:: python + + tips.head(10) + + pandas equivalents for some SQL analytic and aggregate functions ---------------------------------------------------------------- diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst index ca536e7273870..43cb775b5461d 100644 --- a/doc/source/getting_started/comparison/comparison_with_stata.rst +++ b/doc/source/getting_started/comparison/comparison_with_stata.rst @@ -10,16 +10,6 @@ performed in pandas. .. include:: includes/introduction.rst -.. note:: - - Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling - ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``. - This is often used in interactive work (e.g. `Jupyter notebook - `_ or terminal) -- the equivalent in Stata would be: - - .. code-block:: stata - - list in 1/5 Data structures --------------- @@ -116,7 +106,7 @@ the data set if presented with a url. "/pandas/master/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) - tips.head() + tips Like ``import delimited``, :func:`read_csv` can take a number of parameters to specify how the data should be parsed. For example, if the data were instead tab delimited, @@ -141,6 +131,18 @@ such as Excel, SAS, HDF5, Parquet, and SQL databases. These are all read via a function. See the :ref:`IO documentation` for more details. 
+Limiting output +~~~~~~~~~~~~~~~ + +.. include:: includes/limit.rst + +The equivalent in Stata would be: + +.. code-block:: stata + + list in 1/5 + + Exporting data ~~~~~~~~~~~~~~ @@ -179,18 +181,8 @@ the column from the data set. generate new_bill = total_bill / 2 drop new_bill -pandas provides similar vectorized operations by -specifying the individual ``Series`` in the ``DataFrame``. -New columns can be assigned in the same way. The :meth:`DataFrame.drop` method -drops a column from the ``DataFrame``. +.. include:: includes/column_operations.rst -.. ipython:: python - - tips["total_bill"] = tips["total_bill"] - 2 - tips["new_bill"] = tips["total_bill"] / 2 - tips.head() - - tips = tips.drop("new_bill", axis=1) Filtering ~~~~~~~~~ @@ -256,20 +248,7 @@ Stata provides keywords to select, drop, and rename columns. rename total_bill total_bill_2 -The same operations are expressed in pandas below. Note that in contrast to Stata, these -operations do not happen in place. To make these changes persist, assign the operation back -to a variable. - -.. ipython:: python - - # keep - tips[["sex", "total_bill", "tip"]].head() - - # drop - tips.drop("sex", axis=1).head() - - # rename - tips.rename(columns={"total_bill": "total_bill_2"}).head() +.. include:: includes/column_selection.rst Sorting by values @@ -428,12 +407,14 @@ or the intersection of the two by using the values created in the restore merge 1:n key using df2.dta -.. include:: includes/merge_setup.rst +.. include:: includes/merge.rst Missing data ------------ +Both pandas and Stata have a representation for missing data. + .. include:: includes/missing_intro.rst One difference is that missing data cannot be compared to its sentinel value. 
diff --git a/doc/source/getting_started/comparison/includes/column_operations.rst b/doc/source/getting_started/comparison/includes/column_operations.rst new file mode 100644 index 0000000000000..bc5db8e6b8038 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/column_operations.rst @@ -0,0 +1,11 @@ +pandas provides similar vectorized operations by specifying the individual ``Series`` in the +``DataFrame``. New columns can be assigned in the same way. The :meth:`DataFrame.drop` method drops +a column from the ``DataFrame``. + +.. ipython:: python + + tips["total_bill"] = tips["total_bill"] - 2 + tips["new_bill"] = tips["total_bill"] / 2 + tips + + tips = tips.drop("new_bill", axis=1) diff --git a/doc/source/getting_started/comparison/includes/column_selection.rst b/doc/source/getting_started/comparison/includes/column_selection.rst new file mode 100644 index 0000000000000..b925af1294f54 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/column_selection.rst @@ -0,0 +1,23 @@ +The same operations are expressed in pandas below. Note that these operations do not happen in +place. To make these changes persist, assign the operation back to a variable. + +Keep certain columns +'''''''''''''''''''' + +.. ipython:: python + + tips[["sex", "total_bill", "tip"]] + +Drop a column +''''''''''''' + +.. ipython:: python + + tips.drop("sex", axis=1) + +Rename a column +''''''''''''''' + +.. ipython:: python + + tips.rename(columns={"total_bill": "total_bill_2"}) diff --git a/doc/source/getting_started/comparison/includes/extract_substring.rst b/doc/source/getting_started/comparison/includes/extract_substring.rst index 78eee286ad467..1ba0dfac2317a 100644 --- a/doc/source/getting_started/comparison/includes/extract_substring.rst +++ b/doc/source/getting_started/comparison/includes/extract_substring.rst @@ -4,4 +4,4 @@ indexes are zero-based. .. 
ipython:: python - tips["sex"].str[0:1].head() + tips["sex"].str[0:1] diff --git a/doc/source/getting_started/comparison/includes/find_substring.rst b/doc/source/getting_started/comparison/includes/find_substring.rst index ee940b64f5cae..42543d05a0014 100644 --- a/doc/source/getting_started/comparison/includes/find_substring.rst +++ b/doc/source/getting_started/comparison/includes/find_substring.rst @@ -5,4 +5,4 @@ zero-based. .. ipython:: python - tips["sex"].str.find("ale").head() + tips["sex"].str.find("ale") diff --git a/doc/source/getting_started/comparison/includes/groupby.rst b/doc/source/getting_started/comparison/includes/groupby.rst index caa9f6ec9c9b8..93d5d51e3fb00 100644 --- a/doc/source/getting_started/comparison/includes/groupby.rst +++ b/doc/source/getting_started/comparison/includes/groupby.rst @@ -4,4 +4,4 @@ pandas provides a flexible ``groupby`` mechanism that allows similar aggregation .. ipython:: python tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum() - tips_summed.head() + tips_summed diff --git a/doc/source/getting_started/comparison/includes/if_then.rst b/doc/source/getting_started/comparison/includes/if_then.rst index d7977366cfc33..f94e7588827f5 100644 --- a/doc/source/getting_started/comparison/includes/if_then.rst +++ b/doc/source/getting_started/comparison/includes/if_then.rst @@ -4,7 +4,7 @@ the ``where`` method from ``numpy``. .. ipython:: python tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high") - tips.head() + tips .. ipython:: python :suppress: diff --git a/doc/source/getting_started/comparison/includes/length.rst b/doc/source/getting_started/comparison/includes/length.rst index 5a0c803e9eff2..9141fd4ea582a 100644 --- a/doc/source/getting_started/comparison/includes/length.rst +++ b/doc/source/getting_started/comparison/includes/length.rst @@ -4,5 +4,5 @@ Use ``len`` and ``rstrip`` to exclude trailing blanks. .. 
ipython:: python - tips["time"].str.len().head() - tips["time"].str.rstrip().str.len().head() + tips["time"].str.len() + tips["time"].str.rstrip().str.len() diff --git a/doc/source/getting_started/comparison/includes/limit.rst b/doc/source/getting_started/comparison/includes/limit.rst new file mode 100644 index 0000000000000..4efeb4e43d07c --- /dev/null +++ b/doc/source/getting_started/comparison/includes/limit.rst @@ -0,0 +1,7 @@ +By default, pandas will truncate output of large ``DataFrame``\s to show the first and last rows. +This can be overridden by :ref:`changing the pandas options `, or using +:meth:`DataFrame.head` or :meth:`DataFrame.tail`. + +.. ipython:: python + + tips.head(5) diff --git a/doc/source/getting_started/comparison/includes/missing.rst b/doc/source/getting_started/comparison/includes/missing.rst index 8e6ba95e98036..341c7d5498d82 100644 --- a/doc/source/getting_started/comparison/includes/missing.rst +++ b/doc/source/getting_started/comparison/includes/missing.rst @@ -1,24 +1,31 @@ -This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions -should be used for comparisons. +In pandas, :meth:`Series.isna` and :meth:`Series.notna` can be used to filter the rows. .. ipython:: python - outer_join[pd.isna(outer_join["value_x"])] - outer_join[pd.notna(outer_join["value_x"])] + outer_join[outer_join["value_x"].isna()] + outer_join[outer_join["value_x"].notna()] -pandas also provides a variety of methods to work with missing data -- some of -which would be challenging to express in Stata. For example, there are methods to -drop all rows with any missing values, replacing missing values with a specified -value, like the mean, or forward filling from previous rows. See the -:ref:`missing data documentation` for more. +pandas provides :ref:`a variety of methods to work with missing data `. Here are some examples: + +Drop rows with missing values +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. 
ipython:: python - # Drop rows with any missing value outer_join.dropna() - # Fill forwards +Forward fill from previous rows +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. ipython:: python + outer_join.fillna(method="ffill") - # Impute missing values with the mean +Replace missing values with a specified value +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Using the mean: + +.. ipython:: python + outer_join["value_x"].fillna(outer_join["value_x"].mean()) diff --git a/doc/source/getting_started/comparison/includes/missing_intro.rst b/doc/source/getting_started/comparison/includes/missing_intro.rst index ed97f639f3f3d..366aa43d1264c 100644 --- a/doc/source/getting_started/comparison/includes/missing_intro.rst +++ b/doc/source/getting_started/comparison/includes/missing_intro.rst @@ -1,6 +1,6 @@ -Both have a representation for missing data — pandas' is the special float value ``NaN`` (not a -number). Many of the semantics are the same; for example missing data propagates through numeric -operations, and is ignored by default for aggregations. +pandas represents missing data with the special float value ``NaN`` (not a number). Many of the +semantics are the same; for example missing data propagates through numeric operations, and is +ignored by default for aggregations. .. ipython:: python diff --git a/doc/source/getting_started/comparison/includes/sorting.rst b/doc/source/getting_started/comparison/includes/sorting.rst index 0840c9dd554b7..4e2e40a18adbd 100644 --- a/doc/source/getting_started/comparison/includes/sorting.rst +++ b/doc/source/getting_started/comparison/includes/sorting.rst @@ -3,4 +3,4 @@ pandas has a :meth:`DataFrame.sort_values` method, which takes a list of columns .. 
ipython:: python tips = tips.sort_values(["sex", "total_bill"]) - tips.head() + tips diff --git a/doc/source/getting_started/comparison/includes/time_date.rst b/doc/source/getting_started/comparison/includes/time_date.rst index 12a00b36dc97d..fb9ee2e216cd7 100644 --- a/doc/source/getting_started/comparison/includes/time_date.rst +++ b/doc/source/getting_started/comparison/includes/time_date.rst @@ -11,7 +11,7 @@ tips[ ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"] - ].head() + ] .. ipython:: python :suppress: diff --git a/doc/source/getting_started/comparison/includes/transform.rst b/doc/source/getting_started/comparison/includes/transform.rst index 0aa5b5b298cf7..b7599471432ad 100644 --- a/doc/source/getting_started/comparison/includes/transform.rst +++ b/doc/source/getting_started/comparison/includes/transform.rst @@ -5,4 +5,4 @@ succinctly expressed in one operation. gb = tips.groupby("smoker")["total_bill"] tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean") - tips.head() + tips
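
Note on the pattern this diff relies on: most `.head()` calls are dropped because, as the new `includes/limit.rst` states, pandas truncates the printed output of large ``DataFrame``\s by default. The snippet below is a minimal sketch of that behavior, not part of the patch. The frame `df` is a hypothetical stand-in for `tips` so that no network access is needed, and the `display.max_rows` value of 10 is chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set (assumption: any frame
# with more rows than display.max_rows behaves the same way).
df = pd.DataFrame({"total_bill": range(100), "tip": range(100)})

# Temporarily lower display.max_rows; the repr of the 100-row frame is
# then abbreviated to the first and last rows instead of printing all rows.
with pd.option_context("display.max_rows", 10):
    truncated = repr(df)

# Explicit limiting, as shown in the new "Limiting output" section.
first_five = df.head(5)
last_three = df.tail(3)

print(truncated)
print(first_five)
print(last_three)
```

Printing `df` directly is therefore safe in the rendered docs, while `head` and `tail` remain the explicit way to limit rows when a specific count is wanted.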