DOC: remove use of head() in the comparison docs #38935

Merged
merged 5 commits into from
Jan 4, 2021
56 changes: 19 additions & 37 deletions doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -4,23 +4,13 @@

Comparison with SAS
********************

For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__
this page is meant to demonstrate how different SAS operations would be
performed in pandas.

.. include:: includes/introduction.rst

.. note::

Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
Review comment from @afeld (Member, Author), Jan 4, 2021: No longer the case. Ditto for the Stata page.

This is often used in interactive work (e.g. `Jupyter notebook
<https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be:

.. code-block:: sas

proc print data=df(obs=5);
run;

Data structures
---------------
@@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly.
"pandas/master/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips.head()
tips


Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
@@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats
such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*``
function. See the :ref:`IO documentation<io>` for more details.

Limiting output
~~~~~~~~~~~~~~~
Review comment (Member, Author): New sections in each page.


.. include:: includes/limit.rst

The equivalent in SAS would be:

.. code-block:: sas

proc print data=df(obs=5);
run;


Exporting data
~~~~~~~~~~~~~~

@@ -173,20 +176,8 @@ be used on new or existing columns.
new_bill = total_bill / 2;
run;

pandas provides similar vectorized operations by
specifying the individual ``Series`` in the ``DataFrame``.
New columns can be assigned in the same way.
.. include:: includes/column_operations.rst

.. ipython:: python

tips["total_bill"] = tips["total_bill"] - 2
tips["new_bill"] = tips["total_bill"] / 2.0
tips.head()

.. ipython:: python
:suppress:

tips = tips.drop("new_bill", axis=1)

Filtering
~~~~~~~~~
@@ -278,18 +269,7 @@ drop, and rename columns.
rename total_bill=total_bill_2;
run;

The same operations are expressed in pandas below.

.. ipython:: python

# keep
tips[["sex", "total_bill", "tip"]].head()

# drop
tips.drop("sex", axis=1).head()

# rename
tips.rename(columns={"total_bill": "total_bill_2"}).head()
.. include:: includes/column_selection.rst


Sorting by values
@@ -442,6 +422,8 @@ input frames.
Missing data
------------

Both pandas and SAS have a representation for missing data.

.. include:: includes/missing_intro.rst

One difference is that missing data cannot be compared to its sentinel value.
26 changes: 19 additions & 7 deletions doc/source/getting_started/comparison/comparison_with_sql.rst
@@ -21,7 +21,7 @@ structure.
"/pandas/master/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips.head()
tips

SELECT
------
@@ -31,14 +31,13 @@ to select all columns):
.. code-block:: sql

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
FROM tips;

With pandas, column selection is done by passing a list of column names to your DataFrame:

.. ipython:: python

tips[["total_bill", "tip", "smoker", "time"]].head(5)
tips[["total_bill", "tip", "smoker", "time"]]

Calling the DataFrame without the list of column names would display all columns (akin to SQL's
``*``).
@@ -48,14 +47,13 @@ In SQL, you can add a calculated column:
.. code-block:: sql

SELECT *, tip/total_bill as tip_rate
FROM tips
LIMIT 5;
FROM tips;

With pandas, you can use the :meth:`DataFrame.assign` method of a DataFrame to append a new column:

.. ipython:: python

tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)
tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
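
A minimal sketch of how ``assign`` behaves, using a small hypothetical frame (``tips_demo``, with illustrative values) in place of the downloaded tips dataset. It suggests that ``assign`` returns a new ``DataFrame`` and leaves the original untouched, much as a SQL ``SELECT`` does not alter the underlying table:

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; the values are illustrative only.
tips_demo = pd.DataFrame(
    {"total_bill": [10.0, 20.0, 40.0], "tip": [1.0, 3.0, 8.0]}
)

# assign returns a new DataFrame with the calculated column appended.
with_rate = tips_demo.assign(tip_rate=tips_demo["tip"] / tips_demo["total_bill"])

# The original frame still has only its two original columns.
original_columns = list(tips_demo.columns)
```

To keep the new column, bind the result to a name, as ``with_rate`` does here.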

WHERE
-----
@@ -368,6 +366,20 @@ In pandas, you can use :meth:`~pandas.concat` in conjunction with

pd.concat([df1, df2]).drop_duplicates()


LIMIT
-----

.. code-block:: sql

SELECT * FROM tips
LIMIT 10;

.. ipython:: python

tips.head(10)
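
As a rough sketch of how the ``LIMIT`` mapping could extend to paging, the example below uses a hypothetical 20-row frame named ``demo`` rather than the tips data. The ``OFFSET`` analogue via positional slicing is an assumption added here for illustration and is not part of the documentation change:

```python
import pandas as pd

# Hypothetical table with 20 rows, numbered 0..19.
demo = pd.DataFrame({"n": range(20)})

# SELECT * FROM demo LIMIT 10;
first_ten = demo.head(10)

# SELECT * FROM demo LIMIT 10 OFFSET 5;  (positional slice, end exclusive)
page = demo.iloc[5:15]
```

Unlike a SQL query without ``ORDER BY``, a ``DataFrame`` has a defined row order, so ``head`` and ``iloc`` return the same rows on every call.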


pandas equivalents for some SQL analytic and aggregate functions
----------------------------------------------------------------

55 changes: 18 additions & 37 deletions doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -10,16 +10,6 @@ performed in pandas.

.. include:: includes/introduction.rst

.. note::

Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
This is often used in interactive work (e.g. `Jupyter notebook
<https://jupyter.org/>`_ or terminal) -- the equivalent in Stata would be:

.. code-block:: stata

list in 1/5

Data structures
---------------
@@ -116,7 +106,7 @@ the data set if presented with a url.
"/pandas/master/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips.head()
tips

Like ``import delimited``, :func:`read_csv` can take a number of parameters to specify
how the data should be parsed. For example, if the data were instead tab delimited,
@@ -141,6 +131,18 @@ such as Excel, SAS, HDF5, Parquet, and SQL databases. These are all read via a
function. See the :ref:`IO documentation<io>` for more details.


Limiting output
~~~~~~~~~~~~~~~

.. include:: includes/limit.rst

The equivalent in Stata would be:

.. code-block:: stata

list in 1/5


Exporting data
~~~~~~~~~~~~~~

@@ -179,18 +181,8 @@ the column from the data set.
generate new_bill = total_bill / 2
drop new_bill

pandas provides similar vectorized operations by
specifying the individual ``Series`` in the ``DataFrame``.
New columns can be assigned in the same way. The :meth:`DataFrame.drop` method
drops a column from the ``DataFrame``.
.. include:: includes/column_operations.rst

.. ipython:: python

tips["total_bill"] = tips["total_bill"] - 2
tips["new_bill"] = tips["total_bill"] / 2
tips.head()

tips = tips.drop("new_bill", axis=1)

Filtering
~~~~~~~~~
@@ -256,20 +248,7 @@ Stata provides keywords to select, drop, and rename columns.

rename total_bill total_bill_2

The same operations are expressed in pandas below. Note that in contrast to Stata, these
operations do not happen in place. To make these changes persist, assign the operation back
to a variable.

.. ipython:: python

# keep
tips[["sex", "total_bill", "tip"]].head()

# drop
tips.drop("sex", axis=1).head()

# rename
tips.rename(columns={"total_bill": "total_bill_2"}).head()
.. include:: includes/column_selection.rst


Sorting by values
@@ -428,12 +407,14 @@ or the intersection of the two by using the values created in the
restore
merge 1:n key using df2.dta

.. include:: includes/merge_setup.rst
.. include:: includes/merge.rst


Missing data
------------

Both pandas and Stata have a representation for missing data.

.. include:: includes/missing_intro.rst

One difference is that missing data cannot be compared to its sentinel value.
@@ -0,0 +1,11 @@
pandas provides similar vectorized operations by specifying the individual ``Series`` in the
``DataFrame``. New columns can be assigned in the same way. The :meth:`DataFrame.drop` method drops
a column from the ``DataFrame``.

.. ipython:: python

tips["total_bill"] = tips["total_bill"] - 2
tips["new_bill"] = tips["total_bill"] / 2
tips

tips = tips.drop("new_bill", axis=1)
@@ -0,0 +1,23 @@
The same operations are expressed in pandas below. Note that these operations do not happen in
place. To make these changes persist, assign the operation back to a variable.

Keep certain columns
''''''''''''''''''''
Review comment (Member, Author): These headings are new.


.. ipython:: python

tips[["sex", "total_bill", "tip"]]

Drop a column
'''''''''''''

.. ipython:: python

tips.drop("sex", axis=1)

Rename a column
'''''''''''''''

.. ipython:: python

tips.rename(columns={"total_bill": "total_bill_2"})
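
To illustrate the point that these operations do not happen in place, here is a small sketch using a hypothetical two-row frame (``tips_demo``) instead of the real tips data. Calling ``drop`` without keeping the result leaves the frame unchanged; assigning the result back makes the change persist:

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; the values are illustrative only.
tips_demo = pd.DataFrame(
    {"sex": ["Female", "Male"], "total_bill": [16.99, 10.34], "tip": [1.01, 1.66]}
)

# The result is discarded, so tips_demo still contains the "sex" column.
tips_demo.drop("sex", axis=1)
still_there = "sex" in tips_demo.columns

# Assigning the result back makes the change persist.
tips_demo = tips_demo.drop("sex", axis=1)
tips_demo = tips_demo.rename(columns={"total_bill": "total_bill_2"})
```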
@@ -4,4 +4,4 @@ indexes are zero-based.

.. ipython:: python

tips["sex"].str[0:1].head()
tips["sex"].str[0:1]
@@ -5,4 +5,4 @@ zero-based.

.. ipython:: python

tips["sex"].str.find("ale").head()
tips["sex"].str.find("ale")
2 changes: 1 addition & 1 deletion doc/source/getting_started/comparison/includes/groupby.rst
@@ -4,4 +4,4 @@ pandas provides a flexible ``groupby`` mechanism that allows similar aggregation
.. ipython:: python

tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
tips_summed.head()
tips_summed
2 changes: 1 addition & 1 deletion doc/source/getting_started/comparison/includes/if_then.rst
@@ -4,7 +4,7 @@ the ``where`` method from ``numpy``.
.. ipython:: python

tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
tips.head()
tips

.. ipython:: python
:suppress:
4 changes: 2 additions & 2 deletions doc/source/getting_started/comparison/includes/length.rst
@@ -4,5 +4,5 @@ Use ``len`` and ``rstrip`` to exclude trailing blanks.

.. ipython:: python

tips["time"].str.len().head()
tips["time"].str.rstrip().str.len().head()
tips["time"].str.len()
tips["time"].str.rstrip().str.len()
7 changes: 7 additions & 0 deletions doc/source/getting_started/comparison/includes/limit.rst
@@ -0,0 +1,7 @@
By default, pandas will truncate output of large ``DataFrame``\s to show the first and last rows.
This can be overridden by :ref:`changing the pandas options <options>`, or using
:meth:`DataFrame.head` or :meth:`DataFrame.tail`.

.. ipython:: python

tips.head(5)
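
The three ways of limiting output mentioned above can be sketched as follows. The 100-row frame ``demo`` is hypothetical, and ``pd.option_context`` is used here as one assumed way to change ``display.max_rows`` temporarily rather than globally:

```python
import pandas as pd

# Hypothetical frame that is long enough to be truncated when displayed.
demo = pd.DataFrame({"n": range(100)})

first = demo.head(3)  # first three rows
last = demo.tail(3)   # last three rows

# Temporarily lower the row limit; the previous setting is restored on exit.
with pd.option_context("display.max_rows", 10):
    truncated = repr(demo)
```

Within the ``with`` block, the printed representation shows only the leading and trailing rows, followed by a summary of the full dimensions.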
31 changes: 19 additions & 12 deletions doc/source/getting_started/comparison/includes/missing.rst
@@ -1,24 +1,31 @@
This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions
should be used for comparisons.
In pandas, :meth:`Series.isna` and :meth:`Series.notna` can be used to filter the rows.

.. ipython:: python

outer_join[pd.isna(outer_join["value_x"])]
outer_join[pd.notna(outer_join["value_x"])]
outer_join[outer_join["value_x"].isna()]
outer_join[outer_join["value_x"].notna()]
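
A short sketch of why ``isna`` is needed, using a hypothetical ``Series`` rather than the ``outer_join`` frame from the docs. An equality comparison against ``NaN`` is false for every element, including the missing one, so it cannot be used as a filter:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
values = pd.Series([1.0, np.nan, 3.0])

# Comparing against the sentinel yields False everywhere, even for the NaN entry.
equals_nan = values == np.nan

# isna and notna identify missing and present entries reliably.
missing_mask = values.isna()
present = values[values.notna()]
```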

pandas also provides a variety of methods to work with missing data -- some of
which would be challenging to express in Stata. For example, there are methods to
drop all rows with any missing values, replacing missing values with a specified
value, like the mean, or forward filling from previous rows. See the
:ref:`missing data documentation<missing_data>` for more.
pandas provides :ref:`a variety of methods to work with missing data <missing_data>`. Here are some examples:
Review comment (Member, Author): Split out the examples to sub-headings below.


Drop rows with missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. ipython:: python

# Drop rows with any missing value
outer_join.dropna()

# Fill forwards
Forward fill from previous rows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. ipython:: python

outer_join.fillna(method="ffill")

# Impute missing values with the mean
Replace missing values with a specified value
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the mean:

.. ipython:: python

outer_join["value_x"].fillna(outer_join["value_x"].mean())
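
One caveat worth sketching: in recent pandas releases the ``method`` keyword of ``fillna`` is deprecated, and :meth:`DataFrame.ffill` is the suggested spelling for forward filling. The example below uses a hypothetical frame (``demo``) in place of ``outer_join`` and shows both the forward fill and the mean imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for outer_join; the values are illustrative only.
demo = pd.DataFrame(
    {"key": ["A", "B", "C", "D"], "value_x": [1.0, np.nan, 3.0, np.nan]}
)

# Forward fill: each missing value takes the last valid value above it.
filled = demo.ffill()

# Impute missing values with the column mean (NaN is skipped when averaging).
imputed = demo["value_x"].fillna(demo["value_x"].mean())
```

Neither call modifies ``demo``; both return new objects.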