diff --git a/doc/source/_static/excel_pivot.png b/doc/source/_static/excel_pivot.png new file mode 100644 index 0000000000000..beacc90bc313e Binary files /dev/null and b/doc/source/_static/excel_pivot.png differ diff --git a/doc/source/_static/logo_excel.svg b/doc/source/_static/logo_excel.svg new file mode 100644 index 0000000000000..ffb25108df67c --- /dev/null +++ b/doc/source/_static/logo_excel.svg @@ -0,0 +1,27 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/doc/source/getting_started/comparison/comparison_boilerplate.rst b/doc/source/getting_started/comparison/comparison_boilerplate.rst new file mode 100644 index 0000000000000..aedf2875dc452 --- /dev/null +++ b/doc/source/getting_started/comparison/comparison_boilerplate.rst @@ -0,0 +1,9 @@ +If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>` +to familiarize yourself with the library. + +As is customary, we import pandas and NumPy as follows: + +.. ipython:: python + + import pandas as pd + import numpy as np diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst index ae9f1caebd556..c6f508aae0e21 100644 --- a/doc/source/getting_started/comparison/comparison_with_sas.rst +++ b/doc/source/getting_started/comparison/comparison_with_sas.rst @@ -8,16 +8,7 @@ For potential users coming from `SAS ` -to familiarize yourself with the library. - -As is customary, we import pandas and NumPy as follows: - -.. ipython:: python - - import pandas as pd - import numpy as np - +.. include:: comparison_boilerplate.rst .. note:: @@ -48,14 +39,17 @@ General terminology translation ``NaN``, ``.`` -``DataFrame`` / ``Series`` -~~~~~~~~~~~~~~~~~~~~~~~~~~ +``DataFrame`` +~~~~~~~~~~~~~ A ``DataFrame`` in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns that can be of different types. 
As will be shown in this document, almost any operation that can be applied to a data set using SAS's ``DATA`` step, can also be accomplished in pandas. +``Series`` +~~~~~~~~~~ + A ``Series`` is the data structure that represents one column of a ``DataFrame``. SAS doesn't have a separate data structure for a single column, but in general, working with a ``Series`` is analogous to referencing a column diff --git a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst new file mode 100644 index 0000000000000..73645d429cc66 --- /dev/null +++ b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst @@ -0,0 +1,253 @@ +.. _compare_with_spreadsheets: + +{{ header }} + +Comparison with spreadsheets +**************************** + +Since many potential pandas users have some familiarity with spreadsheet programs like +`Excel `_, this page is meant to provide some examples +of how various spreadsheet operations would be performed using pandas. This page will use +terminology and link to documentation for Excel, but much will be the same/similar in +`Google Sheets `_, +`LibreOffice Calc `_, +`Apple Numbers `_, and other +Excel-compatible spreadsheet software. + +.. include:: comparison_boilerplate.rst + +Data structures +--------------- + +General terminology translation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. csv-table:: + :header: "pandas", "Excel" + :widths: 20, 20 + + ``DataFrame``, worksheet + ``Series``, column + ``Index``, row headings + row, row + ``NaN``, empty cell + +``DataFrame`` +~~~~~~~~~~~~~ + +A ``DataFrame`` in pandas is analogous to an Excel worksheet. While an Excel workbook can contain +multiple worksheets, pandas ``DataFrame``\s exist independently. + +``Series`` +~~~~~~~~~~ + +A ``Series`` is the data structure that represents one column of a ``DataFrame``. Working with a +``Series`` is analogous to referencing a column of a spreadsheet. 
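A minimal sketch of the ``Series``-as-column idea, assuming a hypothetical ``populations`` table with made-up figures (the data and column names are illustrative and not taken from the pandas documentation):

```python
import pandas as pd

# Hypothetical data standing in for a small worksheet with two columns.
populations = pd.DataFrame(
    {"city": ["Chicago", "Houston"], "population": [2_700_000, 2_300_000]}
)

# Selecting one column returns a Series, much like referencing a
# single column of a spreadsheet.
population = populations["population"]

# Arithmetic on a Series applies to every value in the column at once.
in_millions = population / 1_000_000
```

Here ``population`` keeps the same row labels as ``populations``, so results computed from it line up with the original rows.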
+ +``Index`` +~~~~~~~~~ + +Every ``DataFrame`` and ``Series`` has an ``Index``, which are labels on the *rows* of the data. In +pandas, if no index is specified, a :class:`~pandas.RangeIndex` is used by default (first row = 0, +second row = 1, and so on), analogous to row headings/numbers in spreadsheets. + +In pandas, indexes can be set to one (or multiple) unique values, which is like having a column that +you use as the row identifier in a worksheet. Unlike spreadsheets, these ``Index`` values can actually be +used to reference the rows. For example, in spreadsheets, you would reference the first row as ``A1:Z1``, +while in pandas you could use ``populations.loc['Chicago']``. + +Index values are also persistent, so if you re-order the rows in a ``DataFrame``, the label for a +particular row doesn't change. + +See the :ref:`indexing documentation` for much more on how to use an ``Index`` +effectively. + +Commonly used spreadsheet functionalities +----------------------------------------- + +Importing data +~~~~~~~~~~~~~~ + +Both `Excel `__ +and :ref:`pandas <10min_tut_02_read_write>` can import data from various sources in various +formats. + +Excel files +''''''''''' + +Excel opens `various Excel file formats `_ +by double-clicking them, or using `the Open menu `_. +In pandas, you use :ref:`special methods for reading and writing from/to Excel files `. + +CSV +''' + +Let's load and display the `tips `_ +dataset from the pandas tests, which is a CSV file. In Excel, you would download and then +`open the CSV `_. +In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read_csv`: + +.. ipython:: python + + url = ( + "https://raw.github.com/pandas-dev" + "/pandas/master/pandas/tests/io/data/csv/tips.csv" + ) + tips = pd.read_csv(url) + tips + +Fill Handle +~~~~~~~~~~~ + +Create a series of numbers following a set pattern in a certain set of cells. 
In +a spreadsheet, this would be done by shift+drag after entering the first number or by +entering the first two or three values and then dragging. + +This can be achieved by creating a series and assigning it to the desired cells. + +.. ipython:: python + + df = pd.DataFrame({"AAA": [1] * 8, "BBB": list(range(0, 8))}) + df + + series = list(range(1, 5)) + series + + df.loc[2:5, "AAA"] = series + + df + +Filters +~~~~~~~ + +Filters can be achieved by using boolean indexing. + +The examples filter by 0 on column AAA, and also show how to filter by multiple +values. + +.. ipython:: python + + df[df.AAA == 0] + + df[(df.AAA == 0) | (df.AAA == 2)] + + +Drop Duplicates +~~~~~~~~~~~~~~~ + +Excel has built-in functionality for `removing duplicate values `_. +This is supported in pandas via :meth:`~DataFrame.drop_duplicates`. + +.. ipython:: python + + df = pd.DataFrame( + { + "class": ["A", "A", "A", "B", "C", "D"], + "student_count": [42, 35, 42, 50, 47, 45], + "all_pass": ["Yes", "Yes", "Yes", "No", "No", "Yes"], + } + ) + + df.drop_duplicates() + + df.drop_duplicates(["class", "student_count"]) + + +Pivot Tables +~~~~~~~~~~~~ + +`PivotTables `_ +from spreadsheets can be replicated in pandas through :ref:`reshaping`. Using the ``tips`` dataset again, +let's find the average gratuity by size of the party and sex of the server. + +In Excel, we use the following configuration for the PivotTable: + +.. image:: ../../_static/excel_pivot.png + :align: center + +The equivalent in pandas: + +.. ipython:: python + + pd.pivot_table( + tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average + ) + +Formulas +~~~~~~~~ + +In spreadsheets, `formulas `_ +are often created in individual cells and then `dragged `_ +into other cells to compute them for other columns. In pandas, you'll be doing more operations on +full columns. + +As an example, let's create a new column "girls_count" and try to compute the number of boys in +each class. + +.. 
ipython:: python + + df["girls_count"] = [21, 12, 21, 31, 23, 17] + df + df["boys_count"] = df["student_count"] - df["girls_count"] + df + +Note that we don't have to tell pandas to do that subtraction cell by cell; it handles that for +us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`. + +VLOOKUP +~~~~~~~ + +.. ipython:: python + + import random + + first_names = [ + "harry", + "ron", + "hermione", + "rubius", + "albus", + "severus", + "luna", + ] + keys = [1, 2, 3, 4, 5, 6, 7] + df1 = pd.DataFrame({"keys": keys, "first_names": first_names}) + df1 + + surnames = [ + "hadrid", + "malfoy", + "lovegood", + "dumbledore", + "grindelwald", + "granger", + "weasly", + "riddle", + "longbottom", + "snape", + ] + keys = [random.randint(1, 7) for x in range(0, 10)] + random_names = pd.DataFrame({"surnames": surnames, "keys": keys}) + + random_names + + random_names.merge(df1, on="keys", how="left") + +Adding a row +~~~~~~~~~~~~ + +To append a row, we can just assign values to a new index label using :meth:`~DataFrame.loc`. + +NOTE: If the index label already exists, the values in that row will be overwritten. + +.. ipython:: python + + df1.loc[7] = [8, "tonks"] + df1 + + +Search and Replace +~~~~~~~~~~~~~~~~~~ + +The ``replace`` method of the ``DataFrame`` object can perform +this function. Please see `pandas.DataFrame.replace `__ for examples. diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst index 6848d8df2e46b..4fe7b7e96cf50 100644 --- a/doc/source/getting_started/comparison/comparison_with_sql.rst +++ b/doc/source/getting_started/comparison/comparison_with_sql.rst @@ -8,15 +8,7 @@ Since many potential pandas users have some familiarity with `SQL `_, this page is meant to provide some examples of how various SQL operations would be performed using pandas. 
-If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>` -to familiarize yourself with the library. - -As is customary, we import pandas and NumPy as follows: - -.. ipython:: python - - import pandas as pd - import numpy as np +.. include:: comparison_boilerplate.rst Most of the examples will utilize the ``tips`` dataset found within pandas tests. We'll read the data into a DataFrame called ``tips`` and assume we have a database table of the same name and diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst index 014506cc18327..b3ed9b1ba630f 100644 --- a/doc/source/getting_started/comparison/comparison_with_stata.rst +++ b/doc/source/getting_started/comparison/comparison_with_stata.rst @@ -8,17 +8,7 @@ For potential users coming from `Stata `__ this page is meant to demonstrate how different Stata operations would be performed in pandas. -If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>` -to familiarize yourself with the library. - -As is customary, we import pandas and NumPy as follows. This means that we can refer to the -libraries as ``pd`` and ``np``, respectively, for the rest of the document. - -.. ipython:: python - - import pandas as pd - import numpy as np - +.. include:: comparison_boilerplate.rst .. note:: @@ -48,14 +38,17 @@ General terminology translation ``NaN``, ``.`` -``DataFrame`` / ``Series`` -~~~~~~~~~~~~~~~~~~~~~~~~~~ +``DataFrame`` +~~~~~~~~~~~~~ A ``DataFrame`` in pandas is analogous to a Stata data set -- a two-dimensional data source with labeled columns that can be of different types. As will be shown in this document, almost any operation that can be applied to a data set in Stata can also be accomplished in pandas. +``Series`` +~~~~~~~~~~ + A ``Series`` is the data structure that represents one column of a ``DataFrame``. 
Stata doesn't have a separate data structure for a single column, but in general, working with a ``Series`` is analogous to referencing a column diff --git a/doc/source/getting_started/comparison/index.rst b/doc/source/getting_started/comparison/index.rst index 998706ce0c639..c3f58ce1f3d6d 100644 --- a/doc/source/getting_started/comparison/index.rst +++ b/doc/source/getting_started/comparison/index.rst @@ -11,5 +11,6 @@ Comparison with other tools comparison_with_r comparison_with_sql + comparison_with_spreadsheets comparison_with_sas comparison_with_stata diff --git a/doc/source/getting_started/index.rst b/doc/source/getting_started/index.rst index 6f6eeada0cfed..de47bd5b72148 100644 --- a/doc/source/getting_started/index.rst +++ b/doc/source/getting_started/index.rst @@ -619,6 +619,22 @@ the pandas-equivalent operations compared to software you already know: :ref:`Learn more ` +.. raw:: html + + + + +
+
+ Excel logo +
+

Users of Excel + or other spreadsheet programs will find that many of the concepts are transferable to pandas.

+ +.. container:: custom-button + + :ref:`Learn more ` + .. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/05_add_columns.rst b/doc/source/getting_started/intro_tutorials/05_add_columns.rst index a99c2c49585c5..6c7c6faf69114 100644 --- a/doc/source/getting_started/intro_tutorials/05_add_columns.rst +++ b/doc/source/getting_started/intro_tutorials/05_add_columns.rst @@ -107,11 +107,13 @@ values in each row*. -Also other mathematical operators (+, -, \*, /) or -logical operators (<, >, =,…) work element wise. The latter was already +Other mathematical operators (``+``, ``-``, ``*``, ``/``) and +logical operators (``<``, ``>``, ``==``,…) also work element-wise. The latter was already used in the :ref:`subset data tutorial <10min_tut_03_subset>` to filter rows of a table using a conditional expression. +If you need more advanced logic, you can use arbitrary Python code via :meth:`~DataFrame.apply`. + .. raw:: html