diff --git a/doc/source/_static/spreadsheets/conditional.png b/doc/source/_static/spreadsheets/conditional.png new file mode 100644 index 0000000000000..d518ff19dc760 Binary files /dev/null and b/doc/source/_static/spreadsheets/conditional.png differ diff --git a/doc/source/_static/spreadsheets/filter.png b/doc/source/_static/spreadsheets/filter.png new file mode 100644 index 0000000000000..b4c929793ca44 Binary files /dev/null and b/doc/source/_static/spreadsheets/filter.png differ diff --git a/doc/source/_static/spreadsheets/find.png b/doc/source/_static/spreadsheets/find.png new file mode 100644 index 0000000000000..223b2e6fc762f Binary files /dev/null and b/doc/source/_static/spreadsheets/find.png differ diff --git a/doc/source/_static/logo_excel.svg b/doc/source/_static/spreadsheets/logo_excel.svg similarity index 100% rename from doc/source/_static/logo_excel.svg rename to doc/source/_static/spreadsheets/logo_excel.svg diff --git a/doc/source/_static/excel_pivot.png b/doc/source/_static/spreadsheets/pivot.png similarity index 100% rename from doc/source/_static/excel_pivot.png rename to doc/source/_static/spreadsheets/pivot.png diff --git a/doc/source/_static/spreadsheets/sort.png b/doc/source/_static/spreadsheets/sort.png new file mode 100644 index 0000000000000..253f2f3bfb9ba Binary files /dev/null and b/doc/source/_static/spreadsheets/sort.png differ diff --git a/doc/source/_static/spreadsheets/vlookup.png b/doc/source/_static/spreadsheets/vlookup.png new file mode 100644 index 0000000000000..e96da01da1eeb Binary files /dev/null and b/doc/source/_static/spreadsheets/vlookup.png differ diff --git a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst index 7b779b02e20f8..e9d687bc07999 100644 --- a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst +++ b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst @@ -52,9 +52,12 @@ pandas, 
if no index is specified, a :class:`~pandas.RangeIndex` is used by defau second row = 1, and so on), analogous to row headings/numbers in spreadsheets. In pandas, indexes can be set to one (or multiple) unique values, which is like having a column that -use use as the row identifier in a worksheet. Unlike spreadsheets, these ``Index`` values can actually be -used to reference the rows. For example, in spreadsheets, you would reference the first row as ``A1:Z1``, -while in pandas you could use ``populations.loc['Chicago']``. +is used as the row identifier in a worksheet. Unlike most spreadsheets, these ``Index`` values can +actually be used to reference the rows. (Note that `this can be done in Excel with structured +references +`_.) +For example, in spreadsheets, you would reference the first row as ``A1:Z1``, while in pandas you +could use ``populations.loc['Chicago']``. Index values are also persistent, so if you re-order the rows in a ``DataFrame``, the label for a particular row don't change. @@ -62,11 +65,18 @@ particular row don't change. See the :ref:`indexing documentation` for much more on how to use an ``Index`` effectively. -Commonly used spreadsheet functionalities ------------------------------------------ +Data input / output +------------------- -Importing data -~~~~~~~~~~~~~~ +Constructing a DataFrame from values +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In a spreadsheet, `values can be typed directly into cells `_. + +.. include:: includes/construct_dataframe.rst + +Reading external data +~~~~~~~~~~~~~~~~~~~~~ Both `Excel `__ and :ref:`pandas <10min_tut_02_read_write>` can import data from various sources in various @@ -96,6 +106,248 @@ In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read tips = pd.read_csv(url) tips +Like `Excel's Text Import Wizard `_, +``read_csv`` can take a number of parameters to specify how the data should be parsed. 
For +example, if the data was instead tab delimited, and did not have column names, the pandas command +would be: + +.. code-block:: python + + tips = pd.read_csv("tips.csv", sep="\t", header=None) + + # alternatively, read_table is an alias to read_csv with tab delimiter + tips = pd.read_table("tips.csv", header=None) + + +Limiting output +~~~~~~~~~~~~~~~ + +Spreadsheet programs will only show one screenful of data at a time and then allow you to scroll, so +there isn't really a need to limit output. In pandas, you'll need to put a little more thought into +controlling how your ``DataFrame``\s are displayed. + +.. include:: includes/limit.rst + + +Exporting data +~~~~~~~~~~~~~~ + +By default, desktop spreadsheet software will save to its respective file format (``.xlsx``, ``.ods``, etc). You can, however, `save to other file formats `_. + +:ref:`pandas can create Excel files `, :ref:`CSV `, or :ref:`a number of other formats `. + +Data operations +--------------- + +Operations on columns +~~~~~~~~~~~~~~~~~~~~~ + +In spreadsheets, `formulas +`_ +are often created in individual cells and then `dragged +`_ +into other cells to compute them for other columns. In pandas, you're able to do operations on whole +columns directly. + +.. include:: includes/column_operations.rst + +Note that we aren't having to tell it to do that subtraction cell-by-cell — pandas handles that for +us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`. + + +Filtering +~~~~~~~~~ + +`In Excel, filtering is done through a graphical menu. `_ + +.. image:: ../../_static/spreadsheets/filter.png + :alt: Screenshot showing filtering of the total_bill column to values greater than 10 + :align: center + +.. include:: includes/filtering.rst + +If/then logic +~~~~~~~~~~~~~ + +Let's say we want to make a ``bucket`` column with values of ``low`` and ``high``, based on whether +the ``total_bill`` is less or more than $10. 
+ +In spreadsheets, logical comparison can be done with `conditional formulas +`_. +We'd use a formula of ``=IF(A2 < 10, "low", "high")``, dragged to all cells in a new ``bucket`` +column. + +.. image:: ../../_static/spreadsheets/conditional.png + :alt: Screenshot showing the formula from above in a bucket column of the tips spreadsheet + :align: center + +.. include:: includes/if_then.rst + +Date functionality +~~~~~~~~~~~~~~~~~~ + +*This section will refer to "dates", but timestamps are handled similarly.* + +We can think of date functionality in two parts: parsing, and output. In spreadsheets, date values +are generally parsed automatically, though there is a `DATEVALUE +`_ +function if you need it. In pandas, you need to explicitly convert plain text to datetime objects, +either :ref:`while reading from a CSV ` or :ref:`once in a DataFrame +<10min_tut_09_timeseries.properties>`. + +Once parsed, spreadsheets display the dates in a default format, though `the format can be changed +`_. +In pandas, you'll generally want to keep dates as ``datetime`` objects while you're doing +calculations with them. Outputting *parts* of dates (such as the year) is done through `date +functions +`_ +in spreadsheets, and :ref:`datetime properties <10min_tut_09_timeseries.properties>` in pandas. + +Given ``date1`` and ``date2`` in columns ``A`` and ``B`` of a spreadsheet, you might have these +formulas: + +.. list-table:: + :header-rows: 1 + :widths: auto + + * - column + - formula + * - ``date1_year`` + - ``=YEAR(A2)`` + * - ``date2_month`` + - ``=MONTH(B2)`` + * - ``date1_next`` + - ``=DATE(YEAR(A2),MONTH(A2)+1,1)`` + * - ``months_between`` + - ``=DATEDIF(A2,B2,"M")`` + +The equivalent pandas operations are shown below. + +.. include:: includes/time_date.rst + +See :ref:`timeseries` for more details. 
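As a rough illustration of how the four spreadsheet formulas in the table above might map to pandas, here is a minimal sketch. The two-row ``dates`` frame and its values are hypothetical stand-ins for columns ``A`` and ``B``; the contents of ``includes/time_date.rst`` are not shown here and may differ. The month difference is computed by hand to approximate ``DATEDIF(..., "M")``, which counts whole months.

```python
import pandas as pd

# Hypothetical data standing in for spreadsheet columns A (date1) and B (date2).
dates = pd.DataFrame(
    {
        "date1": pd.to_datetime(["2013-01-15", "2013-02-15"]),
        "date2": pd.to_datetime(["2015-02-15", "2015-08-15"]),
    }
)

# =YEAR(A2) and =MONTH(B2): datetime properties via the .dt accessor.
dates["date1_year"] = dates["date1"].dt.year
dates["date2_month"] = dates["date2"].dt.month

# =DATE(YEAR(A2),MONTH(A2)+1,1): roll forward to the first day of the next month.
dates["date1_next"] = dates["date1"] + pd.offsets.MonthBegin()

# =DATEDIF(A2,B2,"M"): whole months between the two dates (an approximation:
# count calendar months, then subtract one if the day-of-month has not been reached).
d1, d2 = dates["date1"], dates["date2"]
months = (d2.dt.year - d1.dt.year) * 12 + (d2.dt.month - d1.dt.month)
dates["months_between"] = months - (d2.dt.day < d1.dt.day).astype(int)

print(dates)
```

Keeping ``date1`` and ``date2`` as ``datetime`` columns means each derived column is a single vectorized expression rather than a formula dragged down a column.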
+ + +Selection of columns +~~~~~~~~~~~~~~~~~~~~ + +In spreadsheets, you can select columns you want by: + +- `Hiding columns `_ +- `Deleting columns `_ +- `Referencing a range `_ from one worksheet into another + +Since spreadsheet columns are typically `named in a header row +`_, +renaming a column is simply a matter of changing the text in that first cell. + +.. include:: includes/column_selection.rst + + +Sorting by values +~~~~~~~~~~~~~~~~~ + +Sorting in spreadsheets is accomplished via `the sort dialog `_. + +.. image:: ../../_static/spreadsheets/sort.png + :alt: Screenshot of dialog from Excel showing sorting by the sex then total_bill columns + :align: center + +.. include:: includes/sorting.rst + +String processing +----------------- + +Finding length of string +~~~~~~~~~~~~~~~~~~~~~~~~ + +In spreadsheets, the number of characters in text can be found with the `LEN +`_ +function. This can be used with the `TRIM +`_ +function to remove extra whitespace. + +:: + + =LEN(TRIM(A2)) + +.. include:: includes/length.rst + +Note that the pandas version still counts repeated spaces within the string, so it isn't 100% equivalent. + + +Finding position of substring +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The `FIND +`_ +spreadsheet function returns the position of a substring, with the first character being ``1``. + +.. image:: ../../_static/spreadsheets/find.png + :alt: Screenshot of FIND formula being used in Excel + :align: center + +.. include:: includes/find_substring.rst + + +Extracting substring by position +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Spreadsheets have a `MID +`_ +formula for extracting a substring from a given position. To get the first character:: + + =MID(A2,1,1) + +.. include:: includes/extract_substring.rst + + +Extracting nth word +~~~~~~~~~~~~~~~~~~~ + +In Excel, you might use the `Text to Columns Wizard +`_ +for splitting text and retrieving a specific column. (Note `it's possible to do so through a formula +as well `_.) + +.. 
include:: includes/nth_word.rst + + +Changing case +~~~~~~~~~~~~~ + +Spreadsheets provide `UPPER, LOWER, and PROPER functions +`_ +for converting text to upper, lower, and title case, respectively. + +.. include:: includes/case.rst + + +Merging +------- + +.. include:: includes/merge_setup.rst + +In Excel, `merging of tables can be done through a VLOOKUP +`_. + +.. image:: ../../_static/spreadsheets/vlookup.png + :alt: Screenshot showing a VLOOKUP formula between two tables in Excel, with some values being filled in and others with "#N/A" + :align: center + +.. include:: includes/merge.rst + +``merge`` has a number of advantages over ``VLOOKUP``: + +* The lookup value doesn't need to be the first column of the lookup table +* If multiple rows are matched, there will be one row for each match, instead of just the first +* It will include all columns from the lookup table, instead of just a single specified column +* It supports :ref:`more complex join operations ` + + +Other considerations +-------------------- + Fill Handle ~~~~~~~~~~~ @@ -117,21 +369,6 @@ This can be achieved by creating a series and assigning it to the desired cells. df -Filters -~~~~~~~ - -Filters can be achieved by using slicing. - -The examples filter by 0 on column AAA, and also show how to filter by multiple -values. - -.. ipython:: python - - df[df.AAA == 0] - - df[(df.AAA == 0) | (df.AAA == 2)] - - Drop Duplicates ~~~~~~~~~~~~~~~ @@ -152,7 +389,6 @@ This is supported in pandas via :meth:`~DataFrame.drop_duplicates`. df.drop_duplicates(["class", "student_count"]) - Pivot Tables ~~~~~~~~~~~~ @@ -162,7 +398,8 @@ let's find the average gratuity by size of the party and sex of the server. In Excel, we use the following configuration for the PivotTable: -.. image:: ../../_static/excel_pivot.png +.. 
image:: ../../_static/spreadsheets/pivot.png + :alt: Screenshot showing a PivotTable in Excel, using sex as the column, size as the rows, then average tip as the values :align: center The equivalent in pandas: @@ -173,81 +410,34 @@ The equivalent in pandas: tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average ) -Formulas -~~~~~~~~ -In spreadsheets, `formulas `_ -are often created in individual cells and then `dragged `_ -into other cells to compute them for other columns. In pandas, you'll be doing more operations on -full columns. +Adding a row +~~~~~~~~~~~~ -As an example, let's create a new column "girls_count" and try to compute the number of boys in -each class. +Assuming we are using a :class:`~pandas.RangeIndex` (numbered ``0``, ``1``, etc.), we can use :meth:`DataFrame.append` to add a row to the bottom of a ``DataFrame``. .. ipython:: python - df["girls_count"] = [21, 12, 21, 31, 23, 17] - df - df["boys_count"] = df["student_count"] - df["girls_count"] df + new_row = {"class": "E", "student_count": 51, "all_pass": True} + df.append(new_row, ignore_index=True) -Note that we aren't having to tell it to do that subtraction cell-by-cell — pandas handles that for -us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`. -VLOOKUP -~~~~~~~ - -.. 
ipython:: python +Find and Replace +~~~~~~~~~~~~~~~~ - import random - - first_names = [ - "harry", - "ron", - "hermione", - "rubius", - "albus", - "severus", - "luna", - ] - keys = [1, 2, 3, 4, 5, 6, 7] - df1 = pd.DataFrame({"keys": keys, "first_names": first_names}) - df1 - - surnames = [ - "hadrid", - "malfoy", - "lovegood", - "dumbledore", - "grindelwald", - "granger", - "weasly", - "riddle", - "longbottom", - "snape", - ] - keys = [random.randint(1, 7) for x in range(0, 10)] - random_names = pd.DataFrame({"surnames": surnames, "keys": keys}) - - random_names - - random_names.merge(df1, on="keys", how="left") - -Adding a row -~~~~~~~~~~~~ - -To appended a row, we can just assign values to an index using :meth:`~DataFrame.loc`. - -NOTE: If the index already exists, the values in that index will be over written. +`Excel's Find dialog `_ +takes you to cells that match, one by one. In pandas, this operation is generally done for an +entire column or ``DataFrame`` at once through :ref:`conditional expressions <10min_tut_03_subset.rows_and_columns>`. .. ipython:: python - df1.loc[7] = [8, "tonks"] - df1 + tips + tips == "Sun" + tips["day"].str.contains("S") +pandas' :meth:`~DataFrame.replace` is comparable to Excel's ``Replace All``. -Search and Replace -~~~~~~~~~~~~~~~~~~ +.. ipython:: python -The ``replace`` method that comes associated with the ``DataFrame`` object can perform -this function. Please see `pandas.DataFrame.replace `__ for examples. 
+ tips.replace("Thur", "Thu") diff --git a/doc/source/getting_started/comparison/includes/column_operations.rst b/doc/source/getting_started/comparison/includes/column_operations.rst index bc5db8e6b8038..b23b931ed2db1 100644 --- a/doc/source/getting_started/comparison/includes/column_operations.rst +++ b/doc/source/getting_started/comparison/includes/column_operations.rst @@ -1,4 +1,4 @@ -pandas provides similar vectorized operations by specifying the individual ``Series`` in the +pandas provides vectorized operations by specifying the individual ``Series`` in the ``DataFrame``. New columns can be assigned in the same way. The :meth:`DataFrame.drop` method drops a column from the ``DataFrame``. diff --git a/doc/source/getting_started/index.rst b/doc/source/getting_started/index.rst index de47bd5b72148..cd5dfb84fee31 100644 --- a/doc/source/getting_started/index.rst +++ b/doc/source/getting_started/index.rst @@ -626,7 +626,7 @@ the pandas-equivalent operations compared to software you already know:
- Excel logo + Excel logo

Users of Excel or other spreadsheet programs will find that many of the concepts are transferable to pandas.

diff --git a/doc/source/getting_started/intro_tutorials/03_subset_data.rst b/doc/source/getting_started/intro_tutorials/03_subset_data.rst index fe3eae6c42959..4106b0e064823 100644 --- a/doc/source/getting_started/intro_tutorials/03_subset_data.rst +++ b/doc/source/getting_started/intro_tutorials/03_subset_data.rst @@ -268,6 +268,8 @@ For more dedicated functions on missing values, see the user guide section about
+.. _10min_tut_03_subset.rows_and_columns: + How do I select specific rows and columns from a ``DataFrame``? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/source/getting_started/intro_tutorials/09_timeseries.rst b/doc/source/getting_started/intro_tutorials/09_timeseries.rst index 598d3514baa15..b9cab0747196e 100644 --- a/doc/source/getting_started/intro_tutorials/09_timeseries.rst +++ b/doc/source/getting_started/intro_tutorials/09_timeseries.rst @@ -58,6 +58,8 @@ Westminster* in respectively Paris, Antwerp and London. How to handle time series data with ease? ----------------------------------------- +.. _10min_tut_09_timeseries.properties: + Using pandas datetime properties ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst index 9c9ad9538f488..1156ddd6da410 100644 --- a/doc/source/user_guide/io.rst +++ b/doc/source/user_guide/io.rst @@ -232,6 +232,8 @@ verbose : boolean, default ``False`` skip_blank_lines : boolean, default ``True`` If ``True``, skip over blank lines rather than interpreting as NaN values. +.. _io.read_csv_table.datetime: + Datetime handling +++++++++++++++++