Commit 4beb027

DOC: add section on data operations to spreadsheet comparison
Match the structure of that section from the SAS/Stata comparison pages.
1 parent f90a7c8 commit 4beb027

7 files changed

+127
-40
lines changed

138 KB

doc/source/_static/excel_filter.png

238 KB

doc/source/_static/excel_sort.png

243 KB

doc/source/getting_started/comparison/comparison_with_spreadsheets.rst

+122-39
Original file line number | Diff line number | Diff line change
@@ -133,8 +133,127 @@ By default, desktop spreadsheet software will save to its respective file format
133133

134134
:ref:`pandas can create Excel files <io.excel_writer>`, :ref:`CSV <io.store_in_csv>`, or :ref:`a number of other formats <io>`.
135135

136-
Commonly used spreadsheet functionalities
137-
-----------------------------------------
136+
Data operations
137+
---------------
138+
139+
Operations on columns
140+
~~~~~~~~~~~~~~~~~~~~~
141+
142+
In spreadsheets, `formulas
143+
<https://support.microsoft.com/en-us/office/overview-of-formulas-in-excel-ecfdc708-9162-49e8-b993-c311f47ca173>`_
144+
are often created in individual cells and then `dragged
145+
<https://support.microsoft.com/en-us/office/copy-a-formula-by-dragging-the-fill-handle-in-excel-for-mac-dd928259-622b-473f-9a33-83aa1a63e218>`_
146+
into other cells to compute them for other columns. In pandas, you can operate on whole columns
147+
directly.
148+
149+
.. include:: includes/column_operations.rst
150+
151+
Note that we don't have to spell out that subtraction cell by cell; pandas applies it to the whole
152+
column for us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`.
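
The included file ``includes/column_operations.rst`` is not shown in this diff, so the following is only a minimal sketch of the idea. The class-roster frame and its values are hypothetical stand-ins, chosen to echo the ``student_count`` and ``girls_count`` columns from the removed example.

```python
import pandas as pd

# Hypothetical stand-in for a class-roster sheet; the values are illustrative only.
df = pd.DataFrame(
    {
        "class": ["A", "B", "C"],
        "student_count": [30, 25, 28],
        "girls_count": [16, 12, 15],
    }
)

# A single expression subtracts the two columns element-wise for every row,
# which replaces dragging a formula down a spreadsheet column.
df["boys_count"] = df["student_count"] - df["girls_count"]

# DataFrame.drop returns a new frame without the named column.
trimmed = df.drop(columns=["girls_count"])
print(trimmed)
```

The subtraction is written once against the whole column, so adding rows later needs no change to the expression.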
153+
154+
155+
Filtering
156+
~~~~~~~~~
157+
158+
`In Excel, filtering is done through a graphical menu. <https://support.microsoft.com/en-us/office/filter-data-in-a-range-or-table-01832226-31b5-4568-8806-38c37dcc180e>`_
159+
160+
.. image:: ../../_static/excel_filter.png
161+
:alt: Screenshot showing filtering of the total_bill column to values greater than 10
162+
:align: center
163+
164+
.. include:: includes/filtering.rst
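
As a rough illustration of the filter in the screenshot (``total_bill`` greater than 10), a boolean mask can select the matching rows. The small ``tips`` frame below is made up for the sketch and is not the real tips dataset.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this sketch.
tips = pd.DataFrame(
    {
        "total_bill": [8.50, 16.99, 10.00, 24.59],
        "tip": [1.50, 1.01, 2.00, 3.61],
    }
)

# The comparison yields a boolean Series; indexing with it keeps the True rows.
expensive = tips[tips["total_bill"] > 10]
print(expensive)
```

Unlike a spreadsheet filter, which hides rows in place, this produces a new ``DataFrame`` and leaves ``tips`` unchanged.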
165+
166+
If/then logic
167+
~~~~~~~~~~~~~
168+
169+
Let's say we want to make a ``bucket`` column with values of ``low`` and ``high``, based on whether
170+
the ``total_bill`` is less than $10 (``low``) or $10 and above (``high``).
171+
172+
In spreadsheets, logical comparison can be done with `conditional formulas
173+
<https://support.microsoft.com/en-us/office/create-conditional-formulas-ca916c57-abd8-4b44-997c-c309b7307831>`_.
174+
We'd use a formula of ``=IF(A2 < 10, "low", "high")``, dragged to all cells in a new ``bucket``
175+
column.
176+
177+
.. image:: ../../_static/excel_conditional.png
178+
:alt: Screenshot showing the formula from above in a bucket column of the tips spreadsheet
179+
:align: center
180+
181+
.. include:: includes/if_then.rst
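
One way to mirror the ``=IF(A2 < 10, "low", "high")`` formula is ``numpy.where``, sketched below on invented data. Whether ``includes/if_then.rst`` uses exactly this call is an assumption, since that file is not part of this diff.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; not the real tips dataset.
tips = pd.DataFrame({"total_bill": [8.50, 16.99, 10.00, 24.59]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once.
tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
print(tips)
```

A bill of exactly 10 lands in ``high``, matching the strict ``<`` in the spreadsheet formula.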
182+
183+
Date functionality
184+
~~~~~~~~~~~~~~~~~~
185+
186+
*This section will refer to "dates", but timestamps are handled similarly.*
187+
188+
We can think of date functionality in two parts: parsing, and output. In spreadsheets, date values
189+
are generally parsed automatically, though there is a `DATEVALUE
190+
<https://support.microsoft.com/en-us/office/datevalue-function-df8b07d4-7761-4a93-bc33-b7471bbff252>`_
191+
function if you need it. In pandas, you need to explicitly convert plain text to datetime objects,
192+
either :ref:`while reading from a CSV <io.read_csv_table.datetime>` or :ref:`once in a DataFrame
193+
<10min_tut_09_timeseries.properties>`.
194+
195+
Once parsed, spreadsheets display the dates in a default format, though `the format can be changed
196+
<https://support.microsoft.com/en-us/office/format-a-date-the-way-you-want-8e10019e-d5d8-47a1-ba95-db95123d273e>`_.
197+
In pandas, you'll generally want to keep dates as ``datetime`` objects while you're doing
198+
calculations with them. Outputting *parts* of dates (such as the year) is done through `date
199+
functions
200+
<https://support.microsoft.com/en-us/office/date-and-time-functions-reference-fd1b5961-c1ae-4677-be58-074152f97b81>`_
201+
in spreadsheets, and :ref:`datetime properties <10min_tut_09_timeseries.properties>` in pandas.
202+
203+
Given ``date1`` and ``date2`` in columns ``A`` and ``B`` of a spreadsheet, you might have these
204+
formulas:
205+
206+
.. list-table::
207+
    :header-rows: 1
208+
    :widths: auto
209+
210+
    * - column
211+
      - formula
212+
    * - ``date1_year``
213+
      - ``=YEAR(A2)``
214+
    * - ``date2_month``
215+
      - ``=MONTH(B2)``
216+
    * - ``date1_next``
217+
      - ``=DATE(YEAR(A2),MONTH(A2)+1,1)``
218+
    * - ``months_between``
219+
      - ``=DATEDIF(A2,B2,"M")``
220+
221+
The equivalent pandas operations are shown below.
222+
223+
.. include:: includes/time_date.rst
224+
225+
See :ref:`timeseries` for more details.
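
The four formulas in the table could be approximated in pandas as in the sketch below. The frame and its dates are hypothetical, and the ``months_between`` arithmetic is one possible choice that may not match what ``includes/time_date.rst`` does.

```python
import pandas as pd

# Hypothetical date columns stored as plain text, as they might arrive from a CSV.
df = pd.DataFrame(
    {
        "date1": ["2013-01-15", "2014-11-30"],
        "date2": ["2015-02-15", "2015-01-01"],
    }
)

# Parsing is explicit in pandas: convert the text to datetime values first.
df["date1"] = pd.to_datetime(df["date1"])
df["date2"] = pd.to_datetime(df["date2"])

# Parts of a date are exposed through the .dt accessor.
df["date1_year"] = df["date1"].dt.year
df["date2_month"] = df["date2"].dt.month

# First day of the month after date1, like DATE(YEAR(A2), MONTH(A2)+1, 1).
df["date1_next"] = df["date1"] + pd.offsets.MonthBegin()

# Number of calendar-month boundaries crossed between date1 and date2.
df["months_between"] = (df["date2"].dt.year - df["date1"].dt.year) * 12 + (
    df["date2"].dt.month - df["date1"].dt.month
)
print(df)
```

Note that this ``months_between`` ignores the day of the month, so it can differ from ``DATEDIF(..., "M")``, which counts only complete months.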
226+
227+
228+
Selection of columns
229+
~~~~~~~~~~~~~~~~~~~~
230+
231+
In spreadsheets, you can select columns you want by:
232+
233+
- `Hiding columns <https://support.microsoft.com/en-us/office/hide-or-show-rows-or-columns-659c2cad-802e-44ee-a614-dde8443579f8>`_
234+
- `Deleting columns <https://support.microsoft.com/en-us/office/insert-or-delete-rows-and-columns-6f40e6e4-85af-45e0-b39d-65dd504a3246>`_
235+
- `Referencing a range <https://support.microsoft.com/en-us/office/create-or-change-a-cell-reference-c7b8b95d-c594-4488-947e-c835903cebaa>`_ from one worksheet into another
236+
237+
Since spreadsheet columns are typically `named in a header row
238+
<https://support.microsoft.com/en-us/office/turn-excel-table-headers-on-or-off-c91d1742-312c-4480-820f-cf4b534c8b3b>`_,
239+
renaming a column is simply a matter of changing the text in that first cell.
240+
241+
.. include:: includes/column_selection.rst
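
The spreadsheet actions listed above (keeping some columns, deleting a column, renaming a header) have rough pandas counterparts, sketched here on a made-up frame. The new column name ``total_bill_2`` is purely illustrative.

```python
import pandas as pd

# Made-up rows shaped like the tips data; the values are illustrative only.
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34],
        "tip": [1.01, 1.66],
        "sex": ["Female", "Male"],
        "size": [2, 3],
    }
)

# Keep only the listed columns, in the order given.
subset = tips[["sex", "total_bill", "tip"]]

# Drop a column; the original frame is left untouched.
without_sex = tips.drop("sex", axis=1)

# Rename a column by mapping the old name to the new one.
renamed = tips.rename(columns={"total_bill": "total_bill_2"})
print(renamed.columns.tolist())
```

Each call returns a new ``DataFrame``, so the three results can be built independently from the same ``tips`` frame.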
242+
243+
244+
Sorting by values
245+
~~~~~~~~~~~~~~~~~
246+
247+
Sorting in spreadsheets is accomplished via `the sort dialog <https://support.microsoft.com/en-us/office/sort-data-in-a-range-or-table-62d0b95d-2a90-4610-a6ae-2e545c4a4654>`_.
248+
249+
.. image:: ../../_static/excel_sort.png
250+
:alt: Screenshot of the Excel sort dialog showing sorting by the sex column, then the total_bill column
251+
:align: center
252+
253+
.. include:: includes/sorting.rst
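
A two-level sort like the one in the dialog (``sex``, then ``total_bill``) can be sketched with :meth:`DataFrame.sort_values`. The rows below are invented for the example.

```python
import pandas as pd

# Invented rows for illustration; not the real tips dataset.
tips = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "total_bill": [20.0, 15.0, 10.0, 5.0],
    }
)

# Sort by sex first, then by total_bill within each sex (ascending by default).
sorted_tips = tips.sort_values(["sex", "total_bill"])
print(sorted_tips)
```

The original row labels are kept in the result, which makes it easy to see how the rows moved.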
254+
255+
Other considerations
256+
--------------------
138257

139258
Fill Handle
140259
~~~~~~~~~~~
@@ -157,21 +276,6 @@ This can be achieved by creating a series and assigning it to the desired cells.
157276
158277
df
159278
160-
Filters
161-
~~~~~~~
162-
163-
Filters can be achieved by using slicing.
164-
165-
The examples filter by 0 on column AAA, and also show how to filter by multiple
166-
values.
167-
168-
.. ipython:: python
169-
170-
df[df.AAA == 0]
171-
172-
df[(df.AAA == 0) | (df.AAA == 2)]
173-
174-
175279
Drop Duplicates
176280
~~~~~~~~~~~~~~~
177281

@@ -192,7 +296,6 @@ This is supported in pandas via :meth:`~DataFrame.drop_duplicates`.
192296
193297
df.drop_duplicates(["class", "student_count"])
194298
195-
196299
Pivot Tables
197300
~~~~~~~~~~~~
198301

@@ -203,6 +306,7 @@ let's find the average gratuity by size of the party and sex of the server.
203306
In Excel, we use the following configuration for the PivotTable:
204307

205308
.. image:: ../../_static/excel_pivot.png
309+
:alt: Screenshot showing a PivotTable in Excel, using sex as the column, size as the rows, then average tip as the values
206310
:align: center
207311

208312
The equivalent in pandas:
@@ -213,27 +317,6 @@ The equivalent in pandas:
213317
tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average
214318
)
215319
216-
Formulas
217-
~~~~~~~~
218-
219-
In spreadsheets, `formulas <https://support.microsoft.com/en-us/office/overview-of-formulas-in-excel-ecfdc708-9162-49e8-b993-c311f47ca173>`_
220-
are often created in individual cells and then `dragged <https://support.microsoft.com/en-us/office/copy-a-formula-by-dragging-the-fill-handle-in-excel-for-mac-dd928259-622b-473f-9a33-83aa1a63e218>`_
221-
into other cells to compute them for other columns. In pandas, you'll be doing more operations on
222-
full columns.
223-
224-
As an example, let's create a new column "girls_count" and try to compute the number of boys in
225-
each class.
226-
227-
.. ipython:: python
228-
229-
df["girls_count"] = [21, 12, 21, 31, 23, 17]
230-
df
231-
df["boys_count"] = df["student_count"] - df["girls_count"]
232-
df
233-
234-
Note that we aren't having to tell it to do that subtraction cell-by-cell — pandas handles that for
235-
us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`.
236-
237320
VLOOKUP
238321
~~~~~~~
239322

doc/source/getting_started/comparison/includes/column_operations.rst

+1-1
@@ -1,4 +1,4 @@
1-
pandas provides similar vectorized operations by specifying the individual ``Series`` in the
1+
pandas provides vectorized operations by specifying the individual ``Series`` in the
22
``DataFrame``. New columns can be assigned in the same way. The :meth:`DataFrame.drop` method drops
33
a column from the ``DataFrame``.
44

doc/source/getting_started/intro_tutorials/09_timeseries.rst

+2
@@ -58,6 +58,8 @@ Westminster* in respectively Paris, Antwerp and London.
5858
How to handle time series data with ease?
5959
-----------------------------------------
6060

61+
.. _10min_tut_09_timeseries.properties:
62+
6163
Using pandas datetime properties
6264
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6365

doc/source/user_guide/io.rst

+2
@@ -232,6 +232,8 @@ verbose : boolean, default ``False``
232232
skip_blank_lines : boolean, default ``True``
233233
If ``True``, skip over blank lines rather than interpreting as NaN values.
234234

235+
.. _io.read_csv_table.datetime:
236+
235237
Datetime handling
236238
+++++++++++++++++
237239
