Commit 6f11e02

DOC: add section on string processing to spreadsheet comparison
Matches section headings from SAS and Stata pages.
1 parent 4beb027 commit 6f11e02

2 files changed

+77
-4
lines changed

doc/source/_static/excel_find.png

67.5 KB

doc/source/getting_started/comparison/comparison_with_spreadsheets.rst

+77-4
@@ -52,9 +52,12 @@ pandas, if no index is specified, a :class:`~pandas.RangeIndex` is used by defau
5252
second row = 1, and so on), analogous to row headings/numbers in spreadsheets.
5353

5454
In pandas, indexes can be set to one (or multiple) unique values, which is like having a column that
55-
use use as the row identifier in a worksheet. Unlike spreadsheets, these ``Index`` values can actually be
56-
used to reference the rows. For example, in spreadsheets, you would reference the first row as ``A1:Z1``,
57-
while in pandas you could use ``populations.loc['Chicago']``.
55+
is used as the row identifier in a worksheet. Unlike most spreadsheets, these ``Index`` values can
56+
actually be used to reference the rows. (Note that `this can be done in Excel with structured
57+
references
58+
<https://support.microsoft.com/en-us/office/using-structured-references-with-excel-tables-f5ed2452-2337-4f71-bed3-c8ae6d2b276e>`_.)
59+
For example, in spreadsheets, you would reference the first row as ``A1:Z1``, while in pandas you
60+
could use ``populations.loc['Chicago']``.
5861

5962
Index values are also persistent, so if you re-order the rows in a ``DataFrame``, the label for a
6063
particular row doesn't change.
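
A minimal sketch of label-based lookup and label persistence. The ``populations`` Series here is hypothetical, and its numbers are placeholders rather than real population figures.

```python
import pandas as pd

# Hypothetical data: the values are placeholders, not real populations.
populations = pd.Series(
    [100, 200, 300],
    index=["Chicago", "Boston", "Denver"],
    name="population",
)

# Look up a row by its index label rather than by its position.
chicago = populations.loc["Chicago"]

# Re-ordering the rows moves Chicago to the last position,
# but the label still points at the same value.
reordered = populations.sort_values(ascending=False)
chicago_after_sort = reordered.loc["Chicago"]

print(chicago, chicago_after_sort)
```

The label keeps working after the sort, whereas a positional reference such as ``A1:Z1`` would now point at a different row.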
@@ -247,11 +250,81 @@ Sorting by values
247250
Sorting in spreadsheets is accomplished via `the sort dialog <https://support.microsoft.com/en-us/office/sort-data-in-a-range-or-table-62d0b95d-2a90-4610-a6ae-2e545c4a4654>`_.
248251

249252
.. image:: ../../_static/excel_sort.png
250-
:alt: Screenshot dialog from Excel showing sorting by the sex then total_bill columns
253+
:alt: Screenshot of dialog from Excel showing sorting by the sex then total_bill columns
251254
:align: center
252255

253256
.. include:: includes/sorting.rst
254257

258+
String processing
259+
-----------------
260+
261+
Finding length of string
262+
~~~~~~~~~~~~~~~~~~~~~~~~
263+
264+
In spreadsheets, the number of characters in text can be found with the `LEN
265+
<https://support.microsoft.com/en-us/office/len-lenb-functions-29236f94-cedc-429d-affd-b5e33d2c67cb>`_
266+
function. This can be used with the `TRIM
267+
<https://support.microsoft.com/en-us/office/trim-function-410388fa-c5df-49c6-b16c-9e5630b479f9>`_
268+
function to remove extra whitespace.
269+
270+
.. code-block::
271+
272+
=LEN(TRIM(A2))
273+
274+
.. include:: includes/length.rst
275+
276+
Note that the pandas version still counts repeated spaces inside the string, whereas ``TRIM`` collapses them to single spaces, so the two aren't exactly equivalent.
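
The contents of ``includes/length.rst`` are not shown in this diff. As a rough sketch of the difference, assuming a hypothetical Series ``s`` of made-up strings, stripping alone leaves internal runs of spaces in place, while an extra replacement step gets closer to ``TRIM``:

```python
import pandas as pd

# Hypothetical sample strings with leading, trailing, and doubled spaces.
s = pd.Series(["  Dinner  ", "Late  lunch "])

# Strip leading/trailing whitespace, then count characters.
stripped_len = s.str.strip().str.len()

# Approximate TRIM: also collapse runs of spaces to a single space.
trim_like_len = (
    s.str.strip()
    .str.replace(r" +", " ", regex=True)
    .str.len()
)

print(stripped_len.tolist(), trim_like_len.tolist())
```

The second string differs by one character between the two results because only the ``TRIM``-like version collapses its doubled space.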
277+
278+
279+
Finding position of substring
280+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
281+
282+
The `FIND
283+
<https://support.microsoft.com/en-us/office/find-findb-functions-c7912941-af2a-4bdf-a553-d0d89b0a0628>`_
284+
spreadsheet function returns the position of a substring, with the first character being ``1``.
285+
286+
.. image:: ../../_static/excel_find.png
287+
:alt: Screenshot of FIND formula being used in Excel
288+
:align: center
289+
290+
.. include:: includes/find_substring.rst
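
The contents of ``includes/find_substring.rst`` are not shown in this diff. As a sketch of the indexing difference, assuming a hypothetical Series ``s``: ``Series.str.find`` is zero-based and returns ``-1`` when the substring is absent, so one is added to get a ``FIND``-style position.

```python
import pandas as pd

# Hypothetical sample strings.
s = pd.Series(["Dinner", "Lunch"])

# Zero-based position of the substring; -1 means "not found".
pos = s.str.find("un")

# Shift found positions to one-based to match FIND; keep -1 as the
# "not found" marker (Excel would return an error there instead).
one_based = pos.where(pos < 0, pos + 1)

print(pos.tolist(), one_based.tolist())
```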
291+
292+
293+
Extracting substring by position
294+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
295+
296+
Spreadsheets have a `MID
297+
<https://support.microsoft.com/en-us/office/mid-midb-functions-d5f9e25c-d7d6-472e-b568-4ecb12433028>`_
298+
function for extracting a substring from a given position. To get the first character:
299+
300+
.. code-block::
301+
302+
=MID(A2,1,1)
303+
304+
.. include:: includes/extract_substring.rst
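
The contents of ``includes/extract_substring.rst`` are not shown in this diff. A minimal sketch, assuming a hypothetical Series ``s``: ``MID(text, start, n)`` corresponds roughly to the slice ``s.str[start - 1:start - 1 + n]``, because Python slices are zero-based and exclude the end position.

```python
import pandas as pd

# Hypothetical sample strings.
s = pd.Series(["Dinner", "Lunch"])

# Equivalent of =MID(A2,1,1): the first character.
first_char = s.str[0:1]

# Equivalent of =MID(A2,2,3): three characters starting at position 2.
start, n = 2, 3
middle = s.str[start - 1:start - 1 + n]

print(first_char.tolist(), middle.tolist())
```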
305+
306+
307+
Extracting nth word
308+
~~~~~~~~~~~~~~~~~~~
309+
310+
In Excel, you might use the `Text to Columns Wizard
311+
<https://support.microsoft.com/en-us/office/split-text-into-different-columns-with-the-convert-text-to-columns-wizard-30b14928-5550-41f5-97ca-7a3e9c363ed7>`_
312+
for splitting text and retrieving a specific column. (Note `it's possible to do so through a formula
313+
as well <https://exceljet.net/formula/extract-nth-word-from-text-string>`_.)
314+
315+
.. include:: includes/nth_word.rst
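
The contents of ``includes/nth_word.rst`` are not shown in this diff. As a sketch, assuming a hypothetical Series ``names`` of space-separated words, ``Series.str.split`` can stand in for both approaches: ``expand=True`` yields one column per word, similar to Text to Columns, while ``.str[n]`` picks a single word.

```python
import pandas as pd

# Hypothetical sample data.
names = pd.Series(["John Smith", "Jane Cook"])

# One column per word, similar to the Text to Columns result.
columns = names.str.split(" ", expand=True)

# Pick individual words by (zero-based) position.
first_word = names.str.split(" ").str[0]
last_word = names.str.rsplit(" ", n=1).str[-1]

print(columns)
print(first_word.tolist(), last_word.tolist())
```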
316+
317+
318+
Changing case
319+
~~~~~~~~~~~~~
320+
321+
Spreadsheets provide `UPPER, LOWER, and PROPER functions
322+
<https://support.microsoft.com/en-us/office/change-the-case-of-text-01481046-0fa7-4f3b-a693-496795a7a44d>`_
323+
for converting text to upper, lower, and title case, respectively.
324+
325+
.. include:: includes/case.rst
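
The contents of ``includes/case.rst`` are not shown in this diff. A minimal sketch of the pandas counterparts, assuming a hypothetical Series ``names``: ``str.upper``, ``str.lower``, and ``str.title`` play the roles of ``UPPER``, ``LOWER``, and ``PROPER``.

```python
import pandas as pd

# Hypothetical sample data with mixed casing.
names = pd.Series(["John Smith", "jane COOK"])

upper = names.str.upper()   # like UPPER
lower = names.str.lower()   # like LOWER
title = names.str.title()   # like PROPER

print(upper.tolist(), lower.tolist(), title.tolist())
```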
326+
327+
255328
Other considerations
256329
--------------------
257330
