Commit e2f83df

Merge remote-tracking branch 'upstream/master' into move-metadata-to-cfg
2 parents b25bf90 + 8f26de1 commit e2f83df

31 files changed, +354 -472 lines changed

doc/source/getting_started/comparison/comparison_with_sas.rst

+39 -154
@@ -4,23 +4,13 @@
44

55
Comparison with SAS
66
********************
7+
78
For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__
89
this page is meant to demonstrate how different SAS operations would be
910
performed in pandas.
1011

1112
.. include:: includes/introduction.rst
1213

13-
.. note::
14-
15-
Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
16-
``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
17-
This is often used in interactive work (e.g. `Jupyter notebook
18-
<https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be:
19-
20-
.. code-block:: sas
21-
22-
proc print data=df(obs=5);
23-
run;
2414

2515
Data structures
2616
---------------
@@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly.
120110
"pandas/master/pandas/tests/io/data/csv/tips.csv"
121111
)
122112
tips = pd.read_csv(url)
123-
tips.head()
113+
tips
124114
125115
126116
Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
@@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats
138128
such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*``
139129
function. See the :ref:`IO documentation<io>` for more details.
140130

131+
Limiting output
132+
~~~~~~~~~~~~~~~
133+
134+
.. include:: includes/limit.rst
135+
136+
The equivalent in SAS would be:
137+
138+
.. code-block:: sas
139+
140+
proc print data=df(obs=5);
141+
run;
142+
143+
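As a rough pandas counterpart to the ``PROC PRINT`` snippet above, ``DataFrame.head`` returns the first rows of a frame. This is only a sketch: the small frame below is a hypothetical stand-in for ``tips`` with made-up values, and it does not reproduce the contents of ``includes/limit.rst``.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; values are illustrative only.
df = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77]})

# head(n) returns the first n rows (n defaults to 5), much like obs=5 in SAS.
first_five = df.head(5)
print(first_five)
```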
141144
Exporting data
142145
~~~~~~~~~~~~~~
143146

@@ -173,20 +176,8 @@ be used on new or existing columns.
173176
new_bill = total_bill / 2;
174177
run;
175178
176-
pandas provides similar vectorized operations by
177-
specifying the individual ``Series`` in the ``DataFrame``.
178-
New columns can be assigned in the same way.
179-
180-
.. ipython:: python
181-
182-
tips["total_bill"] = tips["total_bill"] - 2
183-
tips["new_bill"] = tips["total_bill"] / 2.0
184-
tips.head()
185-
186-
.. ipython:: python
187-
:suppress:
179+
.. include:: includes/column_operations.rst
188180

189-
tips = tips.drop("new_bill", axis=1)
190181
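For readers viewing this diff without the included file, the column arithmetic that the ``DATA`` step performs can be sketched in pandas as below. The frame is a hypothetical stand-in for ``tips`` with made-up values; the operations mirror the example removed in this hunk.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; values are illustrative only.
tips = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01], "tip": [1.01, 1.66, 3.50]})

# Arithmetic on a Series is applied element-wise to every row.
tips["total_bill"] = tips["total_bill"] - 2
# Assigning to a new label creates a new column.
tips["new_bill"] = tips["total_bill"] / 2
print(tips)

# Remove the temporary column again so later examples see the original layout.
tips = tips.drop("new_bill", axis=1)
```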

191182
Filtering
192183
~~~~~~~~~
@@ -278,18 +269,7 @@ drop, and rename columns.
278269
rename total_bill=total_bill_2;
279270
run;
280271
281-
The same operations are expressed in pandas below.
282-
283-
.. ipython:: python
284-
285-
# keep
286-
tips[["sex", "total_bill", "tip"]].head()
287-
288-
# drop
289-
tips.drop("sex", axis=1).head()
290-
291-
# rename
292-
tips.rename(columns={"total_bill": "total_bill_2"}).head()
272+
.. include:: includes/column_selection.rst
293273

294274
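The keep, drop, and rename operations from the SAS example can be sketched in pandas as follows. The frame is a hypothetical stand-in for ``tips`` with made-up values; each call returns a new ``DataFrame`` and leaves the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; values are illustrative only.
tips = pd.DataFrame(
    {
        "sex": ["Female", "Male"],
        "total_bill": [16.99, 10.34],
        "tip": [1.01, 1.66],
        "size": [2, 3],
    }
)

# keep: select a list of columns
kept = tips[["sex", "total_bill", "tip"]]

# drop: remove a column by label
dropped = tips.drop("sex", axis=1)

# rename: map old column names to new ones
renamed = tips.rename(columns={"total_bill": "total_bill_2"})
```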

295275
Sorting by values
@@ -308,8 +288,8 @@ Sorting in SAS is accomplished via ``PROC SORT``
308288
String processing
309289
-----------------
310290

311-
Length
312-
~~~~~~
291+
Finding length of string
292+
~~~~~~~~~~~~~~~~~~~~~~~~
313293

314294
SAS determines the length of a character string with the
315295
`LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
@@ -327,8 +307,8 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin
327307
.. include:: includes/length.rst
328308

329309

330-
Find
331-
~~~~
310+
Finding position of substring
311+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
332312

333313
SAS determines the position of a character in a string with the
334314
`FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
@@ -342,19 +322,11 @@ you supply as the second argument.
342322
put(FINDW(sex,'ale'));
343323
run;
344324
345-
Python determines the position of a character in a string with the
346-
``find`` function. ``find`` searches for the first position of the
347-
substring. If the substring is found, the function returns its
348-
position. Keep in mind that Python indexes are zero-based and
349-
the function will return -1 if it fails to find the substring.
350-
351-
.. ipython:: python
352-
353-
tips["sex"].str.find("ale").head()
325+
.. include:: includes/find_substring.rst
354326

355327
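A minimal pandas sketch of a substring search, using ``Series.str.find`` on a hypothetical ``sex`` column with made-up values. Unlike the SAS result, the returned positions are zero-based, and ``-1`` signals that the substring was not found.

```python
import pandas as pd

# Hypothetical values standing in for the "sex" column of tips.
sex = pd.Series(["Female", "Male", "n/a"])

# Position of the first occurrence of "ale" in each string (zero-based).
positions = sex.str.find("ale")
print(positions)
```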

356-
Substring
357-
~~~~~~~~~
328+
Extracting substring by position
329+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
358330

359331
SAS extracts a substring from a string based on its position with the
360332
`SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.
@@ -366,17 +338,11 @@ SAS extracts a substring from a string based on its position with the
366338
put(substr(sex,1,1));
367339
run;
368340
369-
With pandas you can use ``[]`` notation to extract a substring
370-
from a string by position locations. Keep in mind that Python
371-
indexes are zero-based.
372-
373-
.. ipython:: python
374-
375-
tips["sex"].str[0:1].head()
341+
.. include:: includes/extract_substring.rst
376342

377343
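In pandas, slicing through the ``.str`` accessor extracts a substring by position. The sketch below uses a hypothetical ``sex`` column with made-up values; note that Python slices are zero-based, so ``[0:1]`` corresponds to ``substr(sex,1,1)`` in SAS.

```python
import pandas as pd

# Hypothetical values standing in for the "sex" column of tips.
sex = pd.Series(["Female", "Male"])

# Take the first character of every string.
first_char = sex.str[0:1]
print(first_char)
```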

378-
Scan
379-
~~~~
344+
Extracting nth word
345+
~~~~~~~~~~~~~~~~~~~
380346

381347
The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
382348
function returns the nth word from a string. The first argument is the string you want to parse and the
@@ -394,20 +360,11 @@ second argument specifies which word you want to extract.
394360
;;;
395361
run;
396362
397-
Python extracts a substring from a string based on its text
398-
by using regular expressions. There are much more powerful
399-
approaches, but this just shows a simple approach.
363+
.. include:: includes/nth_word.rst
400364
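One simple way to pull the nth word out of a string in pandas is to split on whitespace and pick a column of the result. The sketch below reuses the ``firstlast`` frame from the removed example; note that it takes column ``1`` for the last name, since indexing the split result with ``[0]`` selects the first word regardless of whether ``split`` or ``rsplit`` is used.

```python
import pandas as pd

firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})

# expand=True spreads the words into separate columns numbered from 0.
words = firstlast["String"].str.split(" ", expand=True)
firstlast["First_Name"] = words[0]
firstlast["Last_Name"] = words[1]
print(firstlast)
```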

401-
.. ipython:: python
402365

403-
firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
404-
firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
405-
firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
406-
firstlast
407-
408-
409-
Upcase, lowcase, and propcase
410-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
366+
Changing case
367+
~~~~~~~~~~~~~
411368

412369
The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__
413370
`LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and
@@ -427,27 +384,13 @@ functions change the case of the argument.
427384
;;;
428385
run;
429386
430-
The equivalent Python functions are ``upper``, ``lower``, and ``title``.
387+
.. include:: includes/case.rst
431388

432-
.. ipython:: python
433-
434-
firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
435-
firstlast["string_up"] = firstlast["String"].str.upper()
436-
firstlast["string_low"] = firstlast["String"].str.lower()
437-
firstlast["string_prop"] = firstlast["String"].str.title()
438-
firstlast
439389
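The pandas string methods ``upper``, ``lower``, and ``title`` play the roles of ``UPCASE``, ``LOWCASE``, and ``PROPCASE``. A short sketch, reusing the ``firstlast`` frame from the removed example:

```python
import pandas as pd

firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})

firstlast["string_up"] = firstlast["String"].str.upper()
firstlast["string_low"] = firstlast["String"].str.lower()
# title() capitalizes the first letter of each word.
firstlast["string_prop"] = firstlast["String"].str.title()
print(firstlast)
```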

440390
Merging
441391
-------
442392

443-
The following tables will be used in the merge examples
444-
445-
.. ipython:: python
446-
447-
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
448-
df1
449-
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
450-
df2
393+
.. include:: includes/merge_setup.rst
451394

452395
In SAS, data must be explicitly sorted before merging. Different
453396
types of joins are accomplished using the ``in=`` dummy
@@ -473,39 +416,15 @@ input frames.
473416
if a or b then output outer_join;
474417
run;
475418
476-
pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
477-
similar functionality. Note that the data does not have
478-
to be sorted ahead of time, and different join
479-
types are accomplished via the ``how`` keyword.
480-
481-
.. ipython:: python
482-
483-
inner_join = df1.merge(df2, on=["key"], how="inner")
484-
inner_join
485-
486-
left_join = df1.merge(df2, on=["key"], how="left")
487-
left_join
488-
489-
right_join = df1.merge(df2, on=["key"], how="right")
490-
right_join
491-
492-
outer_join = df1.merge(df2, on=["key"], how="outer")
493-
outer_join
419+
.. include:: includes/merge.rst
494420

495421
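The four join types from the SAS ``merge`` step can be sketched with ``DataFrame.merge`` and its ``how`` keyword. The keys match the removed example; the numeric values are fixed, made-up numbers (the original used random data) so that the result is reproducible. No prior sorting is needed.

```python
import pandas as pd

# Same keys as the removed example; values are illustrative constants.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5.0, 6.0, 7.0, 8.0]})

# Overlapping non-key columns get the suffixes _x (left) and _y (right).
inner_join = df1.merge(df2, on="key", how="inner")
left_join = df1.merge(df2, on="key", how="left")
right_join = df1.merge(df2, on="key", how="right")
outer_join = df1.merge(df2, on="key", how="outer")
print(outer_join)
```

Because ``D`` appears twice in ``df2``, every join that includes ``D`` produces two rows for that key.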

496422
Missing data
497423
------------
498424

499-
Like SAS, pandas has a representation for missing data - which is the
500-
special float value ``NaN`` (not a number). Many of the semantics
501-
are the same, for example missing data propagates through numeric
502-
operations, and is ignored by default for aggregations.
425+
Both pandas and SAS have a representation for missing data.
503426

504-
.. ipython:: python
505-
506-
outer_join
507-
outer_join["value_x"] + outer_join["value_y"]
508-
outer_join["value_x"].sum()
427+
.. include:: includes/missing_intro.rst
509428
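As an illustration of how missing values behave in pandas, the sketch below uses a small made-up frame (the column names follow the merge example, the values are assumptions). ``NaN`` propagates through element-wise arithmetic and is skipped by default in aggregations such as ``sum``.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in each column.
df = pd.DataFrame({"value_x": [1.0, np.nan, 3.0], "value_y": [10.0, 20.0, np.nan]})

# Any row with a missing operand yields NaN.
total = df["value_x"] + df["value_y"]

# Aggregations ignore NaN by default (skipna=True).
col_sum = df["value_x"].sum()
print(total)
print(col_sum)
```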

510429
One difference is that missing data cannot be compared to its sentinel value.
511430
For example, in SAS you could do this to filter missing values.
@@ -522,25 +441,7 @@ For example, in SAS you could do this to filter missing values.
522441
if value_x ^= .;
523442
run;
524443
525-
Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
526-
should be used for comparisons.
527-
528-
.. ipython:: python
529-
530-
outer_join[pd.isna(outer_join["value_x"])]
531-
outer_join[pd.notna(outer_join["value_x"])]
532-
533-
pandas also provides a variety of methods to work with missing data - some of
534-
which would be challenging to express in SAS. For example, there are methods to
535-
drop all rows with any missing values, replacing missing values with a specified
536-
value, like the mean, or forward filling from previous rows. See the
537-
:ref:`missing data documentation<missing_data>` for more.
538-
539-
.. ipython:: python
540-
541-
outer_join.dropna()
542-
outer_join.fillna(method="ffill")
543-
outer_join["value_x"].fillna(outer_join["value_x"].mean())
444+
.. include:: includes/missing.rst
544445

545446
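A sketch of filtering and filling missing values in pandas, again on a small made-up frame. Comparing against ``np.nan`` with ``==`` never matches, so ``isna`` and ``notna`` are used instead; ``dropna``, ``ffill``, and ``fillna`` cover the common clean-up cases.

```python
import numpy as np
import pandas as pd

# Illustrative frame; column name follows the merge example.
df = pd.DataFrame({"value_x": [1.0, np.nan, 3.0]})

# NaN is not equal to anything, including itself, so this finds no rows.
naive_matches = int((df["value_x"] == np.nan).sum())

# Use isna / notna to select missing or present rows.
missing_rows = df[df["value_x"].isna()]
present_rows = df[df["value_x"].notna()]

dropped = df.dropna()                                      # drop rows with any NaN
forward_filled = df.ffill()                                # carry last valid value forward
mean_filled = df["value_x"].fillna(df["value_x"].mean())   # replace NaN with the mean
```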

546447
GroupBy
@@ -549,7 +450,7 @@ GroupBy
549450
Aggregation
550451
~~~~~~~~~~~
551452

552-
SAS's PROC SUMMARY can be used to group by one or
453+
SAS's ``PROC SUMMARY`` can be used to group by one or
553454
more key variables and compute aggregations on
554455
numeric columns.
555456

@@ -561,14 +462,7 @@ numeric columns.
561462
output out=tips_summed sum=;
562463
run;
563464
564-
pandas provides a flexible ``groupby`` mechanism that
565-
allows similar aggregations. See the :ref:`groupby documentation<groupby>`
566-
for more details and examples.
567-
568-
.. ipython:: python
569-
570-
tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
571-
tips_summed.head()
465+
.. include:: includes/groupby.rst
572466

573467
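The ``PROC SUMMARY`` aggregation above corresponds roughly to a pandas ``groupby`` followed by ``sum``. The frame below is a hypothetical stand-in for ``tips`` with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; values are illustrative only.
tips = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Male", "Female"],
        "smoker": ["No", "No", "No", "Yes"],
        "total_bill": [10.0, 20.0, 30.0, 40.0],
        "tip": [1.0, 2.0, 3.0, 4.0],
    }
)

# Group by the key columns, then sum the selected numeric columns.
tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
print(tips_summed)
```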

574468
Transformation
@@ -597,16 +491,7 @@ example, to subtract the mean for each observation by smoker group.
597491
if a and b;
598492
run;
599493
600-
601-
pandas ``groupby`` provides a ``transform`` mechanism that allows
602-
these type of operations to be succinctly expressed in one
603-
operation.
604-
605-
.. ipython:: python
606-
607-
gb = tips.groupby("smoker")["total_bill"]
608-
tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
609-
tips.head()
494+
.. include:: includes/transform.rst
610495

611496
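The SAS approach above computes group means and merges them back onto the original data. In pandas, ``transform`` returns a result aligned with the original rows, so the same adjustment fits in one step. The frame is a hypothetical stand-in for ``tips`` with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; values are illustrative only.
tips = pd.DataFrame(
    {
        "smoker": ["No", "No", "Yes", "Yes"],
        "total_bill": [10.0, 20.0, 30.0, 50.0],
    }
)

gb = tips.groupby("smoker")["total_bill"]
# transform("mean") broadcasts each group's mean back to its member rows.
tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
print(tips)
```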

612497
By group processing
