
 Comparison with SAS
 ********************
+
 For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__
 this page is meant to demonstrate how different SAS operations would be
 performed in pandas.

 .. include:: includes/introduction.rst

-.. note::
-
-   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
-   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
-   This is often used in interactive work (e.g. `Jupyter notebook
-   <https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be:
-
-   .. code-block:: sas
-
-      proc print data=df(obs=5);
-      run;

 Data structures
 ---------------
@@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly.
        "pandas/master/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
-   tips.head()
+   tips


 Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
@@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats
 such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*``
 function. See the :ref:`IO documentation<io>` for more details.

+Limiting output
+~~~~~~~~~~~~~~~
+
+.. include:: includes/limit.rst
+
+The equivalent in SAS would be:
+
+.. code-block:: sas
+
+   proc print data=df(obs=5);
+   run;
+
+
 Exporting data
 ~~~~~~~~~~~~~~

@@ -173,20 +176,8 @@ be used on new or existing columns.
        new_bill = total_bill / 2;
    run;

-pandas provides similar vectorized operations by
-specifying the individual ``Series`` in the ``DataFrame``.
-New columns can be assigned in the same way.
-
-.. ipython:: python
-
-   tips["total_bill"] = tips["total_bill"] - 2
-   tips["new_bill"] = tips["total_bill"] / 2.0
-   tips.head()
-
-.. ipython:: python
-   :suppress:
+.. include:: includes/column_operations.rst

-   tips = tips.drop("new_bill", axis=1)

 Filtering
 ~~~~~~~~~
@@ -278,18 +269,7 @@ drop, and rename columns.
        rename total_bill=total_bill_2;
    run;

-The same operations are expressed in pandas below.
-
-.. ipython:: python
-
-   # keep
-   tips[["sex", "total_bill", "tip"]].head()
-
-   # drop
-   tips.drop("sex", axis=1).head()
-
-   # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+.. include:: includes/column_selection.rst


 Sorting by values
@@ -308,8 +288,8 @@ Sorting in SAS is accomplished via ``PROC SORT``
 String processing
 -----------------

-Length
-~~~~~~
+Finding length of string
+~~~~~~~~~~~~~~~~~~~~~~~~

 SAS determines the length of a character string with the
 `LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
@@ -327,8 +307,8 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin
 .. include:: includes/length.rst


-Find
-~~~~
+Finding position of substring
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 SAS determines the position of a character in a string with the
 `FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
@@ -342,19 +322,11 @@ you supply as the second argument.
        put(FINDW(sex,'ale'));
    run;

-Python determines the position of a character in a string with the
-``find`` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-   tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst


-Substring
-~~~~~~~~~
+Extracting substring by position
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 SAS extracts a substring from a string based on its position with the
 `SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.
@@ -366,17 +338,11 @@ SAS extracts a substring from a string based on its position with the
        put(substr(sex,1,1));
    run;

-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
-
-.. ipython:: python
-
-   tips["sex"].str[0:1].head()
+.. include:: includes/extract_substring.rst


-Scan
-~~~~
+Extracting nth word
+~~~~~~~~~~~~~~~~~~~

 The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
 function returns the nth word from a string. The first argument is the string you want to parse and the
@@ -394,20 +360,11 @@ second argument specifies which word you want to extract.
    ;;;
    run;

-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
+.. include:: includes/nth_word.rst

-.. ipython:: python

-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
-   firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
-   firstlast
-
-
-Upcase, lowcase, and propcase
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Changing case
+~~~~~~~~~~~~~

 The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__
 `LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and
@@ -427,27 +384,13 @@ functions change the case of the argument.
    ;;;
    run;

-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
+.. include:: includes/case.rst

-.. ipython:: python
-
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["string_up"] = firstlast["String"].str.upper()
-   firstlast["string_low"] = firstlast["String"].str.lower()
-   firstlast["string_prop"] = firstlast["String"].str.title()
-   firstlast

 Merging
 -------

-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-   df1
-   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-   df2
+.. include:: includes/merge_setup.rst

 In SAS, data must be explicitly sorted before merging. Different
 types of joins are accomplished using the ``in=`` dummy
@@ -473,39 +416,15 @@ input frames.
        if a or b then output outer_join;
    run;

-pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
-similar functionality. Note that the data does not have
-to be sorted ahead of time, and different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-   inner_join = df1.merge(df2, on=["key"], how="inner")
-   inner_join
-
-   left_join = df1.merge(df2, on=["key"], how="left")
-   left_join
-
-   right_join = df1.merge(df2, on=["key"], how="right")
-   right_join
-
-   outer_join = df1.merge(df2, on=["key"], how="outer")
-   outer_join
+.. include:: includes/merge.rst


 Missing data
 ------------

-Like SAS, pandas has a representation for missing data - which is the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same, for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
+Both pandas and SAS have a representation for missing data.

-.. ipython:: python
-
-   outer_join
-   outer_join["value_x"] + outer_join["value_y"]
-   outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst

 One difference is that missing data cannot be compared to its sentinel value.
 For example, in SAS you could do this to filter missing values.
@@ -522,25 +441,7 @@ For example, in SAS you could do this to filter missing values.
        if value_x ^= .;
    run;

-Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data - some of
-which would be challenging to express in SAS. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-   outer_join.dropna()
-   outer_join.fillna(method="ffill")
-   outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst


 GroupBy
@@ -549,7 +450,7 @@ GroupBy
 Aggregation
 ~~~~~~~~~~~

-SAS's PROC SUMMARY can be used to group by one or
+SAS's ``PROC SUMMARY`` can be used to group by one or
 more key variables and compute aggregations on
 numeric columns.

@@ -561,14 +462,7 @@ numeric columns.
        output out=tips_summed sum=;
    run;

-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+.. include:: includes/groupby.rst


 Transformation
@@ -597,16 +491,7 @@ example, to subtract the mean for each observation by smoker group.
        if a and b;
    run;

-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-   gb = tips.groupby("smoker")["total_bill"]
-   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+.. include:: includes/transform.rst


 By group processing