Comparison with SPSS

For potential users coming from SPSS, this page is meant to demonstrate how various SPSS operations would be performed using pandas.

Data structures

General terminology translation

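================  ================
pandas            SPSS
================  ================
`DataFrame`       data file
column            variable
row               case
groupby           split file
`NaN`             system-missing
================  ================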

`DataFrame`

A DataFrame in pandas is analogous to an SPSS data file - a two-dimensional data source with labeled columns that can be of different types. As will be shown in this document, almost any operation that can be performed in SPSS can also be accomplished in pandas.

`Series`

A Series is the data structure that represents one column of a DataFrame. SPSS doesn't have a separate data structure for a single variable, but in general, working with a Series is analogous to working with a variable in SPSS.
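As a minimal sketch of that relationship, selecting one column with single brackets returns a ``Series``. The three-row frame below is made up for illustration and only borrows column names from the tips data:

```python
import pandas as pd

# Hypothetical stand-in for a data file with two variables.
df = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "sex": ["Female", "Male", "Male"],
    }
)

# Single brackets select one column and return it as a Series.
bills = df["total_bill"]

print(type(bills).__name__)
print(bills.mean())
```

The ``Series`` keeps the column name and the row labels of the frame it came from, so results computed on it can be assigned back as a new column.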

`Index`

Every DataFrame and Series has an Index -- labels on the rows of the data. SPSS does not have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is specified, a RangeIndex is used by default (first row = 0, second row = 1, and so on).

While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a collection of columns. Please see the :ref:`indexing documentation<indexing>` for much more on how to use an Index effectively.
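The default index can be seen in a short sketch. The frame and its values are invented for illustration; no index is passed, so pandas assigns a ``RangeIndex``:

```python
import pandas as pd

# Hypothetical two-row frame created without an explicit index.
df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

print(df.index)           # RangeIndex(start=0, stop=2, step=1)
print(df.index.tolist())  # [0, 1]
```

Note that the labels start at 0, whereas SPSS numbers cases from 1.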

Copies vs. in place operations

Data input / output

Reading external data

Like SPSS, pandas provides utilities for reading in data from many formats. The tips dataset, found within the pandas tests as a CSV file, will be used in many of the following examples.

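In SPSS, you would use File > Open > Data to import a CSV file, or the equivalent ``GET DATA`` syntax:

.. code-block:: text

    * Variable names and formats below are placeholders.
    GET DATA
      /TYPE=TXT
      /FILE='tips.csv'
      /DELIMITERS=","
      /FIRSTCASE=2
      /VARIABLES=col1 F8.2 col2 F8.2 col3 A8.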

The pandas equivalent would use :func:`read_csv`:

.. ipython:: python

   url = (
       "https://raw.githubusercontent.com/pandas-dev"
       "/pandas/main/pandas/tests/io/data/csv/tips.csv"
   )
   tips = pd.read_csv(url)
   tips

Like SPSS's data import wizard, :func:`read_csv` can take a number of parameters to specify how the data should be parsed. For example, if the data were instead tab delimited and did not have column names, the pandas command would be:

.. code-block:: python

    tips = pd.read_csv("tips.csv", sep="\t", header=None)

    # alternatively, read_table is an alias to read_csv with tab delimiter
    tips = pd.read_table("tips.csv", header=None)
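To try these parameters without a file on disk, the sketch below parses an in-memory string. The two data rows and the column names passed to ``names`` are assumptions made for illustration:

```python
import io

import pandas as pd

# Hypothetical tab-delimited text with no header row.
raw = "16.99\t1.01\tFemale\n10.34\t1.66\tMale\n"

# header=None tells pandas the first line is data, so columns are numbered.
unnamed = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# names= supplies column labels when the file has none.
named = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["total_bill", "tip", "sex"],
)

print(unnamed.columns.tolist())  # [0, 1, 2]
print(named.columns.tolist())    # ['total_bill', 'tip', 'sex']
```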

Data operations

Filtering

In SPSS, filtering is done through Data > Select Cases:

.. code-block:: text

    SELECT IF (total_bill > 10).
    EXECUTE.

In pandas, boolean indexing can be used:

.. ipython:: python

    tips[tips["total_bill"] > 10]
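Several conditions can be combined in the same way. This is a sketch on a made-up four-row frame standing in for ``tips``; the name ``dinner_over_10`` is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the tips data.
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 8.77, 21.01, 9.60],
        "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
    }
)

# Each condition needs its own parentheses; & is element-wise AND, | is OR.
dinner_over_10 = tips[(tips["total_bill"] > 10) & (tips["time"] == "Dinner")]

print(dinner_over_10)
```

Unlike ``SELECT IF``, which drops the unselected cases from the active dataset, boolean indexing returns a new DataFrame and leaves ``tips`` unchanged.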

Sorting

In SPSS, sorting is done through Data > Sort Cases:

.. code-block:: text

    SORT CASES BY sex total_bill.
    EXECUTE.

In pandas, this would be written as:

.. ipython:: python

    tips.sort_values(["sex", "total_bill"])
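Sort direction can be set per column, much like ``SORT CASES BY sex (A) total_bill (D)`` in SPSS. The sketch below uses a made-up four-row frame in place of ``tips``; the name ``ordered`` is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the tips data.
tips = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "total_bill": [10.34, 16.99, 21.01, 24.59],
    }
)

# One ascending flag per sort key: sex ascending, total_bill descending.
ordered = tips.sort_values(["sex", "total_bill"], ascending=[True, False])

print(ordered)
```

``sort_values`` returns a sorted copy, so the result has to be assigned to a name if it is needed later.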

String processing

Finding length of string

In SPSS:

.. code-block:: text

    COMPUTE length = LENGTH(time).
    EXECUTE.
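In pandas, the length of each string can be found with ``Series.str.len()``. The sketch below uses a small made-up frame in place of ``tips``; the ``length_trimmed`` column and the trailing blank in the third value are assumptions added to show how ``rstrip`` excludes trailing blanks:

```python
import pandas as pd

# Hypothetical stand-in for the tips data; the last value has a trailing blank.
tips = pd.DataFrame({"time": ["Dinner", "Lunch", "Dinner "]})

# Length of each string, counting any trailing blanks.
tips["length"] = tips["time"].str.len()

# Strip trailing blanks first to get the length of the visible text.
tips["length_trimmed"] = tips["time"].str.rstrip().str.len()

print(tips)
```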

Changing case

In SPSS:

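.. code-block:: text

    * New string variables must be declared first; the A8 width is an assumption.
    STRING upper lower (A8).
    COMPUTE upper = UPCASE(time).
    COMPUTE lower = LOWER(time).
    EXECUTE.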
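A pandas counterpart, sketched on a made-up two-row frame standing in for ``tips``, uses ``Series.str.upper()`` and ``Series.str.lower()``:

```python
import pandas as pd

# Hypothetical stand-in for the tips data.
tips = pd.DataFrame({"time": ["Dinner", "Lunch"]})

# The .str accessor applies string methods to every value in the column.
tips["upper"] = tips["time"].str.upper()
tips["lower"] = tips["time"].str.lower()

print(tips)
```

New columns take their type from the assigned values, so no declaration step is needed before the assignment.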

Merging

In SPSS, merging data files is done through Data > Merge Files.
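A rough pandas counterpart is :func:`merge` for adding variables by key and :func:`concat` for adding cases. The frames ``df1`` and ``df2`` and their contents below are invented for illustration:

```python
import pandas as pd

# Two hypothetical data files that share a key variable.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [10, 20, 30, 40]})

# Add variables: match rows on the key column.
inner = df1.merge(df2, on="key", how="inner")  # keys present in both frames
left = df1.merge(df2, on="key", how="left")    # every row of df1
outer = df1.merge(df2, on="key", how="outer")  # keys from either frame

# Add cases: stack the rows of both frames and renumber the index.
stacked = pd.concat([df1, df2], ignore_index=True)

print(inner)
print(stacked)
```

Because both frames have a column called ``value``, the merged results carry the suffixes ``_x`` and ``_y`` to tell the two apart. Neither input needs to be sorted by the key beforehand.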

GroupBy operations

Split-file processing

In SPSS, split-file analysis is done through Data > Split File:

.. code-block:: text

    SORT CASES BY sex.
    SPLIT FILE BY sex.
    DESCRIPTIVES VARIABLES=total_bill tip
      /STATISTICS=MEAN STDDEV MIN MAX.

The pandas equivalent would be:

.. ipython:: python

    tips.groupby("sex")[["total_bill", "tip"]].agg(["mean", "std", "min", "max"])

Missing data

SPSS displays numeric system-missing values as a period (.) and uses blank spaces for string missing values. pandas uses ``NaN`` (Not a Number) for numeric missing values and ``None`` or ``NaN`` for string missing values.
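The usual ways of detecting, dropping, and filling missing values can be sketched as follows. The three-row frame is made up for illustration, and filling with the column mean is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing string.
tips = pd.DataFrame(
    {
        "total_bill": [16.99, np.nan, 21.01],
        "sex": ["Female", None, "Male"],
    }
)

# Count missing values per column.
print(tips.isna().sum())

# Keep only the rows with no missing values.
complete = tips.dropna()

# Fill the numeric column with its mean; mean() skips NaN by default.
filled = tips.fillna({"total_bill": tips["total_bill"].mean()})

print(filled)
```

Passing a dictionary to ``fillna`` limits the fill to the named columns, so the missing string in ``sex`` is left as it is.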

Other considerations

Output management

While pandas does not have a direct equivalent to SPSS's Output Management System (OMS), you can capture and export results in various ways:

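.. code-block:: python

    # Save summary statistics to CSV
    tips.groupby("sex")[["total_bill", "tip"]].mean().to_csv("summary.csv")

    # Save multiple results to Excel sheets (requires an Excel engine such as openpyxl)
    with pd.ExcelWriter("results.xlsx") as writer:
        tips.describe().to_excel(writer, sheet_name="Descriptives")
        # numeric_only=True skips string columns such as smoker, day, and time,
        # which would otherwise raise a TypeError in pandas 2.0 and later
        tips.groupby("sex").mean(numeric_only=True).to_excel(
            writer, sheet_name="Means by Gender"
        )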
