Commit fef01c5

Jacob-Lazar and jl_win_a authored
DOC: add SPSS comparison guide structure (pandas-dev#60738)

* DOC: add SPSS comparison guide structure
  - Create SPSS comparison documentation
  - Add header and introduction sections
  - Terminology translation table
  - Create template for common operations comparison
  Part of pandas-dev#60727
* DOC: edit SPSS comparison guide to documentation
  - Added file to doc/source/getting_started/comparison/index.rst toctree
  - Fixed formatting and whitespace issues to meet documentation standards
* DOC: edit minor whitespaces in SPSS comparison guide
* DOC: standardize class references in SPSS guide
* DOC: Fix RST section underline lengths in SPSS comparison

---------

Co-authored-by: jl_win_a <jl@win-a>

1 parent b60e222 commit fef01c5

2 files changed: +230 -0 lines changed

@@ -0,0 +1,229 @@
.. _compare_with_spss:

{{ header }}

Comparison with SPSS
********************
For potential users coming from `SPSS <https://www.ibm.com/spss>`__, this page is meant to demonstrate
how various SPSS operations would be performed using pandas.

.. include:: includes/introduction.rst

Data structures
---------------

General terminology translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. csv-table::
    :header: "pandas", "SPSS"
    :widths: 20, 20

    :class:`DataFrame`, data file
    column, variable
    row, case
    groupby, split file
    ``NaN``, system-missing

:class:`DataFrame`
~~~~~~~~~~~~~~~~~~

A :class:`DataFrame` in pandas is analogous to an SPSS data file - a two-dimensional
data source with labeled columns that can be of different types. As this document shows,
many common SPSS operations can also be performed in pandas.

:class:`Series`
~~~~~~~~~~~~~~~

A :class:`Series` is the data structure that represents one column of a :class:`DataFrame`. SPSS doesn't have a
separate data structure for a single variable, but in general, working with a :class:`Series` is analogous
to working with a variable in SPSS.

:class:`Index`
~~~~~~~~~~~~~~

Every :class:`DataFrame` and :class:`Series` has an :class:`Index` -- labels on the *rows* of the data. SPSS does not
have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is
specified, a :class:`RangeIndex` is used by default (first row = 0, second row = 1, and so on).

While using a labeled :class:`Index` or :class:`MultiIndex` can enable sophisticated analyses and is ultimately an
important part of pandas to understand, for this comparison we will essentially ignore the :class:`Index` and
just treat the :class:`DataFrame` as a collection of columns. Please see the :ref:`indexing documentation<indexing>`
for much more on how to use an :class:`Index` effectively.
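
As a small illustration of the default index described above, the sketch below builds a toy
:class:`DataFrame` (the values are made up and are not the real ``tips`` data), checks that pandas
assigns a zero-based :class:`RangeIndex`, and shows one possible way to mimic SPSS-style case
numbers that start at 1. The names ``numbered`` and ``case`` are illustrative assumptions.

```python
import pandas as pd

# Toy data for illustration only; not the real tips dataset.
df = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01], "tip": [1.01, 1.66, 3.50]})

# With no index specified, pandas uses a zero-based RangeIndex.
default_index = df.index

# One way to get SPSS-like case numbers: relabel the rows so they start at 1.
numbered = df.copy()
numbered.index = pd.RangeIndex(start=1, stop=len(df) + 1, name="case")

# Label-based lookup of "case 1", which is the first row.
first_case = numbered.loc[1, "total_bill"]
```

Relabeling only changes the row labels; positional access with ``iloc`` stays zero-based either way.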


Copies vs. in-place operations
------------------------------

.. include:: includes/copies.rst


Data input / output
-------------------

Reading external data
~~~~~~~~~~~~~~~~~~~~~

Like SPSS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within
the pandas tests (`csv <https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv>`_),
will be used in many of the following examples.

In SPSS, you would use File > Open > Data to import a CSV file, or the equivalent ``GET DATA`` syntax:

.. code-block:: text

    GET DATA
      /TYPE=TXT
      /FILE='tips.csv'
      /DELIMITERS=","
      /FIRSTCASE=2
      /VARIABLES=col1 F8.2 col2 F8.2 col3 A8.
81+
82+
The pandas equivalent would use :func:`read_csv`:
83+
84+
.. code-block:: python
85+
86+
url = (
87+
"https://raw.githubusercontent.com/pandas-dev"
88+
"/pandas/main/pandas/tests/io/data/csv/tips.csv"
89+
)
90+
tips = pd.read_csv(url)
91+
tips

Like SPSS's data import wizard, ``read_csv`` can take a number of parameters to specify how the data should be parsed.
For example, if the data were instead tab-delimited and did not have column names, the pandas command would be:

.. code-block:: python

    tips = pd.read_csv("tips.csv", sep="\t", header=None)

    # alternatively, read_table works like read_csv but defaults to a tab delimiter
    tips = pd.read_table("tips.csv", header=None)
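
To make the parsing parameters concrete without needing a file on disk, the following sketch reads a
small tab-delimited string through ``io.StringIO``. The sample text and the column names passed to
``names`` are assumptions made for illustration; they are not part of the original guide.

```python
import io

import pandas as pd

# Hypothetical tab-delimited text with no header row (values are made up).
raw = "16.5\t1.5\tFemale\n10.25\t2.0\tMale\n"

# header=None tells pandas the first line is data, not column names;
# names= supplies the column labels that the file itself lacks.
df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["total_bill", "tip", "sex"],
)
```

Without ``names``, pandas would label the columns ``0``, ``1``, ``2``, which is roughly comparable to
SPSS assigning default variable names on import.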


Data operations
---------------

Filtering
~~~~~~~~~

In SPSS, filtering is done through Data > Select Cases:

.. code-block:: text

    SELECT IF (total_bill > 10).
    EXECUTE.

In pandas, boolean indexing can be used. Unlike ``SELECT IF``, which drops the unselected cases from the
active dataset, this returns a new :class:`DataFrame` and leaves ``tips`` unchanged:

.. code-block:: python

    tips[tips["total_bill"] > 10]
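
Boolean indexing also covers compound conditions. The sketch below uses a made-up four-row frame
(``tips_small`` is a hypothetical stand-in for ``tips``) and combines two conditions, roughly what
``SELECT IF (total_bill > 10 AND time = 'Dinner')`` would express in SPSS.

```python
import pandas as pd

# Toy stand-in for the tips dataset; values are made up.
tips_small = pd.DataFrame(
    {
        "total_bill": [16.99, 8.77, 21.01, 9.55],
        "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
    }
)

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than > and ==.
mask = (tips_small["total_bill"] > 10) & (tips_small["time"] == "Dinner")
filtered = tips_small[mask]
```

The original frame keeps all four rows; only ``filtered`` holds the subset.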


Sorting
~~~~~~~

In SPSS, sorting is done through Data > Sort Cases:

.. code-block:: text

    SORT CASES BY sex total_bill.
    EXECUTE.

In pandas, this would be written as:

.. code-block:: python

    tips.sort_values(["sex", "total_bill"])
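
Sort direction can be set per column. As a sketch on made-up data (``tips_small`` is hypothetical),
the following sorts by ``sex`` ascending and ``total_bill`` descending, which is roughly what
``SORT CASES BY sex (A) total_bill (D).`` would do in SPSS.

```python
import pandas as pd

# Toy stand-in for the tips dataset; values are made up.
tips_small = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "total_bill": [10.34, 16.99, 23.68, 24.59],
    }
)

# One ascending flag per sort key: sex ascending, total_bill descending.
ordered = tips_small.sort_values(["sex", "total_bill"], ascending=[True, False])
```

``sort_values`` returns a new, reordered :class:`DataFrame`; the row order of ``tips_small`` itself is
not changed.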


String processing
-----------------

Finding length of string
~~~~~~~~~~~~~~~~~~~~~~~~

In SPSS:

.. code-block:: text

    COMPUTE length = LENGTH(time).
    EXECUTE.

.. include:: includes/length.rst


Changing case
~~~~~~~~~~~~~

In SPSS, new string variables must be declared before they are computed:

.. code-block:: text

    STRING upper lower (A6).
    COMPUTE upper = UPCASE(time).
    COMPUTE lower = LOWER(time).
    EXECUTE.

.. include:: includes/case.rst


Merging
-------

In SPSS, merging data files is done through Data > Merge Files.

.. include:: includes/merge_setup.rst
.. include:: includes/merge.rst


GroupBy operations
------------------

Split-file processing
~~~~~~~~~~~~~~~~~~~~~

In SPSS, split-file analysis is done through Data > Split File:

.. code-block:: text

    SORT CASES BY sex.
    SPLIT FILE BY sex.
    DESCRIPTIVES VARIABLES=total_bill tip
      /STATISTICS=MEAN STDDEV MIN MAX.

The pandas equivalent would be:

.. code-block:: python

    tips.groupby("sex")[["total_bill", "tip"]].agg(["mean", "std", "min", "max"])
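
When different statistics are wanted for different columns, named aggregation is one option. The
sketch below runs on made-up data; ``tips_small`` and the output column names ``mean_bill`` and
``max_tip`` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the tips dataset; values are made up.
tips_small = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Male", "Female"],
        "total_bill": [10.0, 20.0, 30.0, 14.0],
        "tip": [1.0, 3.0, 5.0, 2.0],
    }
)

# Named aggregation: output_column=(input_column, statistic).
summary = tips_small.groupby("sex").agg(
    mean_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
)
```

Note that the rows did not need to be sorted by ``sex`` first; ``groupby`` collects the groups
regardless of row order, whereas the SPSS example sorts before ``SPLIT FILE``.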


Missing data
------------

SPSS displays system-missing numeric values as a period (``.``) and represents missing string values as blank strings.
pandas uses ``NaN`` (Not a Number) for numeric missing values and ``None`` or ``NaN`` for string
missing values.

.. include:: includes/missing.rst
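
As a brief sketch of how these markers behave in practice, the example below builds a made-up frame
containing both ``NaN`` and ``None`` and counts the missing entries per column. The data and variable
names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric column and one string column with missing entries.
df = pd.DataFrame(
    {
        "tip": [1.5, np.nan, 3.0],
        "day": ["Sun", None, np.nan],
    }
)

# isna() flags NaN and None alike, in numeric and string columns.
missing_counts = df.isna().sum()

# Reductions such as mean() skip missing values by default (skipna=True).
mean_tip = df["tip"].mean()
```

Skipping missing values in ``mean`` is broadly similar to how SPSS descriptive procedures leave
system-missing cases out of a statistic.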


Other considerations
--------------------

Output management
~~~~~~~~~~~~~~~~~

While pandas does not have a direct equivalent to SPSS's Output Management System (OMS), you can
capture and export results in various ways:

.. code-block:: python

    # Save summary statistics to CSV
    tips.groupby("sex")[["total_bill", "tip"]].mean().to_csv("summary.csv")

    # Save multiple results to Excel sheets
    with pd.ExcelWriter("results.xlsx") as writer:
        tips.describe().to_excel(writer, sheet_name="Descriptives")
        # numeric_only=True skips string columns, which mean() cannot average
        tips.groupby("sex").mean(numeric_only=True).to_excel(writer, sheet_name="Means by Gender")
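
To try the export step without creating files, results can be written to an in-memory buffer. This
is only a sketch on made-up data: ``tips_small`` is a hypothetical stand-in for ``tips``, and
``io.StringIO`` replaces the file path used above.

```python
import io

import pandas as pd

# Toy stand-in for the tips dataset; values are made up.
tips_small = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Male", "Female"],
        "smoker": ["No", "No", "Yes", "Yes"],
        "total_bill": [10.0, 20.0, 30.0, 14.0],
    }
)

# numeric_only=True restricts the mean to numeric columns such as total_bill.
means = tips_small.groupby("sex").mean(numeric_only=True)

# Write to an in-memory text buffer so the sketch leaves no files behind.
buffer = io.StringIO()
means.to_csv(buffer)
csv_text = buffer.getvalue()
```

The same ``to_csv`` call accepts a file path in place of the buffer when the result should be saved
to disk.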

doc/source/getting_started/comparison/index.rst

+1
@@ -14,3 +14,4 @@ Comparison with other tools
     comparison_with_spreadsheets
     comparison_with_sas
     comparison_with_stata
+    comparison_with_spss
