Commit 55e624d

Merge pull request #5615 from gjreda/master
DOC: SQL to pandas comparison (#4524)
2 parents ffca57a + 8986592

File tree: 4 files changed, +388 -1 lines changed
doc/source/comparison_with_sql.rst

+380
.. currentmodule:: pandas
.. _compare_with_sql:

Comparison with SQL
********************

Since many potential pandas users have some familiarity with
`SQL <http://en.wikipedia.org/wiki/SQL>`_, this page is meant to provide some examples of how
various SQL operations would be performed using pandas.

If you're new to pandas, you might want to first read through :ref:`10 Minutes to Pandas <10min>`
to familiarize yourself with the library.

As is customary, we import pandas and numpy as follows:

.. ipython:: python

    import pandas as pd
    import numpy as np

Most of the examples will utilize the ``tips`` dataset found within the pandas tests. We'll read
the data into a DataFrame called ``tips`` and assume we have a database table of the same name and
structure.

.. ipython:: python

    url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
    tips = pd.read_csv(url)
    tips.head()

SELECT
------
In SQL, selection is done using a comma-separated list of columns you'd like to select (or a ``*``
to select all columns):

.. code-block:: sql

    SELECT total_bill, tip, smoker, time
    FROM tips
    LIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

.. ipython:: python

    tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Calling the DataFrame without the list of column names would display all columns (akin to SQL's
``*``).

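
A related point worth noting: passing a bare column name (rather than a one-element list) returns a ``Series`` instead of a DataFrame. A minimal sketch, using a hypothetical two-row stand-in for the ``tips`` table:

```python
import pandas as pd

# hypothetical miniature version of the tips table
tips = pd.DataFrame({'total_bill': [16.99, 10.34],
                     'tip': [1.01, 1.66],
                     'time': ['Dinner', 'Dinner']})

# a list of names returns a DataFrame (like SELECT total_bill, tip ...)
subset = tips[['total_bill', 'tip']]

# a bare name returns a Series (a single column of data)
bills = tips['total_bill']
```

The list form is the closer analogue to a SQL SELECT list, since it always yields a two-dimensional result.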

WHERE
-----
Filtering in SQL is done via a WHERE clause.

.. code-block:: sql

    SELECT *
    FROM tips
    WHERE time = 'Dinner'
    LIMIT 5;

DataFrames can be filtered in multiple ways, the most intuitive of which is using
`boolean indexing <http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing>`_.

.. ipython:: python

    tips[tips['time'] == 'Dinner'].head(5)

The above statement is simply passing a ``Series`` of True/False objects to the DataFrame,
returning all rows with True.

.. ipython:: python

    is_dinner = tips['time'] == 'Dinner'
    is_dinner.value_counts()
    tips[is_dinner].head(5)

Just like SQL's OR and AND, multiple conditions can be passed to a DataFrame using ``|`` (OR) and
``&`` (AND).

.. code-block:: sql

    -- tips of more than $5.00 at Dinner meals
    SELECT *
    FROM tips
    WHERE time = 'Dinner' AND tip > 5.00;

.. ipython:: python

    # tips of more than $5.00 at Dinner meals
    tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]

.. code-block:: sql

    -- tips by parties of at least 5 diners OR bill total was more than $45
    SELECT *
    FROM tips
    WHERE size >= 5 OR total_bill > 45;

.. ipython:: python

    # tips by parties of at least 5 diners OR bill total was more than $45
    tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]

NULL checking is done using the :meth:`~pandas.Series.notnull` and :meth:`~pandas.Series.isnull`
methods.

.. ipython:: python

    frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
                          'col2': ['F', np.NaN, 'G', 'H', 'I']})
    frame

Assume we have a table of the same structure as our DataFrame above. We can see only the records
where ``col2`` IS NULL with the following query:

.. code-block:: sql

    SELECT *
    FROM frame
    WHERE col2 IS NULL;

.. ipython:: python

    frame[frame['col2'].isnull()]

Getting items where ``col1`` IS NOT NULL can be done with :meth:`~pandas.Series.notnull`.

.. code-block:: sql

    SELECT *
    FROM frame
    WHERE col1 IS NOT NULL;

.. ipython:: python

    frame[frame['col1'].notnull()]


GROUP BY
--------
In pandas, SQL's GROUP BY operations are performed using the similarly named
:meth:`~pandas.DataFrame.groupby` method. :meth:`~pandas.DataFrame.groupby` typically refers to a
process where we'd like to split a dataset into groups, apply some function (typically an
aggregation), and then combine the groups together.

A common SQL operation would be getting the count of records in each group throughout a dataset.
For instance, a query getting us the number of tips left by sex:

.. code-block:: sql

    SELECT sex, count(*)
    FROM tips
    GROUP BY sex;
    /*
    Female     87
    Male      157
    */

The pandas equivalent would be:

.. ipython:: python

    tips.groupby('sex').size()

Notice that in the pandas code we used :meth:`~pandas.DataFrameGroupBy.size` and not
:meth:`~pandas.DataFrameGroupBy.count`. This is because :meth:`~pandas.DataFrameGroupBy.count`
applies the function to each column, returning the number of ``not null`` records within each.

.. ipython:: python

    tips.groupby('sex').count()

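
The ``size``/``count`` distinction can be seen directly on a toy frame containing a missing value (hypothetical data, not the ``tips`` dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sex': ['Female', 'Female', 'Male'],
                   'tip': [1.0, np.nan, 2.0]})

# size() counts rows per group, including the row with the NaN tip
sizes = df.groupby('sex').size()

# count() counts non-null values, so the NaN row is excluded
counts = df.groupby('sex')['tip'].count()
```

Here ``sizes`` reports two Female rows while ``counts`` reports only one non-null Female tip, which is exactly the difference described above.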

Alternatively, we could have applied the :meth:`~pandas.DataFrameGroupBy.count` method to an
individual column:

.. ipython:: python

    tips.groupby('sex')['total_bill'].count()

Multiple functions can also be applied at once. For instance, say we'd like to see how tip amount
differs by day of the week. :meth:`~pandas.DataFrameGroupBy.agg` allows you to pass a dictionary
to your grouped DataFrame, indicating which functions to apply to specific columns.

.. code-block:: sql

    SELECT day, AVG(tip), COUNT(*)
    FROM tips
    GROUP BY day;
    /*
    Fri   2.734737   19
    Sat   2.993103   87
    Sun   3.255132   76
    Thur  2.771452   62
    */

.. ipython:: python

    tips.groupby('day').agg({'tip': np.mean, 'day': np.size})

Grouping by more than one column is done by passing a list of columns to the
:meth:`~pandas.DataFrame.groupby` method.

.. code-block:: sql

    SELECT smoker, day, COUNT(*), AVG(tip)
    FROM tips
    GROUP BY smoker, day;
    /*
    smoker day
    No     Fri      4  2.812500
           Sat     45  3.102889
           Sun     57  3.167895
           Thur    45  2.673778
    Yes    Fri     15  2.714000
           Sat     42  2.875476
           Sun     19  3.516842
           Thur    17  3.030000
    */

.. ipython:: python

    tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})

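
Note that grouping on several columns returns a result indexed by a ``MultiIndex`` rather than the flat rows a GROUP BY query produces. If you want a SQL-like flat table, ``reset_index`` turns the group keys back into ordinary columns. A sketch on hypothetical data:

```python
import pandas as pd

# hypothetical stand-in for the tips table
df = pd.DataFrame({'smoker': ['No', 'No', 'Yes'],
                   'day': ['Fri', 'Sat', 'Fri'],
                   'tip': [2.0, 3.0, 4.0]})

# the aggregated result is indexed by (smoker, day) pairs
grouped = df.groupby(['smoker', 'day']).agg({'tip': 'mean'})

# reset_index converts the MultiIndex levels back into columns,
# mimicking the flat rows returned by a GROUP BY query
flat = grouped.reset_index()
```

This is purely presentational; the aggregated values are identical either way.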

.. _compare_with_sql.join:

JOIN
----
JOINs can be performed with :meth:`~pandas.DataFrame.join` or :meth:`~pandas.merge`. By default,
:meth:`~pandas.DataFrame.join` will join the DataFrames on their indices. Each method has
parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the
columns to join on (column names or indices).

.. ipython:: python

    df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                        'value': np.random.randn(4)})
    df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                        'value': np.random.randn(4)})

Assume we have two database tables of the same name and structure as our DataFrames.

Now let's go over the various types of JOINs.

INNER JOIN
~~~~~~~~~~
.. code-block:: sql

    SELECT *
    FROM df1
    INNER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    # merge performs an INNER JOIN by default
    pd.merge(df1, df2, on='key')

:meth:`~pandas.merge` also offers parameters for cases when you'd like to join one DataFrame's
column with another DataFrame's index.

.. ipython:: python

    indexed_df2 = df2.set_index('key')
    pd.merge(df1, indexed_df2, left_on='key', right_index=True)

LEFT OUTER JOIN
~~~~~~~~~~~~~~~
.. code-block:: sql

    -- show all records from df1
    SELECT *
    FROM df1
    LEFT OUTER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    # show all records from df1
    pd.merge(df1, df2, on='key', how='left')

RIGHT JOIN
~~~~~~~~~~
.. code-block:: sql

    -- show all records from df2
    SELECT *
    FROM df1
    RIGHT OUTER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    # show all records from df2
    pd.merge(df1, df2, on='key', how='right')

FULL JOIN
~~~~~~~~~
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the
joined columns find a match. As of this writing, FULL JOINs are not supported in all RDBMS
(e.g. MySQL).

.. code-block:: sql

    -- show all records from both tables
    SELECT *
    FROM df1
    FULL OUTER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    # show all records from both frames
    pd.merge(df1, df2, on='key', how='outer')

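
As an aside, recent pandas versions (an assumption about your install; the parameter did not exist when this page was written) let you pass ``indicator=True`` to ``merge``, which adds a ``_merge`` column recording which side each row came from. A sketch on small hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'lval': [1, 2]})
df2 = pd.DataFrame({'key': ['B', 'C'], 'rval': [3, 4]})

# _merge is 'left_only', 'right_only', or 'both' for each result row,
# which is handy for auditing a FULL OUTER JOIN
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
```

This makes it easy to see which rows found no match, something that in SQL you would infer from the NULLs on one side.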

UNION
-----
UNION ALL can be performed using :meth:`~pandas.concat`.

.. ipython:: python

    df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                        'rank': range(1, 4)})
    df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                        'rank': [1, 4, 5]})

.. code-block:: sql

    SELECT city, rank
    FROM df1
    UNION ALL
    SELECT city, rank
    FROM df2;
    /*
             city  rank
          Chicago     1
    San Francisco     2
    New York City     3
          Chicago     1
           Boston     4
      Los Angeles     5
    */

.. ipython:: python

    pd.concat([df1, df2])

SQL's UNION is similar to UNION ALL; however, UNION will remove duplicate rows.

.. code-block:: sql

    SELECT city, rank
    FROM df1
    UNION
    SELECT city, rank
    FROM df2;
    -- notice that there is only one Chicago record this time
    /*
             city  rank
          Chicago     1
    San Francisco     2
    New York City     3
           Boston     4
      Los Angeles     5
    */

.. ipython:: python

    pd.concat([df1, df2]).drop_duplicates()

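
One caveat: ``concat`` preserves each frame's original index, so the stacked result can contain duplicate index labels. Passing ``ignore_index=True`` renumbers the rows, which is closer to the anonymous rows a UNION query returns. A sketch on hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['Chicago', 'Boston'], 'rank': [1, 4]})
df2 = pd.DataFrame({'city': ['Chicago', 'Los Angeles'], 'rank': [1, 5]})

# default behavior: the index labels 0, 1, 0, 1 are carried over
stacked = pd.concat([df1, df2])

# ignore_index=True assigns a fresh 0..n-1 index instead
renumbered = pd.concat([df1, df2], ignore_index=True)
```

Whether the duplicated labels matter depends on what you do next; label-based lookups on ``stacked`` would return multiple rows per label.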

UPDATE
------
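
This section is still empty in this commit. As a rough sketch of the intended comparison (my own example, not from the original): SQL's UPDATE corresponds to assigning into a boolean-indexed selection with ``.loc``, here on a hypothetical tips-like frame:

```python
import pandas as pd

# hypothetical miniature tips table
tips = pd.DataFrame({'tip': [1.0, 2.0, 3.0],
                     'sex': ['Female', 'Male', 'Female']})

# SQL: UPDATE tips SET tip = tip * 2 WHERE tip < 2;
# .loc selects the matching rows and the column to assign into
tips.loc[tips['tip'] < 2, 'tip'] = tips['tip'] * 2
```

Only the rows satisfying the condition are modified; the rest of the column is left untouched.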

DELETE
------
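
Also a stub in this commit. As a sketch (again my own example): the usual pandas idiom for DELETE is to select the rows you want to keep, rather than dropping the rows you don't:

```python
import pandas as pd

# hypothetical miniature tips table
tips = pd.DataFrame({'tip': [1.0, 9.0, 3.0]})

# SQL: DELETE FROM tips WHERE tip > 5;
# pandas: keep the complement, i.e. rows where tip <= 5
tips = tips[tips['tip'] <= 5]
```

This produces a new filtered frame; nothing is modified in place, which is a deliberate difference from SQL's destructive DELETE.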

doc/source/index.rst

+1
@@ -132,5 +132,6 @@ See the package overview for more detail about what's in the library.
     r_interface
     related
     comparison_with_r
+    comparison_with_sql
     api
     release
