
Commit a7abb0e

DOC: existence docs and benchmarks.
1 parent d01c2f5 commit a7abb0e

2 files changed: +317 -0 lines changed


bench/bench_existence.py

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
from timeit import Timer
import pandas as pd
import matplotlib.pyplot as plt
import os


class Benchmarks(object):
    # Each factory builds its data once and returns a zero-argument closure,
    # so the Timer measures only the existence test itself. The factories are
    # looked up through Benchmarks.__dict__ and called as plain functions.

    # The "removed_" prefix keeps this benchmark out of the "time_" lookup below.
    def removed_time_py_list(look_for, look_in):
        l = range(look_in)
        df = pd.DataFrame(range(look_for))

        def time_this():
            df[[x in l for x in df.index.values]]

        return time_this

    def time_py_dict(look_for, look_in):
        l = range(look_in)
        l_dict = dict(zip(l, l))
        df = pd.DataFrame(range(look_for))

        def time_this():
            df[[x in l_dict for x in df.index.values]]

        return time_this

    def time_isin_list(look_for, look_in):
        l = range(look_in)
        df = pd.DataFrame(range(look_for))

        def time_this():
            df[df.index.isin(l)]

        return time_this

    def time_isin_dict(look_for, look_in):
        l = range(look_in)
        l_dict = dict(zip(l, l))
        df = pd.DataFrame(range(look_for))

        def time_this():
            df[df.index.isin(l_dict)]

        return time_this

    def time_isin_series(look_for, look_in):
        l = range(look_in)
        l_series = pd.Series(l)
        df = pd.DataFrame(range(look_for))

        def time_this():
            df[df.index.isin(l_series.index)]

        return time_this

    def time_join(look_for, look_in):
        l = range(look_in)
        l_series = pd.Series(l)
        l_series.name = 'data'
        df = pd.DataFrame(range(look_for))

        def time_this():
            df.join(l_series, how='inner')

        return time_this

    def time_query_eqeq(look_for, look_in):
        l = range(look_in)
        s = pd.Series(l)
        s.name = 'data'
        df = pd.DataFrame(range(look_for))

        def time_this():
            # query() resolves @l_series from the local scope
            l_series = s
            df.query('index == @l_series')

        return time_this

    def time_query_in(look_for, look_in):
        l = range(look_in)
        s = pd.Series(l)
        s.name = 'data'
        df = pd.DataFrame(range(look_for))

        def time_this():
            # query() resolves @l_series from the local scope
            l_series = s
            df.query('index in @l_series')

        return time_this


def run_bench(to_time, repeat, look_in, num_look_for_rows, y_limit, filename):
    func_results = []
    plt.figure()

    for time_func_name in to_time:
        plot_results = []
        for look_for in num_look_for_rows:
            func = Benchmarks.__dict__[time_func_name](look_for, look_in)
            t = Timer(func)
            # Average time per call, in seconds
            elapsed = t.timeit(number=repeat) / repeat
            name = time_func_name.replace('time_', '')
            func_results.append((name, look_for, look_in, elapsed))
            plot_results.append(elapsed)
        plt.plot(num_look_for_rows, plot_results, label=name)

    # gca() reuses the axes that were just plotted on
    plt.gca().set_xscale('log')
    x1, x2, y1, y2 = plt.axis()
    plt.axis((x1, x2, 0, y_limit))

    plt.legend(loc=2, prop={'size': 8})
    plt.title('Look in %s Rows' % look_in)
    plt.xlabel('Look For X Rows')
    plt.ylabel('Time(s)')
    plt.savefig(filename)
    plt.clf()


if __name__ == '__main__':

    pandas_dir = os.path.dirname(os.path.abspath(os.path.dirname(__file__)))
    static_path = os.path.join(pandas_dir, 'doc', 'source', '_static')

    join = lambda p: os.path.join(static_path, p)

    to_time = [key for key in Benchmarks.__dict__ if key.startswith('time_')]

    num_look_for_rows = [10 * 2**i for i in range(1, 21)]
    # Floor division keeps the slice bound an integer
    half = len(num_look_for_rows) // 2

    filename = join('existence-perf-small.png')
    run_bench(to_time, 10, 5000, num_look_for_rows[0:half], 0.004, filename)

    filename = join('existence-perf-large.png')
    run_bench(to_time, 3, 5000000, num_look_for_rows[half:], 10, filename)

doc/source/enhancingperf.rst

Lines changed: 177 additions & 0 deletions
@@ -668,3 +668,180 @@ In general, :meth:`DataFrame.query`/:func:`pandas.eval` will
evaluate the subexpressions that *can* be evaluated by ``numexpr`` and those
that must be evaluated in Python space transparently to the user. This is done
by inferring the result type of an expression from its arguments and operators.

Existence (IsIn, Inner Join, Dict/Hash, Query)
----------------------------------------------------

There are several ways to test for existence with pandas, that is, to select the
rows whose index (or column) values appear in another collection. The comments in
the snippets below correspond to the legend in the plots further down.

:meth:`Index.isin`

.. code-block:: python

   # isin_list
   df[df.index.isin(lst)]
   # isin_dict
   df[df.index.isin(dct)]
   # isin_series
   df[df.index.isin(series)]



:meth:`DataFrame.query`

.. code-block:: python

   # query_in list
   df.query('index in @lst')
   # query_in Series
   df.query('index in @series')
   # query_in dict
   df.query('index in @dct')

   # query_eqeq list
   df.query('index == @lst')
   # query_eqeq Series
   df.query('index == @series')

   # a dict raises an error with '=='



:meth:`Series.apply`

.. code-block:: python

   # apply works on a column; for the index use df.index.map instead
   df[df.ID.apply(lambda x: x in lst)]



:meth:`DataFrame.join`

.. code-block:: python

   # join
   df.join(series, how='inner')

   # py_dict (plain list comprehension); this can actually be fast for small DataFrames
   df[[x in dct for x in df.index]]

   # isin_series, isin_dict, isin_list, query_eqeq Series, query_in Series,
   # py_dict and join are included in the plots below.

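As a minimal sketch of how these approaches line up, the following example checks that several of them select the same rows. The names ``df``, ``lst``, ``dct`` and ``series`` reuse the placeholders from the snippets above; the sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the index values matter for the existence test.
df = pd.DataFrame({'ID': [10, 20, 30, 40, 50]}, index=[0, 2, 4, 6, 8])
lst = [2, 3, 4, 5]                               # keys to look for
dct = dict(zip(lst, lst))                        # the same keys as a dict
series = pd.Series(lst, index=lst, name='data')  # the same keys as values and as index

by_isin_list = df[df.index.isin(lst)]
by_isin_series = df[df.index.isin(series)]       # isin compares against the Series values
by_query_in = df.query('index in @lst')
by_py_dict = df[[x in dct for x in df.index]]
by_join = df.join(series, how='inner')           # matches on the index of both objects

# Every approach keeps only the rows whose index is also in the lookup keys.
results = [by_isin_list, by_isin_series, by_query_in, by_py_dict, by_join]
selected = [sorted(result.index) for result in results]
```

Note that ``join`` also appends the ``Series`` as a new column, so its result has one more column than the others even though the selected rows are the same.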
As seen below, using a ``Series`` is generally faster than using pure Python data
structures for anything larger than very small datasets of around 1000 records.
The two fastest approaches are ``query('index == @series')`` and ``join(series)``:

.. code-block:: python

   from pandas import DataFrame, Series

   lst = range(1000000)
   series = Series(lst, name='data')

   df = DataFrame(lst, columns=['ID'])

   df.query('index == @series')
   # 10 loops, best of 3: 82.9 ms per loop

   df.join(series, how='inner')
   # 100 loops, best of 3: 19.2 ms per loop

A list versus a ``Series``:

.. code-block:: python

   df[df.index.isin(lst)]
   # 1 loops, best of 3: 1.06 s per loop

   df[df.index.isin(series)]
   # 1 loops, best of 3: 477 ms per loop

Using ``df.index`` versus a column makes no difference here:

.. code-block:: python

   df[df.ID.isin(series)]
   # 1 loops, best of 3: 474 ms per loop

   df[df.index.isin(series)]
   # 1 loops, best of 3: 475 ms per loop

``query`` with 'in' performs the same as ``isin``, and so does '==' against a list:

.. code-block:: python

   df.query('index in @lst')
   # 1 loops, best of 3: 1.04 s per loop

   df.query('index in @series')
   # 1 loops, best of 3: 451 ms per loop

   df.query('index == @lst')
   # 1 loops, best of 3: 1.03 s per loop

Against a ``Series``, however, '==' is quite a bit faster than 'in', though still
not as fast as ``join``:

.. code-block:: python

   df.query('index == @series')
   # 10 loops, best of 3: 80.5 ms per loop

For the best performance with ``join``, the data to match must be the index of both the
``DataFrame`` and the ``Series``. The ``Series`` must also have a ``name``. ``join`` defaults
to a left join, so ``how='inner'`` must be specified for an existence test.

.. code-block:: python

   df.join(series, how='inner')
   # 100 loops, best of 3: 19.7 ms per loop

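The three requirements above can be illustrated with a small sketch. The data is hypothetical; the point is only to show the difference between the default left join and ``how='inner'``, and what happens when the ``Series`` has no ``name``.

```python
import pandas as pd

# Hypothetical data: index values 2 and 3 exist on both sides.
df = pd.DataFrame({'ID': ['a', 'b', 'c']}, index=[1, 2, 3])
series = pd.Series([20, 30, 40], index=[2, 3, 4], name='data')

left = df.join(series)                # default how='left': every row of df is kept
inner = df.join(series, how='inner')  # only index values present in both are kept

# A Series without a name cannot be joined, because its name becomes the column label.
unnamed = pd.Series([20, 30, 40], index=[2, 3, 4])
try:
    df.join(unnamed, how='inner')
    unnamed_error = None
except ValueError as exc:
    unnamed_error = exc
```

With the left join, the row with index ``1`` is kept and its ``data`` value is missing, so it cannot serve as an existence test without further filtering.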
Smaller datasets:

.. code-block:: python

   df = DataFrame([1,2,3,4], columns=['ID'])
   lst = range(10000)
   dct = dict(zip(lst, lst))
   series = Series(lst, name='data')

   df.join(series, how='inner')
   # 1000 loops, best of 3: 866 us per loop

   df[df.ID.isin(dct)]
   # 1000 loops, best of 3: 809 us per loop

   df[df.ID.isin(lst)]
   # 1000 loops, best of 3: 853 us per loop

   df[df.ID.isin(series)]
   # 100 loops, best of 3: 2.22 ms per loop

For these small cases it is actually faster to use ``apply`` or a list comprehension:

.. code-block:: python

   df[[x in dct for x in df.ID]]
   # 1000 loops, best of 3: 266 us per loop

   df[df.ID.apply(lambda x: x in dct)]
   # 1000 loops, best of 3: 364 us per loop


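The timings quoted above depend on the machine and the pandas version, so it can be worth re-measuring them. The following is a rough sketch using the standard ``timeit`` module rather than IPython's ``%timeit``; the ``candidates`` dictionary and the repeat count are assumptions made for this example.

```python
from timeit import timeit

import pandas as pd

# Same small-data setup as in the snippet above.
df = pd.DataFrame([1, 2, 3, 4], columns=['ID'])
lst = list(range(10000))
dct = dict(zip(lst, lst))
series = pd.Series(lst, name='data')

# Hypothetical labels mapped to zero-argument callables to time.
candidates = {
    'isin_list': lambda: df[df.ID.isin(lst)],
    'isin_series': lambda: df[df.ID.isin(series)],
    'py_dict': lambda: df[[x in dct for x in df.ID]],
    'apply_dict': lambda: df[df.ID.apply(lambda x: x in dct)],
}

number = 100  # calls per measurement; an arbitrary choice for this sketch
timings = {name: timeit(func, number=number) / number
           for name, func in candidates.items()}

# Print the average time per call, fastest first.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print('%-12s %8.1f us per call' % (name, seconds * 1e6))
```

No expected output is shown because the numbers, and possibly the ranking, will differ between environments.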
Here is a visualization of some of the benchmarks above. Except for very small
datasets, ``isin(Series)``, ``join(Series)``, and ``query('index == @series')``
quickly become faster than the pure Python data structures.

.. image:: _static/existence-perf-small.png

However, ``isin(Series)`` still scales poorly as the data grows, whereas ``join`` stays quite
fast for large datasets. There is some overhead in making your data the index of both the left
and right datasets, but that cost should be clearly outweighed by the gains of the join itself.
For extremely large datasets you may start running into memory limits, since ``join`` does not
perform any disk chunking.

.. image:: _static/existence-perf-large.png
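When the lookup keys are too large to handle at once, one possible workaround is to run the existence test one chunk of keys at a time and combine the boolean masks. This is only a sketch under that assumption: ``isin_chunked`` is a hypothetical helper, not a pandas function, and in practice the chunks would typically come from a chunked reader rather than from an in-memory list.

```python
import numpy as np
import pandas as pd


def isin_chunked(index, keys, chunk_size):
    """Hypothetical helper: OR together ``isin`` masks, one chunk of keys at a time."""
    mask = np.zeros(len(index), dtype=bool)
    for start in range(0, len(keys), chunk_size):
        # Index.isin returns a boolean NumPy array aligned with ``index``
        mask |= index.isin(keys[start:start + chunk_size])
    return mask


df = pd.DataFrame({'ID': range(10)})
keys = [1, 3, 5, 7, 11, 13]

mask = isin_chunked(df.index, keys, chunk_size=2)
subset = df[mask]
```

This trades the speed of a single ``join`` for a bounded working set per step, so it is only worth considering once memory actually becomes the constraint.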
