Commit cf284ce

DOC: existence docs and benchmarks.
1 parent a7abb0e commit cf284ce

2 files changed: 31 additions, 39 deletions

bench/bench_existence.py

Lines changed: 12 additions & 11 deletions
@@ -68,18 +68,19 @@ def time_this():
         df.join(l_series, how='inner')
 
     return time_this
-
-def time_query_eqeq(look_for, look_in):
-    l = range(look_in)
-    s = pd.Series(l)
-    s.name = 'data'
-    df = pd.DataFrame(range(look_for))
-
-    def time_this():
-        l_series = s
-        df.query('index == @l_series')
 
-    return time_this
+# Removed. This functionality might be a bug in query('.. == ..').
+# def time_query_eqeq(look_for, look_in):
+# l = range(look_in)
+# s = pd.Series(l)
+# s.name = 'data'
+# df = pd.DataFrame(range(look_for))
+
+# def time_this():
+# l_series = s
+# df.query('index == @l_series')
+
+# return time_this
 
 def time_query_in(look_for, look_in):
     l = range(look_in)
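
Review note: each benchmark factory in this file builds its data once and returns a zero-argument `time_this` closure. The harness that calls those closures is not part of this diff, so the sketch below only assumes they could be driven with the standard-library `timeit` module. `make_membership_timer` and its pure-Python body are hypothetical stand-ins, not code from the repository.

```python
import timeit


def make_membership_timer(look_for, look_in):
    """Hypothetical factory with the same shape as the time_* functions."""
    # Build the data once, outside the closure, so that only the
    # existence test itself is timed.
    candidates = range(look_for)
    haystack = set(range(look_in))

    def time_this():
        return [x for x in candidates if x in haystack]

    return time_this


timer = make_membership_timer(look_for=1000, look_in=500)
found = timer()
# Best of three repeats of 10 calls each, in seconds.
best = min(timeit.repeat(timer, number=10, repeat=3))
print(f"{len(found)} matches, best of 3: {best:.6f} s per 10 calls")
```

Keeping setup outside the closure is what lets the reported time reflect the lookup alone, which appears to be the intent of the factories above.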

doc/source/enhancingperf.rst

Lines changed: 19 additions & 28 deletions
@@ -672,9 +672,13 @@ by inferring the result type of an expression from its arguments and operators.
 Existence (IsIn, Inner Join, Dict/Hash, Query)
 ----------------------------------------------------
 
-There are a number of different ways to test for existence using pandas. The
-following methods can be used to achieve an existence test. The comments correspond
-to the legend in the plots further down.
+Existence is the process of testing if an item exists in another list of items, and
+in the case of a DataFrame, we're testing each value of a column for existence in
+another collection of items.
+
+There are a number of different ways to test for existence using pandas and the
+following methods are a few of those. The comments correspond to the legend
+in the plots further down.
 
 
 :meth:`DataFrame.isin`
@@ -694,20 +698,19 @@ to the legend in the plots further down.
 
 .. code-block:: python
 
+   # The '@' symbol is used with `query` to reference local variables. Names
+   # without '@' will reference the DataFrame's columns or index.
+
    # query_in list
    df.query('index in @lst')
    # query_in Series
    df.query('index in @series')
-   # query_in dict
-   df.query('index in @dct')
 
-   # query_eqeq list
-   df.query('index == @lst')
-   # query_eqeq Series
-   df.query('index == @series')
-
-   # dict actually throws an error with '=='
+   # A list can be used with `query('.. == ..')` to test for existence
+   # but other data structures such as the `pandas.Series` have
+   # a different behaviour.
 
+   df.query('index == @lst')
 
 
 :meth:`DataFrame.apply`
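
Review note on the hunk above: the new comment says that `@` makes `query` look up a local variable, while a bare name refers to a column or the index. The sketch below illustrates that rule on a small frame. The frame, the column names `limit` and `value`, and the local variables are made up for illustration and do not come from the documentation being changed.

```python
import pandas as pd

df = pd.DataFrame({"limit": [1, 5, 9], "value": [4, 4, 4]}, index=[10, 20, 30])
limit = 4           # local variable that shares its name with a column
lst = [20, 30, 40]  # local list of index labels to look for

# Bare name: `limit` is the DataFrame column.
by_column = df.query("value > limit")
# '@' prefix: `@limit` is the local variable.
by_local = df.query("limit > @limit")
# `index` refers to the DataFrame's index, `@lst` to the local list.
by_index = df.query("index in @lst")

print(list(by_column.index), list(by_local.index), list(by_index.index))
```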
@@ -717,7 +720,6 @@ to the legend in the plots further down.
    df[df.index.apply(lambda x: x in lst)]
 
 
-
 :meth:`DataFrame.join`
 
 .. code-block:: python
@@ -728,13 +730,13 @@ to the legend in the plots further down.
    # this can actually be fast for small DataFrames
    df[[x in dct for x in df.index]]
 
-   # isin_series, query_eqeq Series, query_in Series, pydict,
+   # isin_series, query_in Series, pydict,
    # join and isin_list are included in the plots below.
 
 
 As seen below, generally using a ``Series`` is better than using pure python data
 structures for anything larger than very small datasets of around 1000 records.
-The fastest two being ``query('col == @series')`` and ``join(series)``:
+The fastest two being ``join(series)``:
 
 .. code-block:: python
 
@@ -743,9 +745,6 @@ The fastest two being ``query('col == @series')`` and ``join(series)``:
 
    df = DataFrame(lst, columns=['ID'])
 
-   df.query('index == @series')
-   # 10 loops, best of 3: 82.9 ms per loop
-
    df.join(series, how='inner')
    # 100 loops, best of 3: 19.2 ms per loop
 
@@ -769,8 +768,7 @@ df.index vs df.column doesn't make a difference here:
    df[df.index.isin(series)]
    # 1 loops, best of 3: 475 ms per loop
 
-The ``query`` 'in' syntax has the same performance as ``isin``, except
-for when using '==' with a ``Series``:
+The ``query`` 'in' syntax has the same performance as ``isin``.
 
 .. code-block:: python
 
@@ -783,13 +781,6 @@ for when using '==' with a ``Series``:
    df.query('index == @lst')
    # 1 loops, best of 3: 1.03 s per loop
 
-'==' is actually quite a bit faster than 'in' when used against a Series
-but not as fast as ``join``.
-
-.. code-block:: python
-
-   df.query('index == @series')
-   # 10 loops, best of 3: 80.5 ms per loop
 
 For ``join``, the data must be the index in the ``DataFrame`` and the index in the ``Series``
 for the best performance. The ``Series`` must also have a ``name``. ``join`` defaults to a
@@ -833,8 +824,8 @@ It's actually faster to use ``apply`` or a list comprehension for these small ca
 
 
 Here is a visualization of some of the benchmarks above. You can see that except for with
-very small datasets, ``isin(Series)``, ``join(Series)``, and ``query('col == Series')``
-quickly become faster than the pure python data structures.
+very small datasets, ``isin(Series)`` and ``join(Series)`` quickly become faster than the
+pure python data structures.
 
 .. image:: _static/existence-perf-small.png
 
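
Review note: the documentation touched by this commit compares several ways of running the same existence test. As a rough sketch, the three approaches that remain after this change (`isin`, `query` with `in`, and an inner `join`) can be checked against each other on a toy frame. The frame contents and the Series name `wanted` are assumptions made for illustration. The Series carries its values in its index and has a `name`, as the documentation requires for `join`.

```python
import pandas as pd

df = pd.DataFrame({"payload": ["a", "b", "c", "d"]}, index=[1, 2, 3, 4])
lst = [2, 4, 6]
# For join, the values to look for live in the Series index,
# and the Series must be named.
series = pd.Series(lst, index=lst, name="wanted")

via_isin = df[df.index.isin(lst)]
via_query = df.query("index in @lst")
via_join = df.join(series, how="inner")

# All three keep the same rows; join also adds the `wanted` column.
same_rows = (
    list(via_isin.index) == list(via_query.index) == sorted(via_join.index)
)
print(same_rows, list(via_join.columns))
```

This only checks that the approaches select the same rows. It says nothing about their relative speed, which is what the benchmarks in the documentation measure.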
