@@ -672,9 +672,13 @@ by inferring the result type of an expression from its arguments and operators.
Existence (IsIn, Inner Join, Dict/Hash, Query)
----------------------------------------------------

- There are a number of different ways to test for existence using pandas. The
- following methods can be used to achieve an existence test. The comments correspond
- to the legend in the plots further down.
+ Existence is the process of testing if an item exists in another list of items, and
+ in the case of a DataFrame, we're testing each value of a column for existence in
+ another collection of items.
+
+ There are a number of different ways to test for existence using pandas and the
+ following methods are a few of those. The comments correspond to the legend
+ in the plots further down.
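
As a minimal sketch of the existence test described above: the frame, its index labels, and the name `lst` are made up for illustration and are not taken from the benchmarks in this document.

```python
import pandas as pd

# Hypothetical data: a small frame indexed by ID.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[1, 2, 3, 4])
lst = [2, 4, 5]  # items to test for; 5 does not appear in the index

# Boolean mask: True where an index label is present in lst.
mask = df.index.isin(lst)

# Keep only the rows whose index label exists in lst.
result = df[mask]
print(result)
```

The same mask could be built from a column instead of the index by calling `isin` on that column.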

:meth:`DataFrame.isin`
@@ -694,20 +698,19 @@ to the legend in the plots further down.
.. code-block:: python

+ # The '@' symbol is used with `query` to reference local variables. Names
+ # without '@' will reference the DataFrame's columns or index.
+
# query_in list
df.query('index in @lst')
# query_in Series
df.query('index in @series')
- # query_in dict
- df.query('index in @dct')

- # query_eqeq list
- df.query('index == @lst')
- # query_eqeq Series
- df.query('index == @series')
-
- # dict actually throws an error with '=='
+ # A list can be used with `query('.. == ..')` to test for existence
+ # but other data structures such as the `pandas.Series` have
+ # a different behaviour.

+ df.query('index == @lst')
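
The following sketch illustrates the `@` lookup and the list behaviour of `==` noted in the comments above. The data and the name `lst` are illustrative assumptions, and the frame deliberately has an unnamed index so that `index` in the expression refers to it.

```python
import pandas as pd

# Hypothetical data with an unnamed index.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[1, 2, 3, 4])
lst = [2, 4, 5]

# '@lst' resolves to the local Python variable; 'index' resolves to df.index.
by_in = df.query("index in @lst")

# With a list on the right-hand side, '==' is treated as a membership test.
by_eq = df.query("index == @lst")

# Reference result built with isin for comparison.
by_isin = df[df.index.isin(lst)]

print(by_in.equals(by_isin), by_eq.equals(by_isin))
```

This only covers the list case; it makes no claim about how `==` behaves with a `Series` or a dict.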

:meth:`DataFrame.apply`
@@ -717,7 +720,6 @@ to the legend in the plots further down.
df[df.index.apply(lambda x: x in lst)]


-
:meth:`DataFrame.join`

.. code-block:: python
@@ -728,13 +730,13 @@ to the legend in the plots further down.
# this can actually be fast for small DataFrames
df[[x in dct for x in df.index]]

- # isin_series, query_eqeq Series, query_in Series, pydict,
+ # isin_series, query_in Series, pydict,
# join and isin_list are included in the plots below.


As seen below, generally using a ``Series`` is better than using pure python data
structures for anything larger than very small datasets of around 1000 records.
- The fastest two being ``query('col == @series')`` and ``join(series)``:
+ The fastest being ``join(series)``:

.. code-block:: python

@@ -743,9 +745,6 @@ The fastest two being ``query('col == @series')`` and ``join(series)``:
df = DataFrame(lst, columns=['ID'])

- df.query('index == @series')
- # 10 loops, best of 3: 82.9 ms per loop
-
df.join(series, how='inner')
# 100 loops, best of 3: 19.2 ms per loop

@@ -769,8 +768,7 @@ df.index vs df.column doesn't make a difference here:
df[df.index.isin(series)]
# 1 loops, best of 3: 475 ms per loop

- The ``query`` 'in' syntax has the same performance as ``isin``, except
- for when using '==' with a ``Series``:
+ The ``query`` 'in' syntax has the same performance as ``isin``.

.. code-block:: python

@@ -783,13 +781,6 @@ for when using '==' with a ``Series``:
df.query('index == @lst')
# 1 loops, best of 3: 1.03 s per loop

- '==' is actually quite a bit faster than 'in' when used against a Series
- but not as fast as ``join``.
-
- .. code-block:: python
-
- df.query('index == @series')
- # 10 loops, best of 3: 80.5 ms per loop

For ``join``, the data must be the index in the ``DataFrame`` and the index in the ``Series``
for the best performance. The ``Series`` must also have a ``name``. ``join`` defaults to a
@@ -833,8 +824,8 @@ It's actually faster to use ``apply`` or a list comprehension for these small ca
Here is a visualization of some of the benchmarks above. You can see that except for with
- very small datasets, ``isin(Series)``, ``join(Series)``, and ``query('col == Series')``
- quickly become faster than the pure python data structures.
+ very small datasets, ``isin(Series)`` and ``join(Series)`` quickly become faster than the
+ pure python data structures.

.. image:: _static/existence-perf-small.png