@@ -668,3 +668,180 @@ In general, :meth:`DataFrame.query`/:func:`pandas.eval` will
evaluate the subexpressions that *can* be evaluated by ``numexpr`` and those
that must be evaluated in Python space transparently to the user. This is done
by inferring the result type of an expression from its arguments and operators.
+
+ Existence (IsIn, Inner Join, Dict/Hash, Query)
+ ----------------------------------------------------
+
+ There are several ways to test for existence with pandas, that is, to select
+ the rows whose index (or column) values appear in another collection. The
+ comments in the examples below correspond to the legend in the plots further down.
+
+
+ :meth:`DataFrame.isin`
+
+ .. code-block:: python
+
+     # isin_list
+     df[df.index.isin(lst)]
+     # isin_dict
+     df[df.index.isin(dct)]
+     # isin_series
+     df[df.index.isin(series)]
+
+
+
+ :meth:`DataFrame.query`
+
+ .. code-block:: python
+
+     # query_in list
+     df.query('index in @lst')
+     # query_in Series
+     df.query('index in @series')
+     # query_in dict
+     df.query('index in @dct')
+
+     # query_eqeq list
+     df.query('index == @lst')
+     # query_eqeq Series
+     df.query('index == @series')
+
+     # dict actually throws an error with '=='
+
+
+
+ :meth:`DataFrame.apply`
+
+ .. code-block:: python
+
+     # an Index has no apply(), so convert it to a Series first
+     df[df.index.to_series().apply(lambda x: x in lst)]
+
+
+
+ :meth:`DataFrame.join`
+
+ .. code-block:: python
+
+     # join (the right-hand side must be a named Series, not a plain list)
+     df.join(series, how='inner')
+
+     # pydict: this can actually be fast for small DataFrames
+     df[[x in dct for x in df.index]]
+
+     # isin_series, query_eqeq Series, query_in Series, pydict,
+     # join and isin_list are included in the plots below.
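
As a minimal, self-contained sketch of the approaches listed above, the following uses a small made-up DataFrame (the data and the variable names ``by_isin``, ``by_query_in`` and ``by_join`` are illustrative assumptions, not part of the benchmarks) to show that they select the same rows:

```python
import pandas as pd

# Illustrative data only: a DataFrame with the default index 0..9.
df = pd.DataFrame({"ID": range(10)})
lst = [2, 3, 5, 42]  # 42 does not occur in df.index

# For join, the lookup values must be the *index* of a named Series.
series = pd.Series(lst, index=lst, name="data")

by_isin = df[df.index.isin(lst)]          # boolean mask from isin
by_query_in = df.query("index in @lst")   # same test via query
by_join = df.join(series, how="inner")    # inner join on the index

print(list(by_isin.index))
print(list(by_query_in.index))
print(sorted(by_join.index))
```

Note that ``join`` returns the matching rows with the Series attached as an extra column (``data`` here), whereas ``isin`` and ``query`` return the rows of ``df`` unchanged.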
+
+
+ As seen below, using a ``Series`` is generally faster than using pure Python data
+ structures for anything larger than very small datasets of around 1000 records.
+ The two fastest approaches are ``query('col == @series')`` and ``join(series)``:
+
+ .. code-block:: python
+
+     from pandas import DataFrame, Series
+
+     lst = range(1000000)
+     series = Series(lst, name='data')
+
+     df = DataFrame(lst, columns=['ID'])
+
+     df.query('index == @series')
+     # 10 loops, best of 3: 82.9 ms per loop
+
+     df.join(series, how='inner')
+     # 100 loops, best of 3: 19.2 ms per loop
+
+ list vs Series:
+
+ .. code-block:: python
+
+     df[df.index.isin(lst)]
+     # 1 loops, best of 3: 1.06 s per loop
+
+     df[df.index.isin(series)]
+     # 1 loops, best of 3: 477 ms per loop
+
+ ``df.index`` vs a column doesn't make a difference here:
+
+ .. code-block:: python
+
+     df[df.ID.isin(series)]
+     # 1 loops, best of 3: 474 ms per loop
+
+     df[df.index.isin(series)]
+     # 1 loops, best of 3: 475 ms per loop
+
+ ``query`` with 'in' performs about the same as ``isin``, and so does '==' with
+ a list. The exception is '==' with a ``Series``:
+
+ .. code-block:: python
+
+     df.query('index in @lst')
+     # 1 loops, best of 3: 1.04 s per loop
+
+     df.query('index in @series')
+     # 1 loops, best of 3: 451 ms per loop
+
+     df.query('index == @lst')
+     # 1 loops, best of 3: 1.03 s per loop
+
+ '==' is quite a bit faster than 'in' when used against a ``Series``,
+ but not as fast as ``join``.
+
+ .. code-block:: python
+
+     df.query('index == @series')
+     # 10 loops, best of 3: 80.5 ms per loop
+
+ For the best performance with ``join``, the data must be the index of both the
+ ``DataFrame`` and the ``Series``. The ``Series`` must also have a ``name``. ``join``
+ defaults to a left join, so ``how='inner'`` must be specified for an existence test.
+
+ .. code-block:: python
+
+     df.join(series, how='inner')
+     # 100 loops, best of 3: 19.7 ms per loop
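
The ``join`` requirements described above can be illustrated with a small sketch. The data and the names ``wanted``, ``unnamed``, ``left`` and ``inner`` are assumptions made up for this example; it only shows the difference between the default left join and an inner join, and what happens when the ``Series`` has no name:

```python
import pandas as pd

# Illustrative data only: default index 0..3.
df = pd.DataFrame({"ID": [10, 20, 30, 40]})

# The values to look up live in the index of a *named* Series.
wanted = pd.Series(True, index=[1, 3, 7], name="wanted")

# Default (left) join: every row of df is kept; unmatched rows get NaN.
left = df.join(wanted)

# Inner join: only rows whose index label exists in both objects survive.
inner = df.join(wanted, how="inner")

# A Series without a name is rejected by join.
unnamed = pd.Series(True, index=[1, 3, 7])
try:
    df.join(unnamed, how="inner")
    unnamed_rejected = False
except ValueError:
    unnamed_rejected = True

print(left)
print(inner)
print(unnamed_rejected)
```

Only the inner join acts as an existence filter; the left join would need an additional ``dropna`` step to achieve the same result.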
+
+ Smaller datasets:
+
+ .. code-block:: python
+
+     df = DataFrame([1, 2, 3, 4], columns=['ID'])
+     lst = range(10000)
+     dct = dict(zip(lst, lst))
+     series = Series(lst, name='data')
+
+     df.join(series, how='inner')
+     # 1000 loops, best of 3: 866 us per loop
+
+     df[df.ID.isin(dct)]
+     # 1000 loops, best of 3: 809 us per loop
+
+     df[df.ID.isin(lst)]
+     # 1000 loops, best of 3: 853 us per loop
+
+     df[df.ID.isin(series)]
+     # 100 loops, best of 3: 2.22 ms per loop
823
+
824
+ It's actually faster to use ``apply `` or a list comprehension for these small cases.
825
+
826
+ .. code-block :: python
827
+
828
+ df[[x in dct for x in df.ID ]]
829
+ # 1000 loops, best of 3: 266 us per loop
830
+
831
+ df[df.ID .apply(lambda x : x in dct)]
832
+ # 1000 loops, best of 3: 364 us per loop
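
The timings above depend on the machine and on the pandas version, so it can be worth re-measuring them locally. The following is one possible harness using the standard library's ``timeit`` module. The ``candidates`` and ``timings`` names and the repeat counts are arbitrary choices for this sketch, and no particular ranking is implied:

```python
import timeit

import pandas as pd

# Same shape as the small-dataset example: 4 rows tested against 10000 values.
df = pd.DataFrame({"ID": [1, 2, 3, 4]})
lst = list(range(10000))
dct = dict(zip(lst, lst))
series = pd.Series(lst, name="data")

# Each candidate is a zero-argument callable returning the filtered frame.
candidates = {
    "isin_list": lambda: df[df.ID.isin(lst)],
    "isin_series": lambda: df[df.ID.isin(series)],
    "pydict": lambda: df[[x in dct for x in df.ID]],
    "apply": lambda: df[df.ID.apply(lambda x: x in dct)],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 50 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=50, repeat=3)) / 50

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:12s} {seconds * 1e6:10.1f} us per loop")
```

Taking the minimum over the repeats mirrors the "best of 3" convention used in the timings quoted in this section.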
833
+
834
+
835
+ Here is a visualization of some of the benchmarks above. You can see that except for with
836
+ very small datasets, ``isin(Series) ``, ``join(Series) ``, and ``query('col == Series') ``
837
+ quickly become faster than the pure python data structures.
838
+
839
+ .. image :: _static/existence-perf-small.png
840
+
841
+ However, ``isin(Series) `` still presents fairly poor exponential performance where ``join `` is quite
842
+ fast for large datasets. There is some overhead involved in ensuring your data is the index
843
+ in both your left and right datasets but that time should be clearly outweighed by the gains of
844
+ the join itself. For extremely large datasets, you may start bumping into memory limits since ``join ``
845
+ does not perform any disk chunking, etc.
846
+
847
+ .. image :: _static/existence-perf-large.png