PERF: fix getitem unique_check / initialization issue #14933

jreback · 2016-12-20T21:19:03Z

jreback · 2016-12-20T22:13:13Z

    before     after       ratio
  [708792aa] [a46287a8]
-  169.39ms   153.27ms      0.90  panel_ctor.Constructors1.time_panel_from_dict_all_different_indexes
-   17.17ms    15.36ms      0.89  join_merge.Join.time_join_dataframe_index_multi
-    9.92ms     8.87ms      0.89  index_object.StringIndex.time_boolean_indexer
-    9.17ms     8.18ms      0.89  reindex.LibFastZip.time_lib_fast_zip
-  822.04μs   731.27μs      0.89  reindex.LevelAlign.time_reindex_level
-    3.10ms     2.75ms      0.89  indexing.IndexingMethods.time_take_dtindex
-   63.11ms    55.97ms      0.89  indexing.Int64Indexing.time_ix_list_like
-  316.38μs   280.35μs      0.89  reindex.FillMethod.time_pad
-    4.31ms     3.82ms      0.88  index_object.SetOperations.time_str_symmetric_difference
-   10.10ms     8.88ms      0.88  index_object.Multi3.time_datetime_level_values_full
-    9.96μs     8.65μs      0.87  indexing.IndexingMethods.time_get_loc_float
-   40.57ms    35.20ms      0.87  reindex.Duplicates.time_frame_drop_dups_int
-   65.57ms    56.86ms      0.87  indexing.Int64Indexing.time_loc_scalar
-   62.76ms    54.24ms      0.86  indexing.Int64Indexing.time_ix_array
-   63.77ms    54.83ms      0.86  indexing.Int64Indexing.time_getitem_array
-  433.07ms   370.61ms      0.86  index_object.Multi2.time_sortlevel_int64
-   10.41ms     8.84ms      0.85  index_object.StringIndex.time_boolean_series_indexer
-   63.09ms    53.41ms      0.85  indexing.Int64Indexing.time_getitem_list_like
-    1.55ms     1.31ms      0.84  reindex.Reindexing.time_reindex_multiindex
-   65.60ms    54.52ms      0.83  indexing.Int64Indexing.time_loc_list_like
-  110.24ms    91.35ms      0.83  index_object.Multi1.time_duplicated
-  132.45ms   107.34ms      0.81  join_merge.Align.time_series_align_int64_index
-     2.62s      2.10s      0.80  join_merge.JoinIndex.time_left_outer_join_index
-   63.71ms    50.74ms      0.80  reindex.Align.time_align_series_irregular_string
-    3.35ms     2.65ms      0.79  reindex.Duplicates.time_frame_drop_dups_na_inplace
-  249.99ms   191.72ms      0.77  index_object.SetOperations.time_int64_symmetric_difference
-    2.49ms     1.86ms      0.75  index_object.SetOperations.time_str_difference
-  307.17ms   224.61ms      0.73  indexing.StringIndexing.time_getitem_label_slice
-  304.70ms   215.89ms      0.71  indexing.StringIndexing.time_get_value
-  178.70ms   126.61ms      0.71  index_object.SetOperations.time_int64_difference
-  566.26μs   379.90μs      0.67  reindex.Reindexing.time_reindex_dates
-  141.99ms    94.44ms      0.67  index_object.SetOperations.time_int64_union
-   22.77μs    15.10μs      0.66  indexing.DataFrameIndexing.time_getitem_scalar
-    6.49ms     4.28ms      0.66  index_object.SetOperations.time_datetime_symmetric_difference
-    6.14ms     3.83ms      0.62  index_object.SetOperations.time_datetime_difference
-   95.17ms    46.62ms      0.49  index_object.SetOperations.time_int64_intersection
-    4.09ms     1.60ms      0.39  index_object.SetOperations.time_datetime_difference_disjoint
-  275.79μs   103.85μs      0.38  index_object.SetOperations.time_datetime_union
-     3.30s   223.13ms      0.07  frame_methods.Iteration.time_iteritems_indexing

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

ssanderson · 2016-12-21T00:14:25Z

@jreback am I correct in thinking that the main functional change here is the addition of the second clause in the boolean check here?. EDIT: that link doesn't work. I'll comment inline.

jreback · 2016-12-21T00:23:16Z

@ssanderson no I think was missing a unique_check(since renamed to need_unique_check and I reversed the polarity of these.

The original reason for this (the lazy construction), should be preserved. (I wish we actually had a test for this :<)

The .mapping is None clause was to correct for a case where we hit _ensure_mapping_populated but we actually hadn't populated it yet. (note this doesn't actually happen when over threshold).

The needs_* checks are really just to make things work when you have uniques.

ssanderson · 2016-12-21T00:31:15Z

pandas/index.pyx

@@ -269,7 +261,7 @@ cdef class IndexEngine:
    cdef inline _ensure_mapping_populated(self):
        # need to reset if we have previously
        # set the initialized from monotonic checks
-        if self.unique_check:
+        if self.need_unique_check or self.mapping is None:


So, if I understand correctly, previously we were dispatching on self.unique_check here to determine whether to unset the self.initialized flag here before determining whether to create a populate a hash table. The reason we need to compare against self.unique_check is that, if that flag is set, we might have set self.initialized = 1 even though we don't have a hash table, because we set self.initialized = 1 in the case that we were able to successfully determine uniqueness from the monotonic check.

Did I get that right? If so, would a simpler fix be to just drop the initialized flag entirely and use self.mapping is None everywhere it's currently used? I think that could be done if we also update the is_unique property to gate on the need_unique_check flag instead of the initialized flag, which seems generally clearer anyway.

ssanderson · 2016-12-21T00:37:56Z

no I think was missing a unique_check

Yeah, I realized we'd actually flipped the polarity on a second read.

ssanderson · 2016-12-21T00:40:08Z

Some day I will find the time to finish my branch that was working on this code :/. Maybe I'll have a free night or two over the holidays.

codecov-io · 2016-12-21T01:08:51Z

Current coverage is 84.64% (diff: 100%)

Merging #14933 into master will not change coverage

@@             master     #14933   diff @@
==========================================
  Files           144        144          
  Lines         51025      51025          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43191      43191          
  Misses         7834       7834          
  Partials          0          0

Powered by Codecov. Last update 4c3d4d4...dc32b39

jreback · 2016-12-21T01:46:58Z

@ssanderson I simplified

ssanderson · 2016-12-21T02:04:26Z

pandas/index.pyx

-        if not self.initialized:
+        # populate the mappings
+
+        if self.need_unique_check or not self.initialized:


Do we still need to check against need_unique_check here, or is self.initialized sufficient now?

removed, only need a single routine now.

ssanderson · 2016-12-21T02:09:05Z

pandas/index.pyx


-        self.mapping = self._make_hash_table(len(values))
-        self.mapping.map_locations(values)
+        if not self.initialized:


Do we ever expect to initialize to be called if we're already initialized? If not, should we error and/or log a warning if we hit this branch?

this makes it idempotent. and actually simpler logic.

ssanderson · 2016-12-21T02:44:07Z

👍 from me if tests are passing.

closes pandas-dev#14930 Author: Jeff Reback <[email protected]> Closes pandas-dev#14933 from jreback/perf and squashes the following commits: dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue (cherry picked from commit 07c83ee)

closes pandas-dev#14930 Author: Jeff Reback <[email protected]> Closes pandas-dev#14933 from jreback/perf and squashes the following commits: dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Dec 20, 2016

jreback added this to the 0.19.2 milestone Dec 20, 2016

jreback force-pushed the perf branch from 32abae9 to b04b61e Compare December 20, 2016 21:50

ssanderson reviewed Dec 21, 2016

View reviewed changes

jreback force-pushed the perf branch from b04b61e to 230f436 Compare December 21, 2016 01:44

jreback force-pushed the perf branch from 230f436 to 87e17d4 Compare December 21, 2016 02:00

ssanderson reviewed Dec 21, 2016

View reviewed changes

PERF: fix getitem unique_check / initialization issue

dc32b39

jreback force-pushed the perf branch from 87e17d4 to dc32b39 Compare December 21, 2016 02:10

jreback closed this in 07c83ee Dec 21, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: fix getitem unique_check / initialization issue #14933

PERF: fix getitem unique_check / initialization issue #14933

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016 •

edited

Loading

ssanderson commented Dec 21, 2016 •

edited

Loading

jreback commented Dec 21, 2016 •

edited

Loading

ssanderson Dec 21, 2016 •

edited

Loading

ssanderson commented Dec 21, 2016

ssanderson commented Dec 21, 2016

codecov-io commented Dec 21, 2016 •

edited

Loading

jreback commented Dec 21, 2016

ssanderson Dec 21, 2016

jreback Dec 21, 2016

ssanderson Dec 21, 2016

jreback Dec 21, 2016

ssanderson commented Dec 21, 2016

PERF: fix getitem unique_check / initialization issue #14933

PERF: fix getitem unique_check / initialization issue #14933

Conversation

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016 • edited Loading

ssanderson commented Dec 21, 2016 • edited Loading

jreback commented Dec 21, 2016 • edited Loading

ssanderson Dec 21, 2016 • edited Loading

Choose a reason for hiding this comment

ssanderson commented Dec 21, 2016

ssanderson commented Dec 21, 2016

codecov-io commented Dec 21, 2016 • edited Loading

Current coverage is 84.64% (diff: 100%)

jreback commented Dec 21, 2016

ssanderson Dec 21, 2016

Choose a reason for hiding this comment

jreback Dec 21, 2016

Choose a reason for hiding this comment

ssanderson Dec 21, 2016

Choose a reason for hiding this comment

jreback Dec 21, 2016

Choose a reason for hiding this comment

ssanderson commented Dec 21, 2016

jreback commented Dec 20, 2016 •

edited

Loading

ssanderson commented Dec 21, 2016 •

edited

Loading

jreback commented Dec 21, 2016 •

edited

Loading

ssanderson Dec 21, 2016 •

edited

Loading

codecov-io commented Dec 21, 2016 •

edited

Loading