Skip to content

PERF: fix getitem unique_check / initialization issue #14933

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Dec 20, 2016

closes #14930

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Dec 20, 2016
@jreback jreback added this to the 0.19.2 milestone Dec 20, 2016
@jreback
Copy link
Contributor Author

jreback commented Dec 20, 2016

    before     after       ratio
  [708792aa] [a46287a8]
-  169.39ms   153.27ms      0.90  panel_ctor.Constructors1.time_panel_from_dict_all_different_indexes
-   17.17ms    15.36ms      0.89  join_merge.Join.time_join_dataframe_index_multi
-    9.92ms     8.87ms      0.89  index_object.StringIndex.time_boolean_indexer
-    9.17ms     8.18ms      0.89  reindex.LibFastZip.time_lib_fast_zip
-  822.04μs   731.27μs      0.89  reindex.LevelAlign.time_reindex_level
-    3.10ms     2.75ms      0.89  indexing.IndexingMethods.time_take_dtindex
-   63.11ms    55.97ms      0.89  indexing.Int64Indexing.time_ix_list_like
-  316.38μs   280.35μs      0.89  reindex.FillMethod.time_pad
-    4.31ms     3.82ms      0.88  index_object.SetOperations.time_str_symmetric_difference
-   10.10ms     8.88ms      0.88  index_object.Multi3.time_datetime_level_values_full
-    9.96μs     8.65μs      0.87  indexing.IndexingMethods.time_get_loc_float
-   40.57ms    35.20ms      0.87  reindex.Duplicates.time_frame_drop_dups_int
-   65.57ms    56.86ms      0.87  indexing.Int64Indexing.time_loc_scalar
-   62.76ms    54.24ms      0.86  indexing.Int64Indexing.time_ix_array
-   63.77ms    54.83ms      0.86  indexing.Int64Indexing.time_getitem_array
-  433.07ms   370.61ms      0.86  index_object.Multi2.time_sortlevel_int64
-   10.41ms     8.84ms      0.85  index_object.StringIndex.time_boolean_series_indexer
-   63.09ms    53.41ms      0.85  indexing.Int64Indexing.time_getitem_list_like
-    1.55ms     1.31ms      0.84  reindex.Reindexing.time_reindex_multiindex
-   65.60ms    54.52ms      0.83  indexing.Int64Indexing.time_loc_list_like
-  110.24ms    91.35ms      0.83  index_object.Multi1.time_duplicated
-  132.45ms   107.34ms      0.81  join_merge.Align.time_series_align_int64_index
-     2.62s      2.10s      0.80  join_merge.JoinIndex.time_left_outer_join_index
-   63.71ms    50.74ms      0.80  reindex.Align.time_align_series_irregular_string
-    3.35ms     2.65ms      0.79  reindex.Duplicates.time_frame_drop_dups_na_inplace
-  249.99ms   191.72ms      0.77  index_object.SetOperations.time_int64_symmetric_difference
-    2.49ms     1.86ms      0.75  index_object.SetOperations.time_str_difference
-  307.17ms   224.61ms      0.73  indexing.StringIndexing.time_getitem_label_slice
-  304.70ms   215.89ms      0.71  indexing.StringIndexing.time_get_value
-  178.70ms   126.61ms      0.71  index_object.SetOperations.time_int64_difference
-  566.26μs   379.90μs      0.67  reindex.Reindexing.time_reindex_dates
-  141.99ms    94.44ms      0.67  index_object.SetOperations.time_int64_union
-   22.77μs    15.10μs      0.66  indexing.DataFrameIndexing.time_getitem_scalar
-    6.49ms     4.28ms      0.66  index_object.SetOperations.time_datetime_symmetric_difference
-    6.14ms     3.83ms      0.62  index_object.SetOperations.time_datetime_difference
-   95.17ms    46.62ms      0.49  index_object.SetOperations.time_int64_intersection
-    4.09ms     1.60ms      0.39  index_object.SetOperations.time_datetime_difference_disjoint
-  275.79μs   103.85μs      0.38  index_object.SetOperations.time_datetime_union
-     3.30s   223.13ms      0.07  frame_methods.Iteration.time_iteritems_indexing

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@ssanderson
Copy link
Contributor

ssanderson commented Dec 21, 2016

@jreback am I correct in thinking that the main functional change here is the addition of the second clause in the boolean check here?. EDIT: that link doesn't work. I'll comment inline.

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2016

@ssanderson no I think was missing a unique_check(since renamed to need_unique_check and I reversed the polarity of these.

The original reason for this (the lazy construction), should be preserved. (I wish we actually had a test for this :<)

The .mapping is None clause was to correct for a case where we hit _ensure_mapping_populated but we actually hadn't populated it yet. (note this doesn't actually happen when over threshold).

The needs_* checks are really just to make things work when you have uniques.

@@ -269,7 +261,7 @@ cdef class IndexEngine:
cdef inline _ensure_mapping_populated(self):
# need to reset if we have previously
# set the initialized from monotonic checks
if self.unique_check:
if self.need_unique_check or self.mapping is None:
Copy link
Contributor

@ssanderson ssanderson Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I understand correctly, previously we were dispatching on self.unique_check here to determine whether to unset the self.initialized flag here before determining whether to create a populate a hash table. The reason we need to compare against self.unique_check is that, if that flag is set, we might have set self.initialized = 1 even though we don't have a hash table, because we set self.initialized = 1 in the case that we were able to successfully determine uniqueness from the monotonic check.

Did I get that right? If so, would a simpler fix be to just drop the initialized flag entirely and use self.mapping is None everywhere it's currently used? I think that could be done if we also update the is_unique property to gate on the need_unique_check flag instead of the initialized flag, which seems generally clearer anyway.

@ssanderson
Copy link
Contributor

no I think was missing a unique_check

Yeah, I realized we'd actually flipped the polarity on a second read.

@ssanderson
Copy link
Contributor

Some day I will find the time to finish my branch that was working on this code :/. Maybe I'll have a free night or two over the holidays.

@codecov-io
Copy link

codecov-io commented Dec 21, 2016

Current coverage is 84.64% (diff: 100%)

Merging #14933 into master will not change coverage

@@             master     #14933   diff @@
==========================================
  Files           144        144          
  Lines         51025      51025          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43191      43191          
  Misses         7834       7834          
  Partials          0          0          

Powered by Codecov. Last update 4c3d4d4...dc32b39

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2016

@ssanderson I simplified

if not self.initialized:
# populate the mappings

if self.need_unique_check or not self.initialized:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to check against need_unique_check here, or is self.initialized sufficient now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, only need a single routine now.


self.mapping = self._make_hash_table(len(values))
self.mapping.map_locations(values)
if not self.initialized:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever expect to initialize to be called if we're already initialized? If not, should we error and/or log a warning if we hit this branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes it idempotent. and actually simpler logic.

@ssanderson
Copy link
Contributor

👍 from me if tests are passing.

@jreback jreback closed this in 07c83ee Dec 21, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 24, 2016
closes pandas-dev#14930

Author: Jeff Reback <[email protected]>

Closes pandas-dev#14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue

(cherry picked from commit 07c83ee)
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
closes pandas-dev#14930

Author: Jeff Reback <[email protected]>

Closes pandas-dev#14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame __getitem__ ~100X slower on Pandas 0.19.1 vs 0.18.1 possibly getitem caching?
3 participants