BUG: fixing mixup of bucket and element counts in hash tables #39457
Conversation
I have run a quick asv benchmark and will run the whole test suite overnight.
Running the whole test suite yields somewhat mixed results, which is not surprising per se: if there are many duplicates, preallocating bigger tables than really needed means worse performance. However, the issue in these cases is the heuristic for how much should be preallocated. I'm surprised to see a decrease of performance for
What if you allocated, say, 2x what is needed? Does this affect things? E.g. that's what the current code is doing, trading memory for time.
Allocating a bigger table than actually needed should have no positive effect on performance (in the average case). It could, however, have a negative impact: for example, if the smaller table fits into the cache but the bigger one does not, there would be additional cache misses and quite a big performance regression. Something similar happens, for example, in one of the benchmarks:
the series has 5e5 elements but only 1e5 unique elements, so prior to this fix the table was about 2 times bigger than needed, and after the fix about 4 times bigger than needed - this leads to additional costs. But the issue is not this fix, but rather the simple heuristic used ("preallocate as much space as needed for the worst case"), which is good for the worst case, as for example in the benchmark
where, due to the fix, a rehashing of the table was avoided and thus about 50% of the running time could be saved. To me it looks like a sane strategy to improve the worst case and to take some potential hits in other cases. PS: I have changed the behavior of
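To make the numbers above concrete, here is a small back-of-the-envelope sketch (my own illustration, not code from this PR). It assumes a khash-style occupancy limit of roughly 0.75 and bucket counts rounded up to the next power of two, and reproduces the "about 2x too big before the fix, about 4x too big after the fix" figures for a series with 5e5 elements and 1e5 unique values:

```c
/* Back-of-the-envelope sketch (assumed load-factor limit ~0.75, bucket counts
 * rounded up to the next power of two as khash does): why a series with 5e5
 * elements but only 1e5 unique values gets a table ~2x too big before the fix
 * and ~4x too big after it. */
#include <stdio.h>

/* round up to the next power of two, mimicking khash's kroundup32 */
static unsigned round_pow2(unsigned x) {
    unsigned p = 1;
    while (p < x) p <<= 1;
    return p;
}

int main(void) {
    const double load_factor = 0.75;   /* assumed upper bound on occupancy */
    const unsigned n_total  = 500000;  /* elements in the series */
    const unsigned n_unique = 100000;  /* distinct values actually stored */

    unsigned needed   = round_pow2((unsigned)(n_unique / load_factor) + 1); /* 262144  */
    unsigned pre_fix  = round_pow2(n_total);                                /* 524288  */
    unsigned post_fix = round_pow2((unsigned)(n_total / load_factor) + 1);  /* 1048576 */

    printf("needed: %u, pre-fix: %u (%.0fx), post-fix: %u (%.0fx)\n",
           needed, pre_fix, (double)pre_fix / needed,
           post_fix, (double)post_fix / needed);
    return 0;
}
```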
thanks @realead nice as always! keep em coming!
The issue is that `kh_resize_##name` expects the number of buckets in the map as input (pandas/pandas/_libs/src/klib/khash.h, line 324 in 39e1261). It is however used with the number of elements throughout the pandas code, e.g. in pandas/pandas/core/algorithms.py (line 412 in 39e1261), as can be seen in the `__cinit__` function (pandas/pandas/_libs/hashtable_class_helper.pxi.in, lines 400 to 404 in 39e1261).
Thus, for some sizes (like 10^3) there could still be a rehashing even if the size hint was used with the intention of avoiding it (because at most about 75% of the buckets can be occupied).
I think it makes sense to do the less surprising thing with the size hint, i.e. interpret it as the number of elements and not the number of buckets.
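To illustrate the mixup, here is a hypothetical sketch using klib's khash API directly (not the pandas code itself; the ~0.75 occupancy limit and the int64 map type are assumptions for the example). Passing the number of elements to `kh_resize` rounds 1000 up to 1024 buckets, of which only about 75% may be occupied, so inserting 1000 unique keys still triggers a rehash; converting the element count to a bucket count via the load factor first avoids it.

```c
/* Hypothetical sketch using klib's khash directly (not the pandas code):
 * shows why passing the number of *elements* to kh_resize can still lead to
 * a rehash, assuming an occupancy limit of roughly 0.75 as described above. */
#include <stdint.h>
#include <stdio.h>
#include "khash.h"                      /* klib khash, vendored under pandas/_libs/src/klib */

KHASH_MAP_INIT_INT64(i64map, size_t)    /* example map: int64 keys -> size_t values */

int main(void) {
    const khint_t n_elements = 1000;    /* the 10^3 case mentioned above */
    khash_t(i64map) *h = kh_init(i64map);

    /* Buggy interpretation: treat the size hint as a bucket count.
     * 1000 is rounded up to 1024 buckets, but only ~75% of the buckets may be
     * occupied, so inserting 1000 unique keys forces at least one rehash. */
    kh_resize(i64map, h, n_elements);

    /* Fixed interpretation: convert elements to buckets via the load factor,
     * so n_elements unique keys fit without any rehash:
     * kh_resize(i64map, h, (khint_t)(n_elements / 0.75) + 1);             */

    int ret;
    for (int64_t key = 0; key < (int64_t)n_elements; key++) {
        khint_t it = kh_put(i64map, h, key, &ret);
        kh_value(h, it) = (size_t)key;
    }
    printf("buckets: %u, elements: %u\n",
           (unsigned)kh_n_buckets(h), (unsigned)kh_size(h));

    kh_destroy(i64map, h);
    return 0;
}
```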