Skip to content

BUG: fixing mixup of bucket and element counts in hash tables #39457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 2, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Jan 28, 2021

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

The issue is that, kh_resize_##name expects the number of buckets in the map as input:

SCOPE void kh_resize_##name(kh_##name##_t *h, khuint_t new_n_buckets) \

it is however used with the number of elements throughout the pandas code, e.g.

table = htable(len(values))

as can be seen in the __cinit__ function:

def __cinit__(self, int64_t size_hint=1):
self.table = kh_init_{{dtype}}()
if size_hint is not None:
size_hint = min(size_hint, SIZE_HINT_LIMIT)
kh_resize_{{dtype}}(self.table, size_hint)

Thus, for some sizes (like 10^3) there still could be a rehashing even if size-hint was used with the idea to avoid it (because at most 75% of buckets can be occupied)

I think it make sense to do the less surprising thing with the size-hint (i.e. the number of elements and not buckets).

@realead
Copy link
Contributor Author

realead commented Jan 28, 2021

I have run quite asv with asv continuous -f 1.05 upstream/master HEAD -b ^hash_functions -b ^indexing.NumericSeriesIndexing and got

       before           after         ratio
     [d0cfa030]       [6d6b9432]
+      7.68±0.3ms       11.2±0.6ms     1.46  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 1000000)
+      12.4±0.4ms       14.1±0.5ms     1.13  hash_functions.UniqueAndFactorizeArange.time_unique(15)
+      17.3±0.5ms       19.4±0.3ms     1.12  hash_functions.UniqueAndFactorizeArange.time_factorize(11)
+      12.3±0.5ms         13.7±2ms     1.12  hash_functions.UniqueAndFactorizeArange.time_unique(5)
-     3.09±0.04ms      2.89±0.05ms     0.94  hash_functions.IsinAlmostFullWithRandomInt.time_isin_outside(<class 'numpy.int64'>, 17)
-         191±5ms          172±3ms     0.90  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-         179±4μs          158±2μs     0.88  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 500000)
-      21.2±0.4ms       18.2±0.2ms     0.86  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-        45.2±1μs         37.4±1μs     0.83  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      17.5±0.6ms       14.5±0.4ms     0.83  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-         226±4ms          185±2ms     0.82  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      7.57±0.1ms      6.07±0.08ms     0.80  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 1000000)
-         233±2ms          183±5ms     0.78  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      4.80±0.4ms      3.59±0.06ms     0.75  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-      4.72±0.6ms       3.38±0.2ms     0.72  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        253±20ms          176±5ms     0.69  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      10.4±0.3ms       7.15±0.2ms     0.69  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-         249±4μs          162±1μs     0.65  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 500000)
-     4.13±0.03ms       2.51±0.3ms     0.61  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      73.2±0.5μs         44.4±1μs     0.61  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        524±60μs         314±70μs     0.60  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      5.67±0.1ms      3.35±0.08ms     0.59  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-     4.94±0.03ms       2.59±0.1ms     0.53  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      17.2±0.3ms       8.71±0.2ms     0.51  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         597±4μs          301±5μs     0.51  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        18.4±3ms       8.64±0.2ms     0.47  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-        97.4±3ms       20.3±0.1ms     0.21  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-        99.3±4ms         3.51±1ms     0.04  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-        97.4±4ms         306±10μs     0.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)

will run the whole test suite over night.

@jreback jreback added Bug Performance Memory or execution speed performance labels Jan 29, 2021
@realead
Copy link
Contributor Author

realead commented Jan 29, 2021

Running for the whole tests suite yields somewhat mixed results. Which is not surprising per se: If there are many duplicates, preallocating bigger tables as really needed means worse performance. However the issue in these cases is the heuristic of how much should be preallocated.

I'm surprised to see a decrise of performance for algorithms.Duplicated.time_duplicated, deduplicated doesn't use HashTable-objects but the underlying c-functions. Need to look into it.

      before           after         ratio
     [d0cfa030]       [6d6b9432]
+     4.87±0.02ms       7.79±0.2ms     1.60  algorithms.Duplicated.time_duplicated(False, False, 'uint')
+      4.81±0.1ms       7.37±0.2ms     1.53  algorithms.Duplicated.time_duplicated(False, False, 'int')
+     4.09±0.04ms       5.74±0.2ms     1.40  algorithms.Duplicated.time_duplicated(False, 'first', 'uint')
+     6.40±0.07ms       8.93±0.3ms     1.40  algorithms.Factorize.time_factorize(False, False, 'uint')
+      2.14±0.2μs       2.96±0.7μs     1.39  index_cached_properties.IndexCache.time_values('DatetimeIndex')
+      6.76±0.4ms       9.25±0.2ms     1.37  algorithms.Factorize.time_factorize(False, False, 'int')
+     3.32±0.01ms       4.48±0.4ms     1.35  groupby.TransformBools.time_transform_mean
+      7.01±0.2ms       9.42±0.3ms     1.34  algorithms.Duplicated.time_duplicated(False, 'first', 'datetime64[ns, tz]')
+     4.08±0.04ms       5.46±0.2ms     1.34  algorithms.Duplicated.time_duplicated(False, 'last', 'uint')
+      7.11±0.3ms       9.41±0.2ms     1.32  algorithms.Duplicated.time_duplicated(False, 'last', 'datetime64[ns, tz]')
+      9.77±0.3ms       12.8±0.3ms     1.31  algorithms.Factorize.time_factorize(False, True, 'uint')
+      7.20±0.6ms       9.39±0.2ms     1.30  algorithms.Duplicated.time_duplicated(False, 'last', 'datetime64[ns]')
+      9.57±0.5ms       12.4±0.1ms     1.29  algorithms.Factorize.time_factorize(False, False, 'datetime64[ns, tz]')
+     4.06±0.04ms       5.25±0.3ms     1.29  algorithms.Duplicated.time_duplicated(False, 'first', 'int')
+      7.37±0.5ms       9.50±0.4ms     1.29  algorithms.Duplicated.time_duplicated(False, 'first', 'datetime64[ns]')
+     4.11±0.06ms       5.29±0.3ms     1.29  algorithms.Duplicated.time_duplicated(False, 'last', 'int')
+      7.83±0.3ms       9.82±0.7ms     1.25  algorithms.Duplicated.time_duplicated(False, False, 'datetime64[ns, tz]')
+      7.42±0.2ms       9.08±0.2ms     1.22  algorithms.Duplicated.time_duplicated(False, 'last', 'float')
+      3.60±0.2μs       4.33±0.6μs     1.21  index_cached_properties.IndexCache.time_values('CategoricalIndex')
+      10.1±0.4ms       12.1±0.3ms     1.20  algorithms.Factorize.time_factorize(False, False, 'datetime64[ns]')
+      10.5±0.2ms       12.5±0.4ms     1.19  algorithms.Factorize.time_factorize(False, True, 'int')
+      7.48±0.2ms       8.86±0.3ms     1.18  algorithms.Duplicated.time_duplicated(False, 'first', 'float')
+      16.9±0.2ms       19.4±0.3ms     1.15  hash_functions.UniqueAndFactorizeArange.time_factorize(11)
+      16.6±0.2ms       19.1±0.3ms     1.15  hash_functions.UniqueAndFactorizeArange.time_factorize(6)
+      16.6±0.5ms       19.1±0.1ms     1.15  hash_functions.UniqueAndFactorizeArange.time_factorize(12)
+         160±3ms         183±10ms     1.14  gil.ParallelGroupbyMethods.time_loop(8, 'mean')
+      16.9±0.3ms       19.3±0.4ms     1.14  hash_functions.UniqueAndFactorizeArange.time_factorize(7)
+         765±9ms        870±200ms     1.14  stat_ops.Correlation.time_corr_wide_nans('spearman')
+      17.4±0.5ms       19.5±0.2ms     1.12  hash_functions.UniqueAndFactorizeArange.time_factorize(8)
+     2.60±0.01μs      2.92±0.05μs     1.12  tslibs.fields.TimeGetTimedeltaField.time_get_timedelta_field(1, 'seconds')
+      17.2±0.4ms       19.2±0.3ms     1.12  hash_functions.UniqueAndFactorizeArange.time_factorize(5)
+     2.61±0.05μs       2.90±0.2μs     1.11  tslibs.fields.TimeGetTimedeltaField.time_get_timedelta_field(1, 'days')
+      3.53±0.2μs       3.92±0.3μs     1.11  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
+      17.4±0.5ms       19.1±0.4ms     1.10  hash_functions.UniqueAndFactorizeArange.time_factorize(9)
-        640±10μs          569±6μs     0.89  timedelta.TimedeltaIndexing.time_union
-        513±40ns          456±3ns     0.89  tslibs.period.PeriodProperties.time_property('M', 'dayofyear')
-         816±8ms         723±10ms     0.89  join_merge.I8Merge.time_i8merge('inner')
-         174±5μs          153±2μs     0.88  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 500000)
-      6.74±0.4μs       5.93±0.4μs     0.88  index_cached_properties.IndexCache.time_is_all_dates('PeriodIndex')
-        866±30μs         759±10μs     0.88  arithmetic.Ops2.time_frame_series_dot
-      2.26±0.2μs       1.97±0.2μs     0.87  index_cached_properties.IndexCache.time_values('PeriodIndex')
-      7.16±0.7μs       6.25±0.4μs     0.87  index_cached_properties.IndexCache.time_engine('PeriodIndex')
-        769±70ns         670±30ns     0.87  index_cached_properties.IndexCache.time_inferred_type('Int64Index')
-     9.33±0.04ms      8.12±0.07ms     0.87  multiindex_object.Integer.time_get_indexer
-      6.53±0.4μs       5.66±0.4μs     0.87  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
-        656±10μs          568±6μs     0.87  groupby.GroupByMethods.time_dtype_as_field('datetime', 'nunique', 'transformation')
-         586±4μs          507±7μs     0.87  groupby.GroupByMethods.time_dtype_as_field('float', 'nunique', 'direct')
-        555±50ns          478±2ns     0.86  tslibs.period.PeriodProperties.time_property('M', 'daysinmonth')
-      21.2±0.4ms       18.1±0.1ms     0.86  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-      4.12±0.3μs       3.52±0.3μs     0.86  index_cached_properties.IndexCache.time_values('Float64Index')
-      3.66±0.8μs       3.12±0.3μs     0.85  index_cached_properties.IndexCache.time_shape('PeriodIndex')
-      2.10±0.1ms      1.79±0.03ms     0.85  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-       167±0.3ms        140±0.9ms     0.84  index_object.Indexing.time_get_loc_non_unique('String')
-     2.33±0.07ms      1.96±0.02ms     0.84  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 500000)
-      5.54±0.2ms      4.64±0.07ms     0.84  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 500000)
-     4.83±0.09ms      4.04±0.04ms     0.84  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-         109±2μs         90.8±3μs     0.84  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-         220±4ms          183±4ms     0.83  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      4.24±0.1ms      3.51±0.09ms     0.83  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-      16.3±0.2ms       13.5±0.5ms     0.83  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-         305±8μs          253±4μs     0.83  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      16.7±0.4ms       13.8±0.1ms     0.83  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 1000000)
-     1.87±0.09ms      1.54±0.02ms     0.82  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-         219±5ms          180±5ms     0.82  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        310±10μs          254±2μs     0.82  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 1000000)
-     2.45±0.07ms      2.00±0.04ms     0.82  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 500000)
-         222±4ms          180±6ms     0.81  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-     1.90±0.02ms      1.55±0.04ms     0.81  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-     16.1±0.09ms       13.1±0.4ms     0.81  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-     4.44±0.07ms      3.59±0.05ms     0.81  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-      16.3±0.2ms       13.2±0.4ms     0.81  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      7.65±0.3ms      6.12±0.09ms     0.80  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 1000000)
-      4.56±0.2ms       3.65±0.2ms     0.80  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        7.47±1μs       5.95±0.4μs     0.80  index_cached_properties.IndexCache.time_is_all_dates('DatetimeIndex')
-     3.94±0.09ms      3.13±0.04ms     0.79  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'non_monotonic')
-     1.73±0.05ms      1.36±0.05ms     0.79  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-         112±1ms         87.3±1ms     0.78  multiindex_object.GetLoc.time_large_get_loc_warm
-     1.82±0.04ms      1.41±0.04ms     0.78  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-     2.18±0.06ms      1.67±0.06ms     0.76  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-     2.27±0.02ms      1.68±0.06ms     0.74  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-         142±2ms          104±2ms     0.74  frame_methods.Duplicated.time_frame_duplicated
-         104±2ms         74.8±2ms     0.72  join_merge.Align.time_series_align_int64_index
-         484±4μs          321±5μs     0.66  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        90.6±1ms         59.7±2ms     0.66  join_merge.Align.time_series_align_left_monotonic
-     2.68±0.07ms      1.75±0.03ms     0.65  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 500000)
-         252±4μs          162±4μs     0.64  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 500000)
-         182±2μs          114±1μs     0.63  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        70.7±1μs       43.7±0.5μs     0.62  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        483±10μs          287±3μs     0.59  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-     4.01±0.05ms      2.37±0.02ms     0.59  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      35.0±0.4ms       20.5±0.2ms     0.59  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-     4.32±0.04ms      2.50±0.05ms     0.58  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-         532±4μs          302±5μs     0.57  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      10.4±0.2ms      5.85±0.06ms     0.56  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-      5.81±0.2ms      3.15±0.04ms     0.54  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-      86.6±0.7ms       46.0±0.5ms     0.53  multiindex_object.Duplicated.time_duplicated
-     5.38±0.09ms      2.75±0.04ms     0.51  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-      17.2±0.3ms      8.50±0.07ms     0.49  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-      17.2±0.3ms       8.46±0.1ms     0.49  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         374±6ms        178±0.9ms     0.47  index_object.Indexing.time_get_loc('String')
-      97.0±0.9ms          137±2μs     0.00  multiindex_object.GetLoc.time_large_get_loc

@realead realead force-pushed the fix_elements_buckets_mixup branch from 6d6b943 to 850b594 Compare January 29, 2021 06:56
@realead realead force-pushed the fix_elements_buckets_mixup branch from 850b594 to c30e410 Compare January 30, 2021 15:58
@jreback jreback added this to the 1.3 milestone Jan 30, 2021
@jreback
Copy link
Contributor

jreback commented Jan 30, 2021

what if you allocated say 2 x needed? does this affect things, e.g. that's what the current code is doing, trading memory for time.

@realead
Copy link
Contributor Author

realead commented Jan 30, 2021

Allocating a bigger table than actually needed should have no positive effects for performance (on an average case). It could however have negative impact on performance: For example if the smaller table fits the cache but the bigger one would not - thus there would be additional cache-misses and a quite big regression of performance.

Something similar happens for example in

+     4.87±0.02ms       7.79±0.2ms     1.60  algorithms.Duplicated.time_duplicated(False, False, 'uint')

the series has 5e5 elements but only 1e5 unique elements, thus prior to this fix the table was about 2 times bigger than needed - after the fix about 4 times bigger than needed - this leads to additional costs.

But the issue is not this fix, but rather the used simple heuristic "preallocate as much space as needed for the worst case", which is good for the worst case, as for example

-      86.6±0.7ms       46.0±0.5ms     0.53  multiindex_object.Duplicated.time_duplicated

where due to the fix a rehashing of the table was avoided and thus about 50% of running time could be saved.

To me it looks as if i is a sane strategy to improve the worst case and to take some potential hits for other cases.

PS: I have changed the behavior of duplicated here f13fefb#diff-f4f1cb812ad02cac95da4070745323e8cf31dfbcc65da52744bb6721439210bbR124

@jreback jreback merged commit 8f14836 into pandas-dev:master Feb 2, 2021
@jreback
Copy link
Contributor

jreback commented Feb 2, 2021

thanks @realead nice as always! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants