[PERF] taking upper 32bit of PyObject_Hash into account #39592

realead · 2021-02-04T07:05:23Z

closes BUG: Hash-function doesn't take upper bits into consideration for PyObjects #37615
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Until now, the upper 32 bits of PyObject_Hash have not been taken into account at all. Because the hash function of many built-in objects is very simple (e.g. integers), many "normal" series would end up with all hashes equal to 0, which leads to O(n^2) running times.

Thus, for 64-bit builds, we need to mix the upper 32 bits into the resulting 32-bit hash.
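
To illustrate the problem, here is a minimal Python sketch (not the C code from this PR). It assumes the old behavior amounts to keeping only the low 32 bits of the 64-bit hash, and it uses an XOR fold as one possible way to mix in the upper half. The helper names `truncate_to_32` and `fold_to_32` are hypothetical, and the integers below stand in for 64-bit hash values.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def truncate_to_32(h):
    """Assumed old behavior: keep only the low 32 bits of the hash."""
    return h & MASK32


def fold_to_32(h):
    """One possible reduction (an assumption): XOR the upper half into the lower."""
    u = h & MASK64  # reinterpret as an unsigned 64-bit value
    return ((u >> 32) ^ u) & MASK32


# Stand-in 64-bit hashes whose low 32 bits are all zero (multiples of 2**32).
hashes = [i << 32 for i in range(1, 9)]

truncated = {truncate_to_32(h) for h in hashes}
folded = {fold_to_32(h) for h in hashes}

print(len(truncated))  # every value collapses to the same 32-bit hash
print(len(folded))     # the fold keeps the values apart
```

With truncation, all eight values collapse to a single 32-bit hash; with the fold, they stay distinct.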

realead · 2021-02-04T07:09:10Z

asv_bench gives:

       before           after         ratio
     [dae99c72]       [832d9d73]
+       123±0.4ms          367±6ms     2.99  series_methods.IsInLongSeriesValuesDominate.time_isin('object', 'monotone')
+      47.9±0.6ms          103±2ms     2.16  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 1000000)
+     1.15±0.03ms      2.44±0.08ms     2.13  series_methods.IsInForObjects.time_isin_short_series_long_values
+     3.60±0.04ms       6.12±0.2ms     1.70  series_methods.IsInForObjects.time_isin_long_series_long_values
+     3.63±0.09ms      5.77±0.09ms     1.59  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 100000)
+         283±4ms          385±3ms     1.36  series_methods.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'random_misses')
+      3.59±0.2μs       4.75±0.3μs     1.32  index_cached_properties.IndexCache.time_values('CategoricalIndex')
+      6.15±0.1ms       7.86±0.3ms     1.28  series_methods.IsInForObjects.time_isin_long_series_long_values_floats
+      44.3±0.7ms         56.1±1ms     1.27  categoricals.Indexing.time_reindex_missing
+      6.09±0.5μs       7.43±0.6μs     1.22  index_cached_properties.IndexCache.time_is_all_dates('DatetimeIndex')
+      3.54±0.2μs       4.26±0.3μs     1.20  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+      5.41±0.6μs       6.50±0.5μs     1.20  index_cached_properties.IndexCache.time_shape('CategoricalIndex')
+      2.17±0.3μs       2.49±0.2μs     1.15  index_cached_properties.IndexCache.time_values('PeriodIndex')
+        461±10μs         527±10μs     1.14  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 8000)
+     7.50±0.09ms      8.51±0.07ms     1.14  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+        198±10μs         225±50μs     1.14  index_cached_properties.IndexCache.time_is_monotonic_increasing('DatetimeIndex')
+     8.35±0.07ms       9.45±0.1ms     1.13  multiindex_object.Integer.time_get_indexer
+      29.0±0.1ms       32.2±0.5ms     1.11  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, 0)
+        916±40ns      1.02±0.05μs     1.11  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
+      65.1±0.2ms         72.1±1ms     1.11  hash_functions.IsinWithArange.time_isin(<class 'object'>, 1000, 2)
+     1.23±0.05μs      1.36±0.09μs     1.11  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
+      3.18±0.3μs       3.52±0.3μs     1.11  index_cached_properties.IndexCache.time_shape('DatetimeIndex')
-     1.13±0.08ms         981±20μs     0.87  dtypes.SelectDtypes.time_select_dtype_string_include('float64')
-       106±0.5ms       91.6±0.7ms     0.87  hash_functions.IsinWithArange.time_isin(<class 'object'>, 2000, -2)
-        529±10μs         458±10μs     0.87  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 2000)
-        345±10μs          297±8μs     0.86  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 1300)
-        327±20μs          281±9μs     0.86  arithmetic.NumericInferOps.time_multiply(<class 'numpy.int8'>)
-        7.24±1μs       6.18±0.5μs     0.85  index_cached_properties.IndexCache.time_engine('PeriodIndex')
-         333±6μs          282±3μs     0.85  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 1300)
-     1.86±0.06ms      1.56±0.04ms     0.84  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 8000)
-         547±8μs          457±8μs     0.84  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 2000)
-      3.97±0.7μs       3.32±0.5μs     0.83  index_cached_properties.IndexCache.time_shape('PeriodIndex')
-     1.92±0.03ms      1.59±0.03ms     0.83  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 8000)
-     1.64±0.03ms      1.35±0.03ms     0.82  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 7000)
-      23.4±0.4ms         19.1±1ms     0.82  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 80000)
-     1.75±0.06ms      1.40±0.02ms     0.80  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 7000)
-      18.8±0.9ms       14.9±0.7ms     0.80  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 70000)
-        712±10ms          562±5ms     0.79  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 900000)
-         495±8ms         383±10ms     0.77  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 750000)
-         531±2ms          385±3ms     0.72  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 750000)
-        17.1±1ms       12.3±0.6ms     0.72  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 70000)
-         659±7ms          441±5ms     0.67  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 900000)
-      95.6±0.6ms       62.8±0.3ms     0.66  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, -2)
-       138±0.7ms       87.1±0.7ms     0.63  hash_functions.IsinWithArange.time_isin(<class 'object'>, 2000, 2)
-         110±1ms       62.6±0.7ms     0.57  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, 2)
-      72.2±0.3ms       12.0±0.3ms     0.17  hash_functions.IsinWithArange.time_isin(<class 'object'>, 1000, -2)
-         371±3ms         694±20μs     0.00  hash_functions.UniqueForLargePyObjectInts.time_unique

As always when changing a hash function, it is a mixed bag: for some cases the old hash function was better, for others the new one.

However, one can see that the O(n^2) running time is now avoided:

-         371±3ms         694±20μs     0.00  hash_functions.UniqueForLargePyObjectInts.time_unique
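
Why identical hashes lead to quadratic running time can be sketched with a toy open-addressing table. This is a simplified model, not pandas' khash implementation: it uses linear probing (the real probing scheme may differ), and `count_probes` is a hypothetical helper that counts how many slots are inspected while inserting distinct keys with the given reduced hashes.

```python
def count_probes(reduced_hashes, n_buckets):
    """Insert one distinct key per hash into a linear-probing table.

    Returns the total number of slots inspected across all insertions.
    """
    occupied = [False] * n_buckets
    probes = 0
    for h in reduced_hashes:
        i = h % n_buckets
        probes += 1
        while occupied[i]:  # slot taken by another key: move to the next one
            i = (i + 1) % n_buckets
            probes += 1
        occupied[i] = True
    return probes


n = 1000
all_colliding = count_probes([0] * n, 2048)              # every key hashes to 0
well_spread = count_probes(list(range(1, n + 1)), 2048)  # every key has its own slot

print(all_colliding, well_spread)
```

In this model, the colliding case inspects n(n+1)/2 slots, while the well-spread case inspects n.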

jreback · 2021-02-04T15:18:39Z

  123±0.4ms          367±6ms     2.99  series_methods.IsInLongSeriesValuesDominate.time_isin('object', 'monotone')

 47.9±0.6ms          103±2ms     2.16  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 1000000)

are these degenerate?

realead · 2021-02-04T21:55:29Z

@jreback

After thinking about it, I have opted for a slightly different hash reduction from 64 bits to 32 bits. It has the advantage that it fixes the O(n^2) issue while otherwise making minimal changes to the behavior in other cases, thus staying closer to the original performance in some of the corner cases you have highlighted.
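
As an illustration of what "minimal changes" can mean for such a reduction, the sketch below compares two hypothetical reductions. Neither is taken from the PR's diff, which is not shown here: `xor_fold` leaves any hash that already fits into 32 bits unchanged, while `multiplicative_mix` (a generic multiply-and-shift mixer, chosen only for contrast) changes every value.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def xor_fold(h):
    """Hypothetical reduction: XOR the upper 32 bits into the lower 32 bits."""
    u = h & MASK64
    return ((u >> 32) ^ u) & MASK32


def multiplicative_mix(h):
    """Hypothetical contrast: multiply by a 64-bit odd constant, keep the top half."""
    u = h & MASK64
    return ((u * 0x9E3779B97F4A7C15) & MASK64) >> 32


small = list(range(1, 6))  # stand-in hashes that already fit into 32 bits

# The fold is the identity on these values, so bucket placement is unchanged.
unchanged = [xor_fold(h) for h in small] == small

# The multiplicative mixer scrambles even small values.
scrambled = [multiplicative_mix(h) for h in small] != small

print(unchanged, scrambled)
```

A reduction with this property keeps the previous bucket layout for inputs whose hashes have no upper bits set, which is one plausible reading of why the earlier regressions shrink in the timings.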

The timings are now

       before           after         ratio
     [dae99c72]       [2fe320b7]
+      5.60±0.3μs       7.04±0.9μs     1.26  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
+      10.5±0.6μs         12.2±2μs     1.16  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
+      2.06±0.2μs       2.38±0.3μs     1.15  index_cached_properties.IndexCache.time_values('DatetimeIndex')
+     3.64±0.05ms      4.20±0.09ms     1.15  series_methods.IsInForObjects.time_isin_long_series_long_values
+      3.82±0.2μs       4.39±0.5μs     1.15  index_cached_properties.IndexCache.time_values('TimedeltaIndex')
+      3.51±0.2μs       3.96±0.4μs     1.13  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+      10.9±0.5μs       12.2±0.9μs     1.12  index_cached_properties.IndexCache.time_engine('UInt64Index')
+      4.85±0.4μs       5.37±0.3μs     1.11  index_cached_properties.IndexCache.time_shape('UInt64Index')
+        11.1±1μs       12.3±0.8μs     1.10  index_cached_properties.IndexCache.time_engine('Float64Index')
+         148±2ms          161±3ms     1.09  series_methods.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'random_hits')
+      6.28±0.5μs       6.82±0.6μs     1.09  index_cached_properties.IndexCache.time_is_all_dates('DatetimeIndex')
+      4.00±0.4μs       4.31±0.4μs     1.08  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')
+      12.1±0.7μs       12.9±0.7μs     1.07  index_cached_properties.IndexCache.time_is_all_dates('IntervalIndex')
+      9.87±0.6μs       10.5±0.4μs     1.06  index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+         129±2ms          136±2ms     1.05  series_methods.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'monotone_hits')
-        700±40ns         664±30ns     0.95  index_cached_properties.IndexCache.time_is_unique('RangeIndex')
-     1.01±0.06μs         955±50ns     0.95  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
-     1.82±0.04ms      1.72±0.01ms     0.94  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 500000)
-     1.39±0.05μs      1.30±0.06μs     0.94  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
-      3.85±0.3μs       3.61±0.2μs     0.94  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
-         421±9ms          390±6ms     0.93  series_methods.IsInLongSeriesValuesDominate.time_isin('int32', 'random')
-     1.26±0.09μs      1.16±0.06μs     0.92  index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
-        739±50ns         678±30ns     0.92  index_cached_properties.IndexCache.time_is_unique('Int64Index')
-      3.95±0.3μs       3.61±0.2μs     0.91  index_cached_properties.IndexCache.time_values('Float64Index')
-      7.56±0.9μs       6.87±0.7μs     0.91  index_cached_properties.IndexCache.time_is_all_dates('PeriodIndex')
-      4.13±0.3μs       3.74±0.2μs     0.91  index_cached_properties.IndexCache.time_values('IntervalIndex')
-        773±30ns         692±30ns     0.89  index_cached_properties.IndexCache.time_inferred_type('Int64Index')
-        810±30ns         686±60ns     0.85  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')
-        353±20μs         294±10μs     0.84  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 1300)
-        547±40μs         450±10μs     0.82  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 2000)
-      19.2±0.6ms       15.1±0.7ms     0.78  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 70000)
-     1.79±0.09ms      1.40±0.05ms     0.78  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 7000)
-     1.98±0.04ms      1.54±0.05ms     0.78  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 8000)
-        17.3±1ms       13.2±0.8ms     0.76  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 70000)
-        741±20ms         561±10ms     0.76  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 900000)
-        503±20ms          379±9ms     0.75  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 750000)
-         564±8ms         388±10ms     0.69  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 750000)
-         670±7ms          459±3ms     0.68  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 900000)
-      73.2±0.8ms       11.9±0.2ms     0.16  hash_functions.IsinWithArange.time_isin(<class 'object'>, 1000, -2)
-      96.5±0.5ms       12.4±0.3ms     0.13  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, -2)
-         107±1ms       12.0±0.5ms     0.11  hash_functions.IsinWithArange.time_isin(<class 'object'>, 2000, -2)
-         371±4ms         399±10μs     0.00  hash_functions.UniqueForLargePyObjectInts.time_unique

jreback · 2021-02-04T22:42:52Z

ok @realead looks good!

jreback · 2021-02-05T03:21:40Z

can you add a whatsnew note (alt ok to just add this issue onto one of the previous ones for hashing). merge master and ping on greenish

realead · 2021-02-06T07:56:09Z

@jreback green

jreback · 2021-02-07T16:42:11Z

thanks @realead

jreback added the Performance (Memory or execution speed performance) label Feb 4, 2021

jreback added this to the 1.3 milestone Feb 4, 2021

realead added 3 commits February 5, 2021 23:40

taking upper 32bit of PyHash into account as well

c4bf1ed

using a simpler hash, minimizing changes to the original state

57dceb1

adding whatsnew

1a8c2dd

realead force-pushed the fix_gh_37615 branch from 9e69c7d to 1a8c2dd Compare February 5, 2021 22:43

jreback merged commit dbb88c7 into pandas-dev:master Feb 7, 2021

CyberQin pushed a commit to CyberQin/pandas that referenced this pull request Feb 8, 2021

[PERF] taking upper 32bit of PyObject_Hash into account (pandas-dev#3…

ba60690

…9592)

realead deleted the fix_gh_37615 branch February 9, 2021 07:17

realead added a commit to realead/pandas that referenced this pull request Jun 19, 2021

fix signess (should be unsigned) of the return type for hash, was wro…

b2ecad5

…ng defined in pandas-dev#39592
