[PERF] taking upper 32bit of PyObject_Hash into account #39592

Merged
merged 3 commits into from
Feb 7, 2021

realead
Contributor

@realead realead commented Feb 4, 2021

Until now the upper 32 bits of PyObject_Hash weren't taken into account at all. Because the hash function of many built-in objects (e.g. integers) is very simple, many "normal" series would have all hashes equal to 0, which would lead to O(n^2) running times.

Thus for 64bit builds, we need to mangle the upper 32bit into the resulting 32bit-hash.
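
To see why the upper bits matter, here is a minimal sketch (not the pandas code) that treats plain Python integers as 64-bit hash values and keeps only their lower 32 bits. The helper name `truncate_to_32` and the sample values are assumptions made for illustration.

```python
# Illustrative sketch: plain integers stand in for 64-bit hash values.
MASK32 = 0xFFFFFFFF

def truncate_to_32(h64: int) -> int:
    """Hypothetical helper: keep only the lower 32 bits of a 64-bit hash."""
    return h64 & MASK32

# Five distinct values that differ only in their upper 32 bits.
hashes = [k << 32 for k in range(1, 6)]
reduced = [truncate_to_32(h) for h in hashes]

print(reduced)            # [0, 0, 0, 0, 0]
print(len(set(reduced)))  # 1
```

All five distinct inputs collapse to the same 32-bit value, so a hash table keyed on the reduced hash sees them as one long collision chain.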

@realead
Contributor Author

realead commented Feb 4, 2021

asv_bench gives:

       before           after         ratio
     [dae99c72]       [832d9d73]
+       123±0.4ms          367±6ms     2.99  series_methods.IsInLongSeriesValuesDominate.time_isin('object', 'monotone')
+      47.9±0.6ms          103±2ms     2.16  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 1000000)
+     1.15±0.03ms      2.44±0.08ms     2.13  series_methods.IsInForObjects.time_isin_short_series_long_values
+     3.60±0.04ms       6.12±0.2ms     1.70  series_methods.IsInForObjects.time_isin_long_series_long_values
+     3.63±0.09ms      5.77±0.09ms     1.59  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 100000)
+         283±4ms          385±3ms     1.36  series_methods.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'random_misses')
+      3.59±0.2μs       4.75±0.3μs     1.32  index_cached_properties.IndexCache.time_values('CategoricalIndex')
+      6.15±0.1ms       7.86±0.3ms     1.28  series_methods.IsInForObjects.time_isin_long_series_long_values_floats
+      44.3±0.7ms         56.1±1ms     1.27  categoricals.Indexing.time_reindex_missing
+      6.09±0.5μs       7.43±0.6μs     1.22  index_cached_properties.IndexCache.time_is_all_dates('DatetimeIndex')
+      3.54±0.2μs       4.26±0.3μs     1.20  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+      5.41±0.6μs       6.50±0.5μs     1.20  index_cached_properties.IndexCache.time_shape('CategoricalIndex')
+      2.17±0.3μs       2.49±0.2μs     1.15  index_cached_properties.IndexCache.time_values('PeriodIndex')
+        461±10μs         527±10μs     1.14  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 8000)
+     7.50±0.09ms      8.51±0.07ms     1.14  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+        198±10μs         225±50μs     1.14  index_cached_properties.IndexCache.time_is_monotonic_increasing('DatetimeIndex')
+     8.35±0.07ms       9.45±0.1ms     1.13  multiindex_object.Integer.time_get_indexer
+      29.0±0.1ms       32.2±0.5ms     1.11  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, 0)
+        916±40ns      1.02±0.05μs     1.11  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
+      65.1±0.2ms         72.1±1ms     1.11  hash_functions.IsinWithArange.time_isin(<class 'object'>, 1000, 2)
+     1.23±0.05μs      1.36±0.09μs     1.11  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
+      3.18±0.3μs       3.52±0.3μs     1.11  index_cached_properties.IndexCache.time_shape('DatetimeIndex')
-     1.13±0.08ms         981±20μs     0.87  dtypes.SelectDtypes.time_select_dtype_string_include('float64')
-       106±0.5ms       91.6±0.7ms     0.87  hash_functions.IsinWithArange.time_isin(<class 'object'>, 2000, -2)
-        529±10μs         458±10μs     0.87  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 2000)
-        345±10μs          297±8μs     0.86  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 1300)
-        327±20μs          281±9μs     0.86  arithmetic.NumericInferOps.time_multiply(<class 'numpy.int8'>)
-        7.24±1μs       6.18±0.5μs     0.85  index_cached_properties.IndexCache.time_engine('PeriodIndex')
-         333±6μs          282±3μs     0.85  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 1300)
-     1.86±0.06ms      1.56±0.04ms     0.84  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 8000)
-         547±8μs          457±8μs     0.84  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 2000)
-      3.97±0.7μs       3.32±0.5μs     0.83  index_cached_properties.IndexCache.time_shape('PeriodIndex')
-     1.92±0.03ms      1.59±0.03ms     0.83  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 8000)
-     1.64±0.03ms      1.35±0.03ms     0.82  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 7000)
-      23.4±0.4ms         19.1±1ms     0.82  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 80000)
-     1.75±0.06ms      1.40±0.02ms     0.80  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 7000)
-      18.8±0.9ms       14.9±0.7ms     0.80  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 70000)
-        712±10ms          562±5ms     0.79  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 900000)
-         495±8ms         383±10ms     0.77  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 750000)
-         531±2ms          385±3ms     0.72  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 750000)
-        17.1±1ms       12.3±0.6ms     0.72  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 70000)
-         659±7ms          441±5ms     0.67  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 900000)
-      95.6±0.6ms       62.8±0.3ms     0.66  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, -2)
-       138±0.7ms       87.1±0.7ms     0.63  hash_functions.IsinWithArange.time_isin(<class 'object'>, 2000, 2)
-         110±1ms       62.6±0.7ms     0.57  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, 2)
-      72.2±0.3ms       12.0±0.3ms     0.17  hash_functions.IsinWithArange.time_isin(<class 'object'>, 1000, -2)
-         371±3ms         694±20μs     0.00  hash_functions.UniqueForLargePyObjectInts.time_unique

As always when changing a hash function, it is a mixed bag: for some cases the old hash function was better, for others the new one.

However, one can see that the O(n^2) running time is now avoided:

-         371±3ms         694±20μs     0.00  hash_functions.UniqueForLargePyObjectInts.time_unique
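
The quadratic behavior can be sketched with a toy open-addressing table. This is only an illustration under stated assumptions: it uses linear probing and a hypothetical `count_probes` helper, which is not necessarily the probing scheme of the hash table pandas uses.

```python
def count_probes(hashes, n_buckets=1024):
    """Insert one distinct key per hash into a linear-probing table
    and return the number of extra probes caused by occupied slots."""
    table = [False] * n_buckets
    probes = 0
    for h in hashes:
        i = h % n_buckets
        while table[i]:          # slot taken: step to the next one
            probes += 1
            i = (i + 1) % n_buckets
        table[i] = True
    return probes

n = 200
colliding = count_probes([0] * n)      # every key reduces to hash 0
spread = count_probes(list(range(n)))  # every key gets its own slot

print(colliding, spread)  # 19900 0
```

With identical hashes the k-th insertion needs k-1 extra probes, so the total grows as n(n-1)/2, while well-spread hashes need none.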

@jreback jreback added the Performance Memory or execution speed performance label Feb 4, 2021
@jreback
Contributor

jreback commented Feb 4, 2021

+       123±0.4ms          367±6ms     2.99  series_methods.IsInLongSeriesValuesDominate.time_isin('object', 'monotone')
+      47.9±0.6ms          103±2ms     2.16  hash_functions.IsinWithArangeSorted.time_isin(<class 'object'>, 1000000)

are these degenerate?

@realead
Contributor Author

realead commented Feb 4, 2021

@jreback

After thinking about it, I have opted for a slightly different hash reduction from 64bit->32bit. It still fixes the O(n^2) issue but otherwise changes the behavior only minimally, thus preserving more of the original performance in the corner cases you have highlighted.
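
One reduction that has both properties is an XOR fold of the upper half into the lower half. The sketch below is an assumption for illustration (the name `fold_to_32` is hypothetical, and whether it matches the merged code exactly is not stated in this thread); plain integers stand in for 64-bit hash values.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def fold_to_32(h64: int) -> int:
    """Hypothetical helper: XOR the upper 32 bits into the lower 32 bits."""
    u = h64 & MASK64              # view the hash as an unsigned 64-bit value
    return ((u >> 32) ^ u) & MASK32

# Hashes that already fit in 32 bits are left unchanged,
# so the behavior for such keys stays as before.
small = [0, 1, 42, 2**32 - 1]
unchanged = all(fold_to_32(h) == h for h in small)

# Hashes that differ only in the upper 32 bits no longer collide.
large = [k << 32 for k in range(1, 6)]
folded = [fold_to_32(h) for h in large]

print(unchanged)  # True
print(folded)     # [1, 2, 3, 4, 5]
```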

The timings are now

       before           after         ratio
     [dae99c72]       [2fe320b7]
+      5.60±0.3μs       7.04±0.9μs     1.26  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
+      10.5±0.6μs         12.2±2μs     1.16  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
+      2.06±0.2μs       2.38±0.3μs     1.15  index_cached_properties.IndexCache.time_values('DatetimeIndex')
+     3.64±0.05ms      4.20±0.09ms     1.15  series_methods.IsInForObjects.time_isin_long_series_long_values
+      3.82±0.2μs       4.39±0.5μs     1.15  index_cached_properties.IndexCache.time_values('TimedeltaIndex')
+      3.51±0.2μs       3.96±0.4μs     1.13  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+      10.9±0.5μs       12.2±0.9μs     1.12  index_cached_properties.IndexCache.time_engine('UInt64Index')
+      4.85±0.4μs       5.37±0.3μs     1.11  index_cached_properties.IndexCache.time_shape('UInt64Index')
+        11.1±1μs       12.3±0.8μs     1.10  index_cached_properties.IndexCache.time_engine('Float64Index')
+         148±2ms          161±3ms     1.09  series_methods.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'random_hits')
+      6.28±0.5μs       6.82±0.6μs     1.09  index_cached_properties.IndexCache.time_is_all_dates('DatetimeIndex')
+      4.00±0.4μs       4.31±0.4μs     1.08  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')
+      12.1±0.7μs       12.9±0.7μs     1.07  index_cached_properties.IndexCache.time_is_all_dates('IntervalIndex')
+      9.87±0.6μs       10.5±0.4μs     1.06  index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+         129±2ms          136±2ms     1.05  series_methods.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'monotone_hits')
-        700±40ns         664±30ns     0.95  index_cached_properties.IndexCache.time_is_unique('RangeIndex')
-     1.01±0.06μs         955±50ns     0.95  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
-     1.82±0.04ms      1.72±0.01ms     0.94  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 500000)
-     1.39±0.05μs      1.30±0.06μs     0.94  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
-      3.85±0.3μs       3.61±0.2μs     0.94  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
-         421±9ms          390±6ms     0.93  series_methods.IsInLongSeriesValuesDominate.time_isin('int32', 'random')
-     1.26±0.09μs      1.16±0.06μs     0.92  index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
-        739±50ns         678±30ns     0.92  index_cached_properties.IndexCache.time_is_unique('Int64Index')
-      3.95±0.3μs       3.61±0.2μs     0.91  index_cached_properties.IndexCache.time_values('Float64Index')
-      7.56±0.9μs       6.87±0.7μs     0.91  index_cached_properties.IndexCache.time_is_all_dates('PeriodIndex')
-      4.13±0.3μs       3.74±0.2μs     0.91  index_cached_properties.IndexCache.time_values('IntervalIndex')
-        773±30ns         692±30ns     0.89  index_cached_properties.IndexCache.time_inferred_type('Int64Index')
-        810±30ns         686±60ns     0.85  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')
-        353±20μs         294±10μs     0.84  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 1300)
-        547±40μs         450±10μs     0.82  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 2000)
-      19.2±0.6ms       15.1±0.7ms     0.78  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 70000)
-     1.79±0.09ms      1.40±0.05ms     0.78  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 7000)
-     1.98±0.04ms      1.54±0.05ms     0.78  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 8000)
-        17.3±1ms       13.2±0.8ms     0.76  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 70000)
-        741±20ms         561±10ms     0.76  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 900000)
-        503±20ms          379±9ms     0.75  hash_functions.IsinWithRandomFloat.time_isin(<class 'object'>, 750000)
-         564±8ms         388±10ms     0.69  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 750000)
-         670±7ms          459±3ms     0.68  hash_functions.IsinWithRandomFloat.time_isin_outside(<class 'object'>, 900000)
-      73.2±0.8ms       11.9±0.2ms     0.16  hash_functions.IsinWithArange.time_isin(<class 'object'>, 1000, -2)
-      96.5±0.5ms       12.4±0.3ms     0.13  hash_functions.IsinWithArange.time_isin(<class 'object'>, 8000, -2)
-         107±1ms       12.0±0.5ms     0.11  hash_functions.IsinWithArange.time_isin(<class 'object'>, 2000, -2)
-         371±4ms         399±10μs     0.00  hash_functions.UniqueForLargePyObjectInts.time_unique

@jreback jreback added this to the 1.3 milestone Feb 4, 2021
@jreback
Contributor

jreback commented Feb 4, 2021

ok @realead looks good!

@jreback
Contributor

jreback commented Feb 5, 2021

can you add a whatsnew note (alt ok to just add this issue onto one of the previous ones for hashing). merge master and ping on greenish

@realead
Contributor Author

realead commented Feb 6, 2021

@jreback green

@jreback jreback merged commit dbb88c7 into pandas-dev:master Feb 7, 2021
@jreback
Contributor

jreback commented Feb 7, 2021

thanks @realead

CyberQin pushed a commit to CyberQin/pandas that referenced this pull request Feb 8, 2021
@realead realead deleted the fix_gh_37615 branch February 9, 2021 07:17
realead added a commit to realead/pandas that referenced this pull request Jun 19, 2021
Labels
Performance Memory or execution speed performance
Development

Successfully merging this pull request may close these issues.

BUG: Hash-function doesn't take upper bits into consideration for PyObjects
2 participants