-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Remove inline from cython classes #49873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove inline from cython classes #49873
Conversation
FWIW gcc has an https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gcc/Optimize-Options.html But right now I think that gets clobbered by the amount of manually specified inlines we do. There's a mixed bag of errors but a lot are |
Similar function for CPython discourages its use: https://docs.python.org/3/c-api/intro.html#c.Py_ALWAYS_INLINE
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there no more warnings in these spots where you removed inline?
Also do % timit
s show no performance change in the relevant code paths?
Wasn't sure the best way to do a timeit so went a little more extreme, removed as many inline statements as possible (couldn't remove them from util.pxd) and ran the full test suite overnight. Results are below - I think any changes are spurious. before after ratio
[82d271fe] [9d9cca88]
<main> <remove-class-level-inline>
+ 1.52±0.01ms 3.17±1ms 2.08 gil.ParallelRolling.time_rolling('mean')
+ 130±3μs 194±1μs 1.49 index_cached_properties.IndexCache.time_is_monotonic_increasing('CategoricalIndex')
+ 133±6μs 194±0.8μs 1.46 index_cached_properties.IndexCache.time_is_monotonic_decreasing('CategoricalIndex')
+ 203±0.6μs 269±1μs 1.33 categoricals.Indexing.time_intersection
+ 1.67±0.07ms 2.08±0.5ms 1.25 index_cached_properties.IndexCache.time_is_unique('MultiIndex')
+ 12.5±0.04ms 15.6±4ms 1.24 rolling.Rank.time_rank('Series', 10, 'int', True, True, 'average')
+ 1.41±0.01ms 1.73±0.01ms 1.23 timeseries.Lookup.time_lookup_and_cleanup
+ 98.1±0.3μs 119±0.5μs 1.21 index_cached_properties.IndexCache.time_is_monotonic_decreasing('PeriodIndex')
+ 98.2±0.4μs 119±0.5μs 1.21 index_cached_properties.IndexCache.time_is_monotonic_increasing('PeriodIndex')
+ 100±0.1μs 121±0.1μs 1.21 index_cached_properties.IndexCache.time_is_monotonic_decreasing('DatetimeIndex')
+ 99.3±0.7μs 120±0.4μs 1.21 index_cached_properties.IndexCache.time_is_monotonic_decreasing('TimedeltaIndex')
+ 1.46±0ms 1.76±0.3ms 1.21 index_cached_properties.IndexCache.time_is_unique('Float64Index')
+ 100.0±0.2μs 120±0.3μs 1.20 index_cached_properties.IndexCache.time_is_monotonic_increasing('DatetimeIndex')
+ 99.5±0.5μs 120±0.4μs 1.20 index_cached_properties.IndexCache.time_is_monotonic_increasing('TimedeltaIndex')
+ 949±5μs 1.11±0ms 1.17 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'sum')
+ 1.65±0ms 1.91±0.01ms 1.16 algorithms.Hashing.time_series_categorical
+ 1.02±0ms 1.17±0.02ms 1.15 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'float', 'mean')
+ 983±4μs 1.13±0ms 1.15 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'mean')
+ 926±4μs 1.06±0ms 1.15 rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'float', 'sum')
+ 929±2μs 1.07±0ms 1.15 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'float', 'sum')
+ 965±2μs 1.11±0.01ms 1.15 rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'sum')
+ 1.01±0ms 1.15±0.01ms 1.14 rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'float', 'sum')
+ 1.04±0.01ms 1.19±0.01ms 1.14 rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'int', 'sum')
+ 1000±4μs 1.14±0ms 1.14 rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'float', 'sum')
+ 993±2μs 1.13±0.01ms 1.14 rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'mean')
+ 1.01±0ms 1.15±0ms 1.14 rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'float', 'sum')
+ 1.05±0.01ms 1.20±0.01ms 1.14 rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'int', 'sum')
+ 1.09±0ms 1.24±0.01ms 1.14 rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'float', 'mean')
+ 1.08±0.01ms 1.22±0.01ms 1.13 rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'int', 'mean')
+ 1.08±0ms 1.22±0.01ms 1.13 rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'int', 'mean')
+ 1.04±0ms 1.17±0ms 1.13 rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'int', 'sum')
+ 791±6μs 892±6μs 1.13 rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'sum')
+ 1.01±0ms 1.14±0ms 1.13 rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'float', 'sum')
+ 1.03±0.01ms 1.16±0ms 1.13 rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'float', 'mean')
+ 823±5μs 926±5μs 1.13 rolling.Methods.time_method('Series', ('expanding', {}), 'int', 'sum')
+ 837±3μs 940±3μs 1.12 rolling.Methods.time_method('Series', ('expanding', {}), 'int', 'mean')
+ 1.04±0.01ms 1.17±0ms 1.12 rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'int', 'sum')
+ 1.11±0ms 1.24±0ms 1.12 rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'float', 'mean')
+ 1.13±0.01ms 1.27±0.01ms 1.12 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'int', 'sum')
+ 1.09±0ms 1.22±0ms 1.12 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'float', 'sum')
+ 803±3μs 899±3μs 1.12 rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'mean')
+ 1.21±0ms 1.36±0.01ms 1.12 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'int', 'mean')
+ 1.09±0ms 1.22±0.01ms 1.12 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'float', 'sum')
+ 965±20μs 1.08±0.06ms 1.12 period.PeriodIndexConstructor.time_from_pydatetime('D', False)
+ 1.32±0ms 1.48±0.01ms 1.12 rolling.VariableWindowMethods.time_method('Series', '50s', 'int', 'mean')
+ 1.13±0ms 1.26±0ms 1.12 rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'int', 'mean')
+ 1.13±0.01ms 1.26±0.01ms 1.12 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'int', 'sum')
+ 1.10±0ms 1.23±0ms 1.12 rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'float', 'mean')
+ 1.11±0.01ms 1.24±0.01ms 1.12 rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'float', 'mean')
+ 882±3μs 985±8μs 1.12 rolling.Methods.time_method('DataFrame', ('expanding', {}), 'float', 'sum')
+ 1.18±0.01ms 1.32±0.01ms 1.12 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'float', 'mean')
+ 1.14±0.01ms 1.27±0ms 1.12 rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'int', 'mean')
+ 1.19±0ms 1.33±0ms 1.11 rolling.VariableWindowMethods.time_method('Series', '1h', 'float', 'sum')
+ 854±30ns 951±100ns 1.11 index_cached_properties.IndexCache.time_values('DatetimeIndex')
+ 1.18±0ms 1.31±0ms 1.11 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'float', 'mean')
+ 1.22±0ms 1.36±0.01ms 1.11 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'int', 'mean')
+ 1.24±0.02ms 1.38±0.02ms 1.11 rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'float', 'count')
+ 1.38±0ms 1.53±0ms 1.11 rolling.VariableWindowMethods.time_method('DataFrame', '1h', 'float', 'mean')
+ 1.19±0ms 1.32±0ms 1.11 rolling.VariableWindowMethods.time_method('Series', '1d', 'float', 'sum')
+ 1.67±0ms 1.85±0.01ms 1.11 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'kurt')
+ 1.34±0.02ms 1.48±0.01ms 1.11 rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'float', 'count')
+ 1.24±0ms 1.37±0.01ms 1.11 rolling.VariableWindowMethods.time_method('Series', '50s', 'int', 'sum')
+ 1.32±0ms 1.47±0.01ms 1.11 rolling.VariableWindowMethods.time_method('Series', '1h', 'int', 'mean')
+ 1.23±0.01ms 1.36±0ms 1.11 rolling.VariableWindowMethods.time_method('Series', '1d', 'int', 'sum')
+ 1.42±0ms 1.57±0ms 1.11 rolling.VariableWindowMethods.time_method('DataFrame', '1h', 'int', 'mean')
+ 558±1μs 618±1μs 1.11 reindex.ReindexMethod.time_reindex_method('pad', <function period_range>)
+ 1.30±0ms 1.43±0ms 1.11 rolling.VariableWindowMethods.time_method('DataFrame', '50s', 'float', 'sum')
+ 961±4μs 1.06±0ms 1.11 rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'std')
+ 553±2μs 612±1μs 1.11 reindex.ReindexMethod.time_reindex_method('pad', <function date_range>)
+ 1.24±0.01ms 1.37±0.02ms 1.11 rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'float', 'count')
+ 1.29±0ms 1.43±0.01ms 1.10 rolling.VariableWindowMethods.time_method('Series', '50s', 'float', 'mean')
+ 926±8μs 1.02±0ms 1.10 rolling.Methods.time_method('DataFrame', ('expanding', {}), 'int', 'mean')
+ 1.28±0ms 1.41±0ms 1.10 rolling.VariableWindowMethods.time_method('Series', '1d', 'float', 'mean')
+ 1.33±0.01ms 1.47±0ms 1.10 rolling.VariableWindowMethods.time_method('DataFrame', '50s', 'int', 'sum')
+ 894±1μs 984±2μs 1.10 rolling.Methods.time_method('DataFrame', ('expanding', {}), 'float', 'mean')
+ 920±6μs 1.01±0.01ms 1.10 rolling.Methods.time_method('DataFrame', ('expanding', {}), 'int', 'sum')
+ 1.31±0ms 1.44±0ms 1.10 rolling.VariableWindowMethods.time_method('Series', '1d', 'int', 'mean')
+ 1.25±0.01ms 1.37±0ms 1.10 rolling.VariableWindowMethods.time_method('Series', '1h', 'int', 'sum')
+ 1.34±0ms 1.47±0ms 1.10 rolling.VariableWindowMethods.time_method('DataFrame', '1h', 'int', 'sum')
- 181±0.3μs 165±1μs 0.91 index_cached_properties.IndexCache.time_is_monotonic_increasing('UInt64Index')
- 82.4±10ms 74.6±0.5ms 0.91 algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.float64'>, 1000000)
- 2.24±0.08ms 2.02±0ms 0.90 frame_methods.Fillna.time_frame_fillna(False, 'pad', 'float32')
- 1.45±0.5μs 1.31±0μs 0.90 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'month', None, 12)
- 749±30μs 672±30μs 0.90 dtypes.SelectDtypes.time_select_dtype_string_exclude('Int8')
- 653±30μs 585±20μs 0.90 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ge>)
- 473±1μs 421±10μs 0.89 dtypes.SelectDtypes.time_select_dtype_int_exclude('uint32')
- 454±20μs 401±5μs 0.88 dtypes.SelectDtypes.time_select_dtype_int_exclude('complex64')
- 454±1μs 400±10μs 0.88 dtypes.SelectDtypes.time_select_dtype_int_include(<class 'int'>)
- 11.1±1ms 9.77±0.03ms 0.88 arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function gt>, (10000, 1000))
- 3.16±0.02ms 2.78±0.2ms 0.88 frame_methods.Fillna.time_frame_fillna(True, 'pad', 'float64')
- 3.00±0.1ms 2.58±0.01ms 0.86 frame_methods.Fillna.time_frame_fillna(True, 'pad', 'float32')
- 3.03±0.1ms 2.59±0.01ms 0.85 frame_methods.Fillna.time_frame_fillna(True, 'bfill', 'float32')
- 14.7±0.4ms 12.4±0.2ms 0.84 multiindex_object.SetOperations.time_operation('monotonic', 'int', 'intersection', None)
- 768±30μs 640±20μs 0.83 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ne>)
- 703±100μs 577±20μs 0.82 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function eq>)
- 762±60μs 615±30μs 0.81 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ge>)
- 176±3μs 137±0.2μs 0.78 index_cached_properties.IndexCache.time_is_monotonic_increasing('Float64Index')
- 2.22±0.3μs 1.73±0.1μs 0.78 index_cached_properties.IndexCache.time_shape('CategoricalIndex')
- 1.77±0.3μs 1.37±0.1μs 0.77 index_cached_properties.IndexCache.time_values('CategoricalIndex')
- 178±3μs 137±0.3μs 0.77 index_cached_properties.IndexCache.time_is_monotonic_decreasing('Float64Index')
- 972±200μs 562±20μs 0.58 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function lt>)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED. As far as warnings go, this gets us down a few thousand, though there's still 150k+ lines of warnings. I don't know we will ever be able to use Winline since I think Cython uses it a lot. |
mind double-checking the worst 2ish with timeit? |
does this change the build sizes (eg wheels)? |
Wasn't sure how to use timeit for the first benchmark since it measures parallel execution, but a subsequent run of just that yields no changes: asv continuous -f 1.1 upstream/main HEAD -b gil.ParallelRolling.time_rolling
[100.00%] ··· gil.ParallelRolling.time_rolling ok
[100.00%] ··· ======== =============
method
-------- -------------
median 43.2±1ms
mean 1.53±0.03ms
min 2.24±0.01ms
max 2.24±0.02ms
var 2.00±0.03ms
skew 2.75±0.05ms
kurt 3.04±0.02ms
std 2.16±0.01ms
======== =============
BENCHMARKS NOT SIGNIFICANTLY CHANGED. Here's the next few entries on this branch:
versus main 95.5 ns ± 0.888 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
99.2 ns ± 0.836 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each) |
Looks to be 24kb larger on this branch than main |
@@ -1082,13 +1067,13 @@ def roll_min(ndarray[float64_t] values, ndarray[int64_t] start, | |||
return _roll_min_max(values, start, end, minp, is_max=0) | |||
|
|||
|
|||
cdef _roll_min_max(ndarray[numeric_t] values, | |||
cdef _roll_min_max(ndarray[float64_t] values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly we have type specialization built into this module but only ever call these functions with float64_t arrays. -Wunused-function
does not warn about unused inline static functions so this went undetected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, yeah there's an issue to have rolling ops not always cast & return float64 but better to remove this for now
thanks for double-checking. this seems benign to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just a merge conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. cc @jbrockmendel merge when ready
thanks @WillAyd |
Part of #33926 adding
-Winline
to the compiler options for me generates a 14 MB file with 157,000 lines of warnings about how functions aren't getting inlined.I doubt that any of our Cython inline functions actually get inlined, but figured these are easiest to start with. The cython documentation itself states that class-level cdef functions have almost no chance of being inlined
https://cython.readthedocs.io/en/stable/src/userguide/pyrex_differences.html?highlight=inline#cdef-inline