REF: de-duplicate code for pad/backfill #40306


Closed

Conversation

simonjayhawkins (Member)

       before           after         ratio
     [5ef41593]       [6a7ac9e4]
     <master>         <interpolate_2d>
+     3.44±0.04ms       5.14±0.4ms     1.50  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'Float64')
+      3.48±0.1ms       5.06±0.3ms     1.45  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'Int64')
+     3.32±0.05ms      4.59±0.05ms     1.38  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'Int64')
+     3.30±0.05ms       4.42±0.1ms     1.34  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'Float64')
+     3.37±0.03ms       4.43±0.1ms     1.32  frame_methods.Fillna.time_frame_fillna(False, 'pad', 'Int64')
+     3.34±0.05ms      4.30±0.08ms     1.29  frame_methods.Fillna.time_frame_fillna(False, 'pad', 'Float64')
+     2.15±0.04ms       2.48±0.2ms     1.15  frame_methods.Fillna.time_frame_fillna(True, 'bfill', 'Float64')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

if the general idea of this PR is acceptable, I could try to improve these numbers, but they were already significantly improved in #39953.

with the cython functions reduced from 4 to 1, it becomes feasible to add a single dedicated function for the no-limit case (my numba branch is quicker than cython with that change) instead of 4 extra functions.
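To illustrate the de-duplication idea: a backfill is just a pad applied to reversed views, so one implementation can serve both directions. This is a minimal Python sketch with illustrative names (`pad_or_backfill`, the `values`/`mask` pair), not pandas' actual internal API:

```python
import numpy as np

def pad_or_backfill(values, mask, direction="pad", limit=None):
    """Fill masked entries in place; backfill reuses pad on reversed views.

    Illustrative stand-in for the array + missing-value mask the pandas
    internals pass around; not the real pandas signature.
    """
    if direction == "backfill":
        # reversing both views turns a backfill into a pad
        pad_or_backfill(values[::-1], mask[::-1], direction="pad", limit=limit)
        return
    lim = len(values) if limit is None else limit
    fill_count = 0
    have_value = False
    val = 0.0
    for i in range(len(values)):
        if not mask[i]:
            val = values[i]          # remember the last valid value
            have_value = True
            fill_count = 0           # limit resets at each valid value
        elif have_value and fill_count < lim:
            values[i] = val
            mask[i] = False          # filled positions are no longer missing
            fill_count += 1
```

Because the direction is handled by a view reversal rather than duplicated loops, limit handling only has to be written (and tested) once.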

also may pave the way to add limit_area and limit_direction support to EAs

and maybe also for 2D masked arrays? cc @jbrockmendel

having a single cython function, I think, also paves the way for a cython version of the limit_area handling, which would remove the python set logic; that could be investigated.

@simonjayhawkins added the Refactor (Internal refactoring of code) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels on Mar 8, 2021
@jreback (Contributor) commented Mar 8, 2021

looks good, +1 on this general idea to unify things.


def _datetimelike_compat(func: F) -> F:
Member:

can we keep datetimelike_compat and apply it to interpolate_2d? (I learned in #40294 that ATM interpolate_2d doesn't need it, because the wrapper is used in the functions that this PR removes)
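For readers outside the internals: a wrapper like this typically views datetimelike data as int64 before the fill and restores the dtype afterwards, so the fill code only ever sees numeric arrays. The sketch below is hypothetical, showing only the dtype round-trip, not pandas' actual `_datetimelike_compat`:

```python
import functools
import numpy as np

def datetimelike_compat(func):
    """Hypothetical sketch of a datetimelike-compat wrapper.

    Views datetime64/timedelta64 input as int64 before calling the
    wrapped function and restores the original dtype on the way out.
    """
    @functools.wraps(func)
    def wrapper(values, *args, **kwargs):
        if values.dtype.kind in "mM":  # timedelta64 / datetime64
            result = func(values.view("i8"), *args, **kwargs)
            return result.view(values.dtype)
        return func(values, *args, **kwargs)
    return wrapper
```

With this shape, the same fill kernels serve float, int, and datetimelike columns; NaT positions travel through as the int64 sentinel.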

Member Author:

can do. although, as stated recently, from my experience experimenting with numba, these wrappers would probably be best placed in _libs adjacent to the cython code, to give a cleaner api and to move the conversions needed for the cython code out of the python code.

Member:

I'm looking forward to seeing this; increasingly positive on the idea of doing more numba and less cython

Member Author:

slow down, we've not discussed this yet. My primary motivations are noarch releases, fully typed libraries without stubs, and utilization of more CPU cores and maybe GPU with standard pandas (AFAIK cudf/rapids are planning to rely on WSL2 in order to support Windows, rather than supporting Windows natively).

I've only just started, so don't know if this is a realistic goal or whether a more incremental approach of using numba when installed for parallelization is more feasible.

Member:

i reserve the right to be enthused about this idea

Member Author:

sure. i'm enthusiastic too at this early stage...

  • Inline typing compatible with mypy
  • Faster local development cycle
  • Simpler development environment for new contributors
  • No wheel building, maybe faster release cycles
  • Maybe can add some basic gpu support in future
  • Benefits of numba
  • Less complex code
  • No need for some prep functions, casting, asi8, ensure_int/float etc
  • Style checks apply to library code.
  • Let numba handle types and type conversions
  • Let numba handle memory allocation and deallocation
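The "use numba when installed" incremental approach mentioned in this thread can be sketched with an optional import and a no-op fallback decorator, so the same source runs with or without numba. The names here are illustrative, not a proposed pandas API:

```python
import numpy as np

try:
    from numba import njit  # optional speedup when numba is installed
except ImportError:
    def njit(func):  # no-op fallback: run as plain Python
        return func

@njit
def pad_inplace(values, mask):
    """Forward-fill values in place wherever mask is True.

    Because each filled position clears its mask entry, consecutive
    missing values are filled by chaining off the previous iteration.
    """
    for i in range(1, len(values)):
        if mask[i] and not mask[i - 1]:
            values[i] = values[i - 1]
            mask[i] = False
```

The loop body is plain, type-stable Python over numpy arrays, which is exactly the shape numba's nopython mode compiles well; no asi8/ensure_float-style prep functions are needed.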

for the basic functions I've looked at so far, numba's performance seems comparable to cython's when each is compared against a pure python implementation (but I've not yet touched anything where we have C classes)

Contributor:

oh, numba performance can often be much better than cython's, and usually not worse. The issue we have is that we cannot move much of the core functions over, as we do not have a strict dependency on numba at this point. We can certainly revisit things going forward.

@jorisvandenbossche (Member)

@simonjayhawkins does the slowdown come from the cython changes (removing the 1D version) or from other changes?

@simonjayhawkins (Member Author)

> @simonjayhawkins does the slowdown come from the cython changes (removing the 1D version) or from other changes?

not profiled yet. (the slowdown is negligible compared to the object conversion that was done up to a few days ago!)

I will profile shortly.

but if it is due to the cython code, we could consolidate around the 1d versions and use np.apply_along_axis for 2d instead.
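A sketch of that approach (the numpy function is np.apply_along_axis; note it returns a new array rather than filling in place, so an in-place variant would loop over slices explicitly). Names are illustrative:

```python
import numpy as np

def pad_1d(col):
    """Return a forward-filled copy of a 1-D float array."""
    out = col.copy()
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]
    return out

def pad_2d(arr, axis=0):
    """Apply the 1-D pad along the given axis of a 2-D array.

    np.apply_along_axis calls pad_1d on each 1-D slice taken along
    `axis` and reassembles the results into a new array.
    """
    return np.apply_along_axis(pad_1d, axis, arr)
```

This keeps a single 1-D kernel as the source of truth, at the cost of the per-slice Python call overhead that apply_along_axis incurs.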

I consolidated around the 2d here since I think that can be parallelized with numba, but that shouldn't be a consideration here.

there was no performance impact up to the last commit, so I may split those changes into a separate PR to keep things manageable.

It could be that some extra checking in interpolate_2d is responsible for the slowdown. I found that calling pd.isna on the mask after the transformations, rather than before them, impacts performance.

@jorisvandenbossche (Member)

> (the slowdown is negligible compared to the object conversion that was done up to a few days ago!)

Yes, but the baseline for comparison should probably not be the slow object-conversion way from before, but numpy numeric dtypes (since the goal is to replace e.g. float64 with Float64)

@simonjayhawkins (Member Author)

> Yes, but the baseline for comparison should probably not be the slow object conversion way from before, but comparing with numpy numeric dtypes (since it's the goal to replace eg float64 with Float64)

agreed. I don't expect this PR to be merged/acceptable with the performance degradation.

when changing to avoid the object conversion, we always update the mask in the 1d version; this will also be needed in the 2d version if masked arrays are to be backed by 2d arrays, unless we make the mask update conditional or change that strategy.
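To make the mask-update point concrete: a masked array pairs a values buffer with a boolean mask, and a 2-D fill must flip mask entries as it writes values, otherwise the filled positions still read as missing. A minimal illustrative sketch (not the pandas internals):

```python
import numpy as np

def pad_2d_masked(values, mask):
    """Forward-fill a 2-D masked array along axis 1, updating the mask.

    Both arrays are modified in place; clearing mask[i, j] is what
    makes the filled value visible to consumers of the masked array.
    """
    n_rows, n_cols = values.shape
    for i in range(n_rows):
        for j in range(1, n_cols):
            if mask[i, j] and not mask[i, j - 1]:
                values[i, j] = values[i, j - 1]
                mask[i, j] = False
```

Making the mask update conditional (as suggested above) would mean only running the second write when the caller actually needs the updated mask.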

@simonjayhawkins simonjayhawkins marked this pull request as draft March 10, 2021 15:19
@simonjayhawkins (Member Author)

> Yes, but the baseline for comparison should probably not be the slow object conversion way from before, but comparing with numpy numeric dtypes (since it's the goal to replace eg float64 with Float64)

closing for now to clear queue. will revisit after #38992.

@simonjayhawkins (Member Author)

> closing for now to clear queue. will revisit after #38992.

while experimenting with numba in #40530, I'm periodically running the benchmark suite. I don't know how best to do this, as some benchmarks show a significant degradation that is likely due to compilation; testing the functions in a Jupyter notebook tends to show performance similar to cython's. However, the benchmarks that show a performance improvement likely indicate areas where improvements to the cython code could be investigated.
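One plain-Python way to keep JIT compilation out of the numbers (an illustrative helper, not how asv or pandas benchmarks work):

```python
import time

def time_after_warmup(func, *args, repeats=10):
    """Time func(*args), excluding one-off costs such as JIT compilation.

    The first call absorbs numba's compile time; the best of the
    remaining calls approximates steady-state performance.
    """
    func(*args)  # warm-up call, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```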

This PR is part of that process. Below are the benchmarks that improve with #40530 (as of yesterday).

As I make progress with #40530, this output will change and the priorities will undoubtedly change.

-      10.7±0.2ms       9.65±0.3ms     0.90  join_merge.Concat.time_concat_series(0)
-      21.7±0.4ms       19.4±0.1ms     0.90  algos.isin.IsInLongSeriesLookUpDominates.time_isin('float32', 5, 'monotone_hits')
-      7.50±0.7μs       6.70±0.3μs     0.89  index_cached_properties.IndexCache.time_engine('UInt64Index')
-      23.9±0.1ms         21.3±1ms     0.89  groupby.Nth.time_series_nth_all('float32')
-      24.0±0.2ms       21.3±0.9ms     0.89  groupby.Nth.time_series_nth_any('float32')
-     3.74±0.02ms      3.09±0.04ms     0.83  frame_methods.Fillna.time_frame_fillna(False, 'pad', 'Int64')
-     3.76±0.08ms      3.01±0.03ms     0.80  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'Int64')
-     22.2±0.06ms       17.5±0.2ms     0.79  reshape.Crosstab.time_crosstab
-     4.10±0.03ms      3.23±0.04ms     0.79  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'Int64')
-     4.11±0.02ms      3.22±0.09ms     0.78  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'Float64')
-     3.89±0.04ms      2.99±0.05ms     0.77  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'Float64')
-      3.30±0.8μs       2.30±0.2μs     0.70  index_cached_properties.IndexCache.time_values('UInt64Index')
-      5.81±0.1ms      3.70±0.03ms     0.64  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'datetime64[ns, tz]')
