-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Stable Sorting Algorithm for Fillna Indexer #21212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @WillAyd! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 25, 2018 at 23:18 Hours UTC |
On review I don't think this is a bug with NumPy - the default quicksort implementation is just not stable (and documented as such), but the code here assumes it. Learned something new today... FWIW there are a few instances in |
Codecov Report
@@ Coverage Diff @@
## master #21212 +/- ##
=======================================
Coverage 91.84% 91.84%
=======================================
Files 153 153
Lines 49505 49505
=======================================
Hits 45466 45466
Misses 4039 4039
Continue to review full report at Codecov.
|
ASVs below - I obviously didn't improve anything here but seems in line with all other methods before after ratio
[dc02831f] [f6abb614]
- 418±7μs 380±5μs 0.91 groupby.GroupByMethods.time_dtype_as_field('int', 'prod', 'transformation')
- 455±10μs 410±7μs 0.90 groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'direct')
- 180±3μs 162±6μs 0.90 groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'direct')
- 158±0.6μs 141±4μs 0.89 groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'direct')
- 251±8μs 221±4μs 0.88 groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'transformation')
- 471±3μs 414±5μs 0.88 groupby.GroupByMethods.time_dtype_as_field('float', 'nunique', 'transformation')
- 210±3μs 183±9μs 0.88 groupby.GroupByMethods.time_dtype_as_field('int', 'last', 'direct')
- 99.2±2μs 86.3±0.7μs 0.87 groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'direct')
- 251±10μs 217±5μs 0.87 groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'direct')
- 285±8μs 246±1μs 0.86 groupby.GroupByMethods.time_dtype_as_field('float', 'bfill', 'transformation')
- 287±3μs 246±0.8μs 0.86 groupby.GroupByMethods.time_dtype_as_field('float', 'bfill', 'direct')
- 229±4μs 196±7μs 0.85 groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'direct')
- 169±4μs 143±10μs 0.85 groupby.GroupByMethods.time_dtype_as_field('float', 'max', 'transformation')
- 150±1μs 126±10μs 0.84 groupby.GroupByMethods.time_dtype_as_field('float', 'last', 'transformation')
- 184±2ms 154±0.6ms 0.84 groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct')
- 101±1μs 84.1±0.3μs 0.83 groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation')
- 475±9μs 396±20μs 0.83 groupby.GroupByMethods.time_dtype_as_field('float', 'nunique', 'direct')
- 253±1μs 211±1μs 0.83 groupby.GroupByMethods.time_dtype_as_field('int', 'var', 'direct')
- 99.0±4μs 82.4±0.8μs 0.83 groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'transformation')
- 603ms 501±0.03ms 0.83 groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'transformation')
- 622±7μs 515±30μs 0.83 groupby.GroupByMethods.time_dtype_as_field('float', 'cummin', 'direct')
- 225±5μs 186±3μs 0.83 groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'transformation')
- 140±0.9ms 115±0.8ms 0.82 groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'direct')
- 464±10μs 380±3μs 0.82 groupby.GroupByMethods.time_dtype_as_field('int', 'tail', 'transformation')
- 130±1μs 106±8μs 0.82 groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'transformation')
- 100±1μs 81.2±0.2μs 0.81 groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'direct')
- 1.23±0.01ms 1000±10μs 0.81 groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
- 237±5μs 191±3μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'transformation')
- 235±3μs 189±3μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'direct')
- 360±4μs 289±4μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'ffill', 'transformation')
- 257±2μs 206±2μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'var', 'transformation')
- 330±5μs 265±3μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct')
- 189±5ms 151±2ms 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'transformation')
- 485±20μs 388±20μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'nunique', 'transformation')
- 438±4μs 349±1μs 0.80 groupby.GroupByMethods.time_dtype_as_field('int', 'sum', 'direct')
- 100±2μs 79.6±0.8μs 0.79 groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'transformation')
- 95.6±3μs 75.9±5μs 0.79 groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'direct')
- 141±3ms 112±0.6ms 0.79 groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'transformation')
- 603ms 478±0.4ms 0.79 groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'direct')
- 1.26±0.03ms 997±5μs 0.79 groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
- 447±20μs 351±2μs 0.79 groupby.GroupByMethods.time_dtype_as_field('int', 'sum', 'transformation')
- 210±3μs 164±9μs 0.78 groupby.GroupByMethods.time_dtype_as_field('int', 'last', 'transformation')
- 168±5μs 131±4μs 0.78 groupby.GroupByMethods.time_dtype_as_field('float', 'min', 'direct')
- 477±10μs 371±4μs 0.78 groupby.GroupByMethods.time_dtype_as_field('int', 'median', 'direct')
- 93.1±1μs 72.3±4μs 0.78 groupby.GroupByMethods.time_dtype_as_field('int', 'size', 'direct')
- 335±2μs 260±3μs 0.77 groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
- 736±10μs 565±10μs 0.77 groupby.GroupByMethods.time_dtype_as_field('int', 'cumsum', 'direct')
- 200±2μs 153±2μs 0.77 groupby.GroupByMethods.time_dtype_as_field('float', 'median', 'direct')
- 295±4μs 225±1μs 0.76 groupby.GroupByMethods.time_dtype_as_field('float', 'ffill', 'direct')
- 281±5μs 213±1μs 0.76 groupby.GroupByMethods.time_dtype_as_field('float', 'ffill', 'transformation')
- 161±1μs 122±7μs 0.76 groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'transformation')
- 358±4μs 270±3μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'ffill', 'direct')
- 400±8μs 301±2μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'mean', 'direct')
- 486±5μs 365±2μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'nunique', 'direct')
- 922±20μs 690±10μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'cumprod', 'direct')
- 471±10μs 351±5μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'median', 'transformation')
- 1.03±0.02ms 769±7μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
- 131±2μs 97.7±4μs 0.75 groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct')
- 719±20μs 535±2μs 0.74 groupby.GroupByMethods.time_dtype_as_field('int', 'cumsum', 'transformation')
- 1.02±0.01ms 759±6μs 0.74 groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
- 269±7μs 200±0.8μs 0.74 groupby.GroupByMethods.time_dtype_as_field('float', 'cumcount', 'transformation')
- 917±7μs 680±3μs 0.74 groupby.GroupByMethods.time_dtype_as_field('int', 'cumprod', 'transformation')
- 352±2μs 260±0.5μs 0.74 groupby.GroupByMethods.time_dtype_as_field('int', 'bfill', 'direct')
- 261±8μs 193±0.6μs 0.74 groupby.GroupByMethods.time_dtype_as_field('int', 'cumcount', 'transformation')
- 358±2μs 264±6μs 0.74 groupby.GroupByMethods.time_dtype_as_field('int', 'bfill', 'transformation')
- 156±6μs 115±2μs 0.73 groupby.GroupByMethods.time_dtype_as_field('float', 'mean', 'transformation')
- 634±10μs 465±2μs 0.73 groupby.GroupByMethods.time_dtype_as_field('float', 'cumprod', 'transformation')
- 676±20μs 490±10μs 0.73 groupby.GroupByMethods.time_dtype_as_field('float', 'cumsum', 'direct')
- 224±1μs 163±1μs 0.72 groupby.GroupByMethods.time_dtype_as_field('int', 'first', 'direct')
- 741±10μs 536±10μs 0.72 groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'transformation')
- 222±3μs 160±0.7μs 0.72 groupby.GroupByMethods.time_dtype_as_field('int', 'first', 'transformation')
- 102±2μs 73.4±0.2μs 0.72 groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'direct')
- 644±20μs 465±2μs 0.72 groupby.GroupByMethods.time_dtype_as_field('float', 'cumsum', 'transformation')
- 411±4μs 296±2μs 0.72 groupby.GroupByMethods.time_dtype_as_field('int', 'mean', 'transformation')
- 150±3μs 108±0.3μs 0.72 groupby.GroupByMethods.time_dtype_as_field('float', 'last', 'direct')
- 103±0.6μs 73.7±0.8μs 0.72 groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation')
- 595ms 424±4ms 0.71 groupby.GroupByMethods.time_dtype_as_field('float', 'mad', 'direct')
- 713±10μs 508±30μs 0.71 groupby.GroupByMethods.time_dtype_as_field('int', 'cummax', 'direct')
- 162±1μs 115±0.7μs 0.71 groupby.GroupByMethods.time_dtype_as_field('float', 'first', 'direct')
- 1.84s 1.30s 0.71 groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'transformation')
- 184±3μs 130±2μs 0.71 groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'transformation')
- 1.83s 1.29s 0.71 groupby.GroupByMethods.time_dtype_as_field('float', 'describe', 'transformation')
- 162±4μs 115±1μs 0.71 groupby.GroupByMethods.time_dtype_as_field('float', 'first', 'transformation')
- 637±10μs 450±6μs 0.71 groupby.GroupByMethods.time_dtype_as_field('float', 'cummax', 'direct')
- 666±10μs 470±5μs 0.71 groupby.GroupByMethods.time_dtype_as_field('float', 'cumprod', 'direct')
- 273±4μs 192±2μs 0.70 groupby.GroupByMethods.time_dtype_as_field('int', 'cumcount', 'direct')
- 1.84s 1.29s 0.70 groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'direct')
- 476±10μs 333±0.6μs 0.70 groupby.GroupByMethods.time_dtype_as_field('int', 'head', 'direct')
- 447±3μs 311±0.8μs 0.70 groupby.GroupByMethods.time_dtype_as_field('float', 'head', 'transformation')
- 456±10μs 315±1μs 0.69 groupby.GroupByMethods.time_dtype_as_field('float', 'head', 'direct')
- 600ms 414±2ms 0.69 groupby.GroupByMethods.time_dtype_as_field('float', 'mad', 'transformation')
- 1.85s 1.27s 0.69 groupby.GroupByMethods.time_dtype_as_field('float', 'describe', 'direct')
- 166±2μs 113±2μs 0.68 groupby.GroupByMethods.time_dtype_as_field('float', 'mean', 'direct')
- 214±2μs 143±1μs 0.67 groupby.GroupByMethods.time_dtype_as_field('float', 'median', 'transformation')
- 1.08±0.08ms 565±60μs 0.52 groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
- 1.35±0.2ms 605±8μs 0.45 groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'transformation')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. |
thanks @WillAyd |
(cherry picked from commit d30cc74)
(cherry picked from commit d30cc74)
git diff upstream/master -u -- "*.py" | flake8 --diff
Haven't added a whatsnew yet if only because I need to think through the best way to phrase this / investigate where it doesn't work. I have a feeling the default
kind
argument has a bug in it back in NumPy, butmergesort
is guaranteed as a stable sort so may be preferable anyway in spite of the performance hit.https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html#numpy.sort
Will post ASVs later as well - let me know of feedback in the interim