Skip to content

Stable Sorting Algorithm for Fillna Indexer #21212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 29, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented May 25, 2018

Haven't added a whatsnew yet if only because I need to think through the best way to phrase this / investigate where it doesn't work. I have a feeling the default kind argument has a bug in it back in NumPy, but mergesort is guaranteed as a stable sort so may be preferable anyway in spite of the performance hit.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html#numpy.sort

Will post ASVs later as well - let me know of feedback in the interim

@pep8speaks
Copy link

pep8speaks commented May 25, 2018

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 25, 2018 at 23:18 Hours UTC

@WillAyd
Copy link
Member Author

WillAyd commented May 25, 2018

On review I don't think this is a bug with NumPy - the default quicksort implementation is just not stable (and documented as such), but the code here assumes it. Learned something new today...

FWIW there are a few instances in hashtable.pyx and index.pyx which use the default quicksort implementation. Not saying these are an issue but may be worth closer review if we notice strange behavior elsewhere

@codecov
Copy link

codecov bot commented May 25, 2018

Codecov Report

Merging #21212 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21212   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49505    49505           
=======================================
  Hits        45466    45466           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.23% <ø> (ø) ⬆️
#single 41.88% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6abb61...617ae08. Read the comment docs.

@WillAyd WillAyd added Groupby Bug Regression Functionality that used to work in a prior pandas version labels May 25, 2018
@WillAyd
Copy link
Member Author

WillAyd commented May 26, 2018

ASVs below - I obviously didn't improve anything here but seems in line with all other methods

       before           after         ratio
     [dc02831f]       [f6abb614]
-         418±7μs          380±5μs     0.91  groupby.GroupByMethods.time_dtype_as_field('int', 'prod', 'transformation')
-        455±10μs          410±7μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'direct')
-         180±3μs          162±6μs     0.90  groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'direct')
-       158±0.6μs          141±4μs     0.89  groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'direct')
-         251±8μs          221±4μs     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'transformation')
-         471±3μs          414±5μs     0.88  groupby.GroupByMethods.time_dtype_as_field('float', 'nunique', 'transformation')
-         210±3μs          183±9μs     0.88  groupby.GroupByMethods.time_dtype_as_field('int', 'last', 'direct')
-        99.2±2μs       86.3±0.7μs     0.87  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'direct')
-        251±10μs          217±5μs     0.87  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'direct')
-         285±8μs          246±1μs     0.86  groupby.GroupByMethods.time_dtype_as_field('float', 'bfill', 'transformation')
-         287±3μs        246±0.8μs     0.86  groupby.GroupByMethods.time_dtype_as_field('float', 'bfill', 'direct')
-         229±4μs          196±7μs     0.85  groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'direct')
-         169±4μs         143±10μs     0.85  groupby.GroupByMethods.time_dtype_as_field('float', 'max', 'transformation')
-         150±1μs         126±10μs     0.84  groupby.GroupByMethods.time_dtype_as_field('float', 'last', 'transformation')
-         184±2ms        154±0.6ms     0.84  groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct')
-         101±1μs       84.1±0.3μs     0.83  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation')
-         475±9μs         396±20μs     0.83  groupby.GroupByMethods.time_dtype_as_field('float', 'nunique', 'direct')
-         253±1μs          211±1μs     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'var', 'direct')
-        99.0±4μs       82.4±0.8μs     0.83  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'transformation')
-           603ms       501±0.03ms     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'transformation')
-         622±7μs         515±30μs     0.83  groupby.GroupByMethods.time_dtype_as_field('float', 'cummin', 'direct')
-         225±5μs          186±3μs     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'transformation')
-       140±0.9ms        115±0.8ms     0.82  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'direct')
-        464±10μs          380±3μs     0.82  groupby.GroupByMethods.time_dtype_as_field('int', 'tail', 'transformation')
-         130±1μs          106±8μs     0.82  groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'transformation')
-         100±1μs       81.2±0.2μs     0.81  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'direct')
-     1.23±0.01ms        1000±10μs     0.81  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
-         237±5μs          191±3μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'transformation')
-         235±3μs          189±3μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'direct')
-         360±4μs          289±4μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'ffill', 'transformation')
-         257±2μs          206±2μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'var', 'transformation')
-         330±5μs          265±3μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct')
-         189±5ms          151±2ms     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'transformation')
-        485±20μs         388±20μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'nunique', 'transformation')
-         438±4μs          349±1μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'sum', 'direct')
-         100±2μs       79.6±0.8μs     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'transformation')
-        95.6±3μs         75.9±5μs     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'direct')
-         141±3ms        112±0.6ms     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'transformation')
-           603ms        478±0.4ms     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'direct')
-     1.26±0.03ms          997±5μs     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
-        447±20μs          351±2μs     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'sum', 'transformation')
-         210±3μs          164±9μs     0.78  groupby.GroupByMethods.time_dtype_as_field('int', 'last', 'transformation')
-         168±5μs          131±4μs     0.78  groupby.GroupByMethods.time_dtype_as_field('float', 'min', 'direct')
-        477±10μs          371±4μs     0.78  groupby.GroupByMethods.time_dtype_as_field('int', 'median', 'direct')
-        93.1±1μs         72.3±4μs     0.78  groupby.GroupByMethods.time_dtype_as_field('int', 'size', 'direct')
-         335±2μs          260±3μs     0.77  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
-        736±10μs         565±10μs     0.77  groupby.GroupByMethods.time_dtype_as_field('int', 'cumsum', 'direct')
-         200±2μs          153±2μs     0.77  groupby.GroupByMethods.time_dtype_as_field('float', 'median', 'direct')
-         295±4μs          225±1μs     0.76  groupby.GroupByMethods.time_dtype_as_field('float', 'ffill', 'direct')
-         281±5μs          213±1μs     0.76  groupby.GroupByMethods.time_dtype_as_field('float', 'ffill', 'transformation')
-         161±1μs          122±7μs     0.76  groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'transformation')
-         358±4μs          270±3μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'ffill', 'direct')
-         400±8μs          301±2μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'mean', 'direct')
-         486±5μs          365±2μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'nunique', 'direct')
-        922±20μs         690±10μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'cumprod', 'direct')
-        471±10μs          351±5μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'median', 'transformation')
-     1.03±0.02ms          769±7μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
-         131±2μs         97.7±4μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct')
-        719±20μs          535±2μs     0.74  groupby.GroupByMethods.time_dtype_as_field('int', 'cumsum', 'transformation')
-     1.02±0.01ms          759±6μs     0.74  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
-         269±7μs        200±0.8μs     0.74  groupby.GroupByMethods.time_dtype_as_field('float', 'cumcount', 'transformation')
-         917±7μs          680±3μs     0.74  groupby.GroupByMethods.time_dtype_as_field('int', 'cumprod', 'transformation')
-         352±2μs        260±0.5μs     0.74  groupby.GroupByMethods.time_dtype_as_field('int', 'bfill', 'direct')
-         261±8μs        193±0.6μs     0.74  groupby.GroupByMethods.time_dtype_as_field('int', 'cumcount', 'transformation')
-         358±2μs          264±6μs     0.74  groupby.GroupByMethods.time_dtype_as_field('int', 'bfill', 'transformation')
-         156±6μs          115±2μs     0.73  groupby.GroupByMethods.time_dtype_as_field('float', 'mean', 'transformation')
-        634±10μs          465±2μs     0.73  groupby.GroupByMethods.time_dtype_as_field('float', 'cumprod', 'transformation')
-        676±20μs         490±10μs     0.73  groupby.GroupByMethods.time_dtype_as_field('float', 'cumsum', 'direct')
-         224±1μs          163±1μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'first', 'direct')
-        741±10μs         536±10μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'transformation')
-         222±3μs        160±0.7μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'first', 'transformation')
-         102±2μs       73.4±0.2μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'direct')
-        644±20μs          465±2μs     0.72  groupby.GroupByMethods.time_dtype_as_field('float', 'cumsum', 'transformation')
-         411±4μs          296±2μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'mean', 'transformation')
-         150±3μs        108±0.3μs     0.72  groupby.GroupByMethods.time_dtype_as_field('float', 'last', 'direct')
-       103±0.6μs       73.7±0.8μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation')
-           595ms          424±4ms     0.71  groupby.GroupByMethods.time_dtype_as_field('float', 'mad', 'direct')
-        713±10μs         508±30μs     0.71  groupby.GroupByMethods.time_dtype_as_field('int', 'cummax', 'direct')
-         162±1μs        115±0.7μs     0.71  groupby.GroupByMethods.time_dtype_as_field('float', 'first', 'direct')
-           1.84s            1.30s     0.71  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'transformation')
-         184±3μs          130±2μs     0.71  groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'transformation')
-           1.83s            1.29s     0.71  groupby.GroupByMethods.time_dtype_as_field('float', 'describe', 'transformation')
-         162±4μs          115±1μs     0.71  groupby.GroupByMethods.time_dtype_as_field('float', 'first', 'transformation')
-        637±10μs          450±6μs     0.71  groupby.GroupByMethods.time_dtype_as_field('float', 'cummax', 'direct')
-        666±10μs          470±5μs     0.71  groupby.GroupByMethods.time_dtype_as_field('float', 'cumprod', 'direct')
-         273±4μs          192±2μs     0.70  groupby.GroupByMethods.time_dtype_as_field('int', 'cumcount', 'direct')
-           1.84s            1.29s     0.70  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'direct')
-        476±10μs        333±0.6μs     0.70  groupby.GroupByMethods.time_dtype_as_field('int', 'head', 'direct')
-         447±3μs        311±0.8μs     0.70  groupby.GroupByMethods.time_dtype_as_field('float', 'head', 'transformation')
-        456±10μs          315±1μs     0.69  groupby.GroupByMethods.time_dtype_as_field('float', 'head', 'direct')
-           600ms          414±2ms     0.69  groupby.GroupByMethods.time_dtype_as_field('float', 'mad', 'transformation')
-           1.85s            1.27s     0.69  groupby.GroupByMethods.time_dtype_as_field('float', 'describe', 'direct')
-         166±2μs          113±2μs     0.68  groupby.GroupByMethods.time_dtype_as_field('float', 'mean', 'direct')
-         214±2μs          143±1μs     0.67  groupby.GroupByMethods.time_dtype_as_field('float', 'median', 'transformation')
-     1.08±0.08ms         565±60μs     0.52  groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
-      1.35±0.2ms          605±8μs     0.45  groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback added this to the 0.23.1 milestone May 29, 2018
@jreback jreback merged commit d30cc74 into pandas-dev:master May 29, 2018
@jreback
Copy link
Contributor

jreback commented May 29, 2018

thanks @WillAyd

@WillAyd WillAyd deleted the stable-grp-fill branch May 29, 2018 03:38
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018
@WillAyd WillAyd mentioned this pull request Jun 8, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jun 9, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: incorrect groupby().ffill() in pandas 0.23.0
4 participants