CLN: ASV long and broken benchmarks #19113

Merged (1 commit) on Jan 10, 2018

mroeschke
Member

xref

Reduced the size of the data in GroupByMethods, since some benchmarks took 30+ seconds on my machine, and fixed some broken benchmarks.

$ asv dev -b ^groupby.GroupByMethods
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupByMethods.time_method                     ok
[100.00%] ···· 
               ======= ============== ========
                dtype      method             
               ------- -------------- --------
                 int        all        256ms  
                 int        any        255ms  
                 int       count       925μs  
                 int      cumcount     1.15ms 
                 int       cummax      1.13ms 
                 int       cummin      1.15ms 
                 int      cumprod      1.58ms 
                 int       cumsum      1.16ms 
                 int      describe     3.25s  
                 int       first       1.12ms 
                 int        head       1.37ms 
                 int        last       1.12ms 
                 int        mad        1.42s  
                 int        max        1.12ms 
                 int        min        1.16ms 
                 int       median      1.53ms 
                 int        mean       1.43ms 
                 int      nunique      1.40ms 
                 int     pct_change    1.56s  
                 int        prod       1.53ms 
                 int        rank       380ms  
                 int        sem        414ms  
                 int       shift       974μs  
                 int        size       858μs  
                 int        skew       414ms  
                 int        std        1.46ms 
                 int        sum        1.50ms 
                 int        tail       1.45ms 
                 int       unique      289ms  
                 int    value_counts   2.35ms 
                 int        var        1.34ms 
                float       all        402ms  
                float       any        406ms  
                float      count       1.18ms 
                float     cumcount     1.33ms 
                float      cummax      1.40ms 
                float      cummin      1.40ms 
                float     cumprod      1.75ms 
                float      cumsum      1.40ms 
                float     describe     5.02s  
                float      first       1.37ms 
                float       head       1.58ms 
                float       last       1.36ms 
                float       mad        2.01s  
                float       max        1.38ms 
                float       min        1.37ms 
                float      median      1.80ms 
                float       mean       1.79ms 
                float     nunique      1.60ms 
                float    pct_change    2.17s  
                float       prod       1.75ms 
                float       rank       623ms  
                float       sem        416ms  
                float      shift       1.18ms 
                float       size       1.09ms 
                float       skew       646ms  
                float       std        1.51ms 
                float       sum        1.77ms 
                float       tail       1.63ms 
                float      unique      457ms  
                float   value_counts   2.63ms 
                float       var        1.43ms 
               ======= ============== ========
$ asv dev -b ^groupby.GroupStrings
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupStrings.time_multi_columns             781ms
$ asv dev -b ^frame_methods.ToHTML
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running frame_methods.ToHTML.time_to_html_mixed             565ms
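
For readers unfamiliar with asv, the dtype/method table above comes from a parametrized benchmark class. The following is only a minimal sketch of that pattern, not the code changed in this PR: the class name `GroupByMethodsSketch`, the way `dtype` is applied to the key column, the 2000-row size, and the short method list are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


class GroupByMethodsSketch:
    # asv runs time_method once for every (dtype, method) combination.
    param_names = ["dtype", "method"]
    # Only a few of the methods from the table are listed (illustrative subset).
    params = [["int", "float"], ["count", "sum", "describe", "pct_change"]]

    def setup(self, dtype, method):
        n_rows = 2000  # assumed size, kept small so slow methods finish quickly
        rng = np.random.RandomState(0)
        key = rng.randint(0, n_rows, size=n_rows)
        if dtype == "float":
            # Assumption: the dtype parameter controls the grouping key's dtype.
            key = key.astype("float64")
        values = rng.randint(0, 1000, size=n_rows)
        self.df = pd.DataFrame({"key": key, "values": values})
        # Look up the bound method in setup so only the call itself is timed.
        self.grouped_method = getattr(self.df.groupby("key")["values"], method)

    def time_method(self, dtype, method):
        self.grouped_method()
```

Because every method runs against the same frame, shrinking `n_rows` in `setup` shortens all parameter combinations at once.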

@codecov

codecov bot commented Jan 7, 2018

Codecov Report

Merging #19113 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #19113   +/-   ##
=======================================
  Coverage   91.51%   91.51%           
=======================================
  Files         148      148           
  Lines       48753    48753           
=======================================
  Hits        44616    44616           
  Misses       4137     4137
Flag Coverage Δ
#multiple 89.88% <ø> (ø) ⬆️
#single 41.6% <ø> (ø) ⬆️

Powered by Codecov. Last update 36a71eb...5898212.

@jreback
Contributor

jreback commented Jan 7, 2018

Can you reduce the data size for pct_change, mad, and describe so their timings are in line with the others? Or are they just exponentially slower?

@jreback added the Benchmark label (Performance (ASV) benchmarks) Jan 7, 2018
@mroeschke
Member Author

mroeschke commented Jan 9, 2018

I believe mad, pct_change, and describe are just slower groupby methods: mad and pct_change are in the _common_apply_whitelist, while describe is implemented via apply().

In [6]: df.head()  # 2000 rows x 2 columns
Out[6]: 
    key  values
0  1271     534
1   654     195
2   758     336
3   428     298
4   254     531

In [7]: %timeit df.groupby('key')['values'].describe()
1 loop, best of 3: 3.12 s per loop

In [8]: %timeit df.groupby('key')['values'].pct_change()
1 loop, best of 3: 1.09 s per loop

In [9]: %timeit df.groupby('key')['values'].mad()
1 loop, best of 3: 991 ms per loop

In [10]: %timeit df.groupby('key')['values'].sem() #  specifically defined
100 loops, best of 3: 2.21 ms per loop

In [11]: %timeit df.groupby('key')['values'].prod() #  specifically defined
100 loops, best of 3: 2.25 ms per loop
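
To illustrate why an apply()-based path tends to be slower than a specifically defined method, here is a rough sketch that computes the same per-group sum both ways and times each. It does not reproduce pandas internals: `sum` simply stands in for any method, and the frame is random data of an assumed shape (2000 rows, integer `key` and `values`), not the one timed above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "key": rng.randint(0, 2000, size=2000),
    "values": rng.randint(0, 1000, size=2000),
})
grouped = df.groupby("key")["values"]


def via_dedicated():
    # Dedicated groupby reduction.
    return grouped.sum()


def via_apply():
    # Same reduction, but calling a Python function once per group.
    return grouped.apply(lambda s: s.sum())


# Both paths should agree on the result; only the cost differs.
same_result = np.array_equal(via_dedicated().values, via_apply().values)

# Best of three single runs, in seconds.
t_dedicated = min(timeit.repeat(via_dedicated, number=1, repeat=3))
t_apply = min(timeit.repeat(via_apply, number=1, repeat=3))
print("same result:", same_result)
print("dedicated: %.2f ms, apply: %.2f ms" % (t_dedicated * 1e3, t_apply * 1e3))
```

With many small groups, the per-group Python call overhead usually dominates the apply() timing, but the exact numbers depend on the machine and the pandas version.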

@mroeschke
Member Author

I believe it's better to keep the size of the data equivalent for all the groupby methods so it's easier to identify which methods (like mad, pct_change) might need a performance boost.
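
A small sketch of that idea: time several groupby methods on one shared frame and list the slowest first. The helper name `rank_groupby_methods`, the data shape, and the method list are hypothetical and chosen only for illustration.

```python
import timeit

import numpy as np
import pandas as pd


def rank_groupby_methods(methods, n_rows=2000, repeat=3):
    """Time each named SeriesGroupBy method on the same data, slowest first."""
    rng = np.random.RandomState(0)
    df = pd.DataFrame({
        "key": rng.randint(0, n_rows, size=n_rows),
        "values": rng.randint(0, 1000, size=n_rows),
    })
    grouped = df.groupby("key")["values"]
    timings = {}
    for name in methods:
        func = getattr(grouped, name)
        # Best of `repeat` single runs, in seconds.
        timings[name] = min(timeit.repeat(func, number=1, repeat=repeat))
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


ranking = rank_groupby_methods(["count", "sum", "describe", "pct_change"])
for name, seconds in ranking:
    print("%-12s %.2f ms" % (name, seconds * 1e3))
```

Since the data size is identical for every method, any large gap in the ranking points at the method itself rather than at its input.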

@jreback
Contributor

jreback commented Jan 10, 2018

@mroeschke OK, great. Thanks for the comment. Yes, some of these are non-performant.

@jreback
Contributor

jreback commented Jan 10, 2018

If you want to go through the non-performant ones and check whether we already have issues for them, that would be great (I know we have some, but ideally we could just create a single master issue).

@jreback added this to the 0.23.0 milestone Jan 10, 2018
@jreback merged commit 7ea5179 into pandas-dev:master Jan 10, 2018
@mroeschke deleted the asv_cleanups branch January 10, 2018 00:30
maximveksler pushed a commit to maximveksler/pandas that referenced this pull request Jan 11, 2018