CLN: ASV long and broken benchmarks #19113

mroeschke · 2018-01-07T06:26:28Z

Reduced the size of the data in GroupByMethods, since some benchmarks took more than 30 seconds on my machine, and fixed some broken benchmarks.

$ asv dev -b ^groupby.GroupByMethods
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupByMethods.time_method                     ok
[100.00%] ···· 
               ======= ============== ========
                dtype      method             
               ------- -------------- --------
                 int        all        256ms  
                 int        any        255ms  
                 int       count       925μs  
                 int      cumcount     1.15ms 
                 int       cummax      1.13ms 
                 int       cummin      1.15ms 
                 int      cumprod      1.58ms 
                 int       cumsum      1.16ms 
                 int      describe     3.25s  
                 int       first       1.12ms 
                 int        head       1.37ms 
                 int        last       1.12ms 
                 int        mad        1.42s  
                 int        max        1.12ms 
                 int        min        1.16ms 
                 int       median      1.53ms 
                 int        mean       1.43ms 
                 int      nunique      1.40ms 
                 int     pct_change    1.56s  
                 int        prod       1.53ms 
                 int        rank       380ms  
                 int        sem        414ms  
                 int       shift       974μs  
                 int        size       858μs  
                 int        skew       414ms  
                 int        std        1.46ms 
                 int        sum        1.50ms 
                 int        tail       1.45ms 
                 int       unique      289ms  
                 int    value_counts   2.35ms 
                 int        var        1.34ms 
                float       all        402ms  
                float       any        406ms  
                float      count       1.18ms 
                float     cumcount     1.33ms 
                float      cummax      1.40ms 
                float      cummin      1.40ms 
                float     cumprod      1.75ms 
                float      cumsum      1.40ms 
                float     describe     5.02s  
                float      first       1.37ms 
                float       head       1.58ms 
                float       last       1.36ms 
                float       mad        2.01s  
                float       max        1.38ms 
                float       min        1.37ms 
                float      median      1.80ms 
                float       mean       1.79ms 
                float     nunique      1.60ms 
                float    pct_change    2.17s  
                float       prod       1.75ms 
                float       rank       623ms  
                float       sem        416ms  
                float      shift       1.18ms 
                float       size       1.09ms 
                float       skew       646ms  
                float       std        1.51ms 
                float       sum        1.77ms 
                float       tail       1.63ms 
                float      unique      457ms  
                float   value_counts   2.63ms 
                float       var        1.43ms 
               ======= ============== ========

$ asv dev -b ^groupby.GroupStrings
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupStrings.time_multi_columns             781ms

$ asv dev -b ^frame_methods.ToHTML
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running frame_methods.ToHTML.time_to_html_mixed             565ms
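
For readers unfamiliar with how asv produces a dtype-by-method table like the one above, here is a minimal sketch of a parametrized benchmark class. It is not the code changed in this PR: the class name `GroupByMethodsSketch`, the `ngroups` and `size` values, and the shortened method list are assumptions made for illustration. Only the asv conventions (`params`, `param_names`, `setup`, and `time_*` methods) are taken from asv itself.

```python
import numpy as np
import pandas as pd


class GroupByMethodsSketch:
    """Hypothetical stand-in for a parametrized asv groupby benchmark."""

    # asv calls setup() and each time_* method once per combination of
    # the parameter lists below, passing the values positionally.
    param_names = ["dtype", "method"]
    params = [["int", "float"], ["sum", "describe", "pct_change"]]

    def setup(self, dtype, method):
        ngroups = 100        # assumed value, not the one used in the PR
        size = ngroups * 2   # a small frame keeps slow methods tolerable
        rng = np.random.RandomState(1234)
        key = rng.randint(0, ngroups, size=size)
        values = rng.randint(0, size, size=size)
        if dtype == "float":
            values = values.astype(float)
        df = pd.DataFrame({"key": key, "values": values})
        # Resolve the bound method here so only the call itself is timed.
        self.df_groupby_method = getattr(df.groupby("key")["values"], method)

    def time_method(self, dtype, method):
        self.df_groupby_method()
```

Because every parameter combination shares the same `setup`, shrinking `size` there shortens all of the timings at once, which is the kind of change the description refers to.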

codecov · 2018-01-07T08:22:55Z

Codecov Report

Merging #19113 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #19113   +/-   ##
=======================================
  Coverage   91.51%   91.51%           
=======================================
  Files         148      148           
  Lines       48753    48753           
=======================================
  Hits        44616    44616           
  Misses       4137     4137

Flag	Coverage Δ
#multiple	`89.88% <ø> (ø)`	⬆️
#single	`41.6% <ø> (ø)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36a71eb...5898212.

jreback · 2018-01-07T15:13:41Z

Can you reduce the size of pct_change, mad, and describe to be in line with the others? Or are they just exponentially slower?

mroeschke · 2018-01-09T05:04:15Z

I believe mad, pct_change, and describe are just slower groupby methods. mad and pct_change are in the _common_apply_whitelist, while describe is implemented via apply().
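
To illustrate what going through apply() means in practice, here is a small sketch. It is not pandas internals: `per_group_sum` and the `calls` counter are hypothetical helpers, and the five-row frame is made up. It only shows that apply() invokes a Python function at least once per group, whereas a specifically defined method such as sum() computes the same result without a user-level call per group.

```python
import pandas as pd

# Tiny made-up frame with three groups.
df = pd.DataFrame({"key": [1, 1, 2, 2, 3], "values": [10, 20, 30, 40, 50]})
grouped = df.groupby("key")["values"]

calls = {"n": 0}  # counts how often the Python-level function runs


def per_group_sum(group):
    """Hypothetical helper: sum one group's values and record the call."""
    calls["n"] += 1
    return group.sum()


result_apply = grouped.apply(per_group_sum)  # one Python call per group (or more)
result_builtin = grouped.sum()               # dedicated groupby implementation

print(calls["n"], "Python-level calls for", grouped.ngroups, "groups")
print(result_apply.tolist() == result_builtin.tolist())
```

With thousands of groups, that per-group Python overhead is one plausible reason for timings in seconds rather than milliseconds.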

In [6]: df.head() #2000 rows x 2 columns
Out[6]: 
    key  values
0  1271     534
1   654     195
2   758     336
3   428     298
4   254     531

In [7]: %timeit df.groupby('key')['values'].describe()
1 loop, best of 3: 3.12 s per loop

In [8]: %timeit df.groupby('key')['values'].pct_change()
1 loop, best of 3: 1.09 s per loop

In [9]: %timeit df.groupby('key')['values'].mad()
1 loop, best of 3: 991 ms per loop

In [10]: %timeit df.groupby('key')['values'].sem() #  specifically defined
100 loops, best of 3: 2.21 ms per loop

In [11]: %timeit df.groupby('key')['values'].prod() #  specifically defined
100 loops, best of 3: 2.25 ms per loop
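
The comparison above can be approximated outside IPython with the standard `timeit` module. This is a sketch under assumptions: the session does not show how `df` was built, so `make_frame` and its random integer ranges are guesses, and `time_groupby_methods` is a hypothetical helper. mad is left out because newer pandas releases no longer provide it.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=2000, seed=0):
    """Build a two-column frame of random ints (assumed, not the original data)."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        "key": rng.randint(0, n_rows, size=n_rows),
        "values": rng.randint(0, n_rows, size=n_rows),
    })


def time_groupby_methods(df, methods, repeat=3):
    """Return {method name: best wall-clock seconds over `repeat` single calls}."""
    grouped = df.groupby("key")["values"]
    timings = {}
    for name in methods:
        bound = getattr(grouped, name)
        timings[name] = min(timeit.repeat(bound, number=1, repeat=repeat))
    return timings


# 400 rows keeps the sketch quick; pass n_rows=2000 to match the session above.
timings = time_groupby_methods(
    make_frame(n_rows=400), ["describe", "pct_change", "sem", "prod"]
)
for name, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12} {seconds * 1e3:9.3f} ms")
```

Absolute numbers will differ by machine and pandas version, so only the relative ordering on identical data is worth reading from the output.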

mroeschke · 2018-01-09T17:48:28Z

I believe it's better to keep the size of the data equivalent for all the groupby methods so it's easier to identify which methods (like mad, pct_change) might need a performance boost.

jreback · 2018-01-10T00:19:38Z

@mroeschke OK, great. Thanks for the comment. Yes, some of these are non-performant.

jreback · 2018-01-10T00:20:31Z

If you want to go through the non-performant ones and check whether we already have issues for them, that would be great (I know we have some, but ideally we could just create a single master issue).

CLN: ASV long and broken benchmarks

5898212

jreback added the Benchmark Performance (ASV) benchmarks label Jan 7, 2018

jreback added this to the 0.23.0 milestone Jan 10, 2018

jreback merged commit 7ea5179 into pandas-dev:master Jan 10, 2018

mroeschke deleted the asv_cleanups branch January 10, 2018 00:30

mroeschke mentioned this pull request Jan 10, 2018

PERF: Discrepancy in groupby methods #19165

Closed

7 tasks

maximveksler pushed a commit to maximveksler/pandas that referenced this pull request Jan 11, 2018

CLN: ASV long and broken benchmarks (pandas-dev#19113)

e1e6e9d
