Skip to content

PERF: Fix performance regression with Series statistical ops (#25952) #25953

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 28, 2019

Conversation

ArtificialQualia
Copy link
Contributor

#25743 introduced a performance regression in Series MultiIndex statistical operations.

This happened due to the additional required check for Series groupby operations that already existed for DataFrame. By ensuring that the check needs to be run, this regression can be avoided also potentially slightly speeding up similar DataFrame operations.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls create an asv for this or show a current one that is affected
u don’t need a note as this is unreleased code

@WillAyd
Copy link
Member

WillAyd commented Apr 2, 2019

Can you post ASV results for this? Also does this impact performance of DataFrame ops?

@WillAyd WillAyd added the Performance Memory or execution speed performance label Apr 2, 2019
@ArtificialQualia
Copy link
Contributor Author

Here is the affected ASV tests:

       before           after         ratio
     [00c119c5]       [98abd413]
     <master>         <fix-groupby-performance>
-      20.8±0.2ms       18.2±0.5ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(0, 'mad')
-      12.1±0.2ms       9.71±0.2ms     0.80  stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')
-      12.1±0.2ms       9.61±0.2ms     0.79  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
-      8.58±0.2ms      5.96±0.03ms     0.69  stat_ops.SeriesMultiIndexOps.time_op(1, 'sem')
-      8.73±0.1ms       5.79±0.2ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
-     8.92±0.08ms      5.92±0.09ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(1, 'median')
-      8.67±0.1ms       5.71±0.1ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(0, 'median')
-      7.23±0.2ms       4.33±0.4ms     0.60  stat_ops.SeriesMultiIndexOps.time_op(1, 'prod')
-      7.67±0.2ms       4.48±0.3ms     0.58  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')
-      7.70±0.2ms      4.39±0.06ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')
-      7.13±0.1ms      3.92±0.03ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(0, 'sum')
-      7.35±0.2ms       4.01±0.1ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
-      7.24±0.1ms       3.94±0.2ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'mean')
-      7.18±0.1ms      3.90±0.05ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(0, 'mean')
-     7.62±0.06ms      4.12±0.04ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
-      7.15±0.1ms      3.84±0.08ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(0, 'prod')

Most MultiIndex DataFrame operations are improved as well, but only slightly. Not enough to meet the 10% threshold.

I'll run a full ASV tonight to see if there are any other major affected areas.

@codecov
Copy link

codecov bot commented Apr 2, 2019

Codecov Report

Merging #25953 into master will decrease coverage by 0.01%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25953      +/-   ##
==========================================
- Coverage   91.82%    91.8%   -0.02%     
==========================================
  Files         175      175              
  Lines       52581    52582       +1     
==========================================
- Hits        48280    48271       -9     
- Misses       4301     4311      +10
Flag Coverage Δ
#multiple 90.35% <80%> (-0.02%) ⬇️
#single 41.89% <0%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/grouper.py 97.44% <80%> (-0.73%) ⬇️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/indexes/base.py 96.57% <0%> (-0.22%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 00c119c...98abd41. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 2, 2019

Codecov Report

Merging #25953 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25953      +/-   ##
==========================================
- Coverage   91.98%   91.96%   -0.02%     
==========================================
  Files         175      175              
  Lines       52372    52369       -3     
==========================================
- Hits        48172    48161      -11     
- Misses       4200     4208       +8
Flag Coverage Δ
#multiple 90.52% <100%> (-0.01%) ⬇️
#single 40.7% <0%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/grouper.py 98.52% <100%> (+0.34%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/indexes/base.py 96.72% <0%> (-0.22%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.61% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48ea04f...92858c1. Read the comment docs.

@ArtificialQualia
Copy link
Contributor Author

whatsnew line has been removed.

Here are the 'full' ASV results. I re-ran all the deviant tests and removed those that weren't consistent. Here are the final results:

       before           after         ratio
     [00c119c5]       [98abd413]
     <master>         <fix-groupby-performance>
-      20.5±0.2ms       18.5±0.3ms     0.90  stat_ops.SeriesMultiIndexOps.time_op(0, 'mad')
-     4.49±0.04ms       3.77±0.1ms     0.84  groupby.Size.time_category_size
-      13.9±0.2ms       9.63±0.1ms     0.69  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
-     8.78±0.06ms       5.84±0.1ms     0.67  stat_ops.SeriesMultiIndexOps.time_op(1, 'sem')
-      8.82±0.3ms       5.80±0.1ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(1, 'median')
-      8.94±0.4ms       5.82±0.1ms     0.65  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
-      8.70±0.2ms       5.60±0.1ms     0.64  stat_ops.SeriesMultiIndexOps.time_op(0, 'median')
-      7.74±0.1ms      4.75±0.07ms     0.61  stat_ops.SeriesMultiIndexOps.time_op(1, 'std')
-      6.57±0.1ms       3.82±0.1ms     0.58  stat_ops.SeriesMultiIndexOps.time_op(0, 'mean')
-      8.62±0.1ms       4.96±0.2ms     0.58  groupby.TransformBools.time_transform_mean
-      7.64±0.1ms      4.37±0.09ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')
-      7.72±0.2ms      4.39±0.09ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')
-      7.31±0.1ms       4.13±0.1ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(0, 'sum')
-      7.41±0.2ms      4.10±0.08ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
-     7.63±0.09ms      4.19±0.09ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
-      7.47±0.2ms       4.06±0.2ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'mean')
-      7.35±0.2ms       3.98±0.1ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'prod')
-      7.27±0.2ms      3.83±0.07ms     0.53  stat_ops.SeriesMultiIndexOps.time_op(0, 'prod')

Notably, groupby.Size.time_category_size and groupby.TransformBools.time_transform_mean are also consistently faster by more than 10%.

@pep8speaks
Copy link

pep8speaks commented Apr 5, 2019

Hello @ArtificialQualia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-28 20:04:36 UTC

@ArtificialQualia
Copy link
Contributor Author

Anything more needed on this PR?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show the asv's that improved for this change. I believe the regression is in 0.25, correct? so no whatsnew is then needed. (pls confirm)

@@ -520,21 +520,16 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
any_arraylike = any(isinstance(g, (list, tuple, Series, Index, np.ndarray))
for g in keys)

try:
if (not any_callable and not any_arraylike and not any_groupers and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments here on what is being checked

@jreback
Copy link
Contributor

jreback commented Apr 28, 2019

merge master as well

@ArtificialQualia
Copy link
Contributor Author

Master has been merged

Correct, the change this is fixing was merged in 0.25. No need for a whatsnew. (see #25743)

Here are the ASV results from a previous comment:

Here are the 'full' ASV results. I re-ran all the deviant tests and removed those that weren't consistent. Here are the final results:

       before           after         ratio
     [00c119c5]       [98abd413]
     <master>         <fix-groupby-performance>
-      20.5±0.2ms       18.5±0.3ms     0.90  stat_ops.SeriesMultiIndexOps.time_op(0, 'mad')
-     4.49±0.04ms       3.77±0.1ms     0.84  groupby.Size.time_category_size
-      13.9±0.2ms       9.63±0.1ms     0.69  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
-     8.78±0.06ms       5.84±0.1ms     0.67  stat_ops.SeriesMultiIndexOps.time_op(1, 'sem')
-      8.82±0.3ms       5.80±0.1ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(1, 'median')
-      8.94±0.4ms       5.82±0.1ms     0.65  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
-      8.70±0.2ms       5.60±0.1ms     0.64  stat_ops.SeriesMultiIndexOps.time_op(0, 'median')
-      7.74±0.1ms      4.75±0.07ms     0.61  stat_ops.SeriesMultiIndexOps.time_op(1, 'std')
-      6.57±0.1ms       3.82±0.1ms     0.58  stat_ops.SeriesMultiIndexOps.time_op(0, 'mean')
-      8.62±0.1ms       4.96±0.2ms     0.58  groupby.TransformBools.time_transform_mean
-      7.64±0.1ms      4.37±0.09ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')
-      7.72±0.2ms      4.39±0.09ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')
-      7.31±0.1ms       4.13±0.1ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(0, 'sum')
-      7.41±0.2ms      4.10±0.08ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
-     7.63±0.09ms      4.19±0.09ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
-      7.47±0.2ms       4.06±0.2ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'mean')
-      7.35±0.2ms       3.98±0.1ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'prod')
-      7.27±0.2ms      3.83±0.07ms     0.53  stat_ops.SeriesMultiIndexOps.time_op(0, 'prod')

Notably, groupby.Size.time_category_size and groupby.TransformBools.time_transform_mean are also consistently faster by more than 10%.

@jreback jreback added this to the 0.25.0 milestone Apr 28, 2019
@jreback jreback merged commit f572ec4 into pandas-dev:master Apr 28, 2019
@jreback
Copy link
Contributor

jreback commented Apr 28, 2019

thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix performance regression with Series operations
5 participants