PERF: Fix performance regression with Series statistical ops (#25952) #25953

ArtificialQualia · 2019-04-02T01:22:45Z

closes Fix performance regression with Series operations #25952
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

#25743 introduced a performance regression in Series MultiIndex statistical operations.

This happened because Series groupby operations now require an additional check that already existed for DataFrame. By running that check only when it is actually needed, the regression is avoided, and similar DataFrame operations may also get slightly faster.
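For context, here is a minimal sketch of the kind of call the affected benchmarks exercise. The data and the level names (`outer`, `inner`) are made up for illustration, and `groupby(level=...)` is used as a stand-in for the level-wise statistical ops, on the assumption that they go through the same groupby setup code.

```python
import numpy as np
import pandas as pd

# Hypothetical small data; the real benchmark uses a much larger index.
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["outer", "inner"]
)
s = pd.Series(np.arange(6, dtype="float64"), index=index)

# Reducing over one index level builds a grouper first, so any extra
# work done while setting up the grouper is paid on every such call.
by_outer = s.groupby(level="outer").sum()
by_inner = s.groupby(level=1).mean()

print(by_outer)
print(by_inner)
```

Because the grouper setup cost is roughly fixed per call, it would show up most clearly on cheap reductions such as `sum` and `mean`.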

jreback

pls create an asv for this or show a current one that is affected
u don’t need a note as this is unreleased code

WillAyd · 2019-04-02T01:26:32Z

Can you post ASV results for this? Also does this impact performance of DataFrame ops?

ArtificialQualia · 2019-04-02T01:47:21Z

Here are the affected ASV tests:

       before           after         ratio
     [00c119c5]       [98abd413]
     <master>         <fix-groupby-performance>
-      20.8±0.2ms       18.2±0.5ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(0, 'mad')
-      12.1±0.2ms       9.71±0.2ms     0.80  stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')
-      12.1±0.2ms       9.61±0.2ms     0.79  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
-      8.58±0.2ms      5.96±0.03ms     0.69  stat_ops.SeriesMultiIndexOps.time_op(1, 'sem')
-      8.73±0.1ms       5.79±0.2ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
-     8.92±0.08ms      5.92±0.09ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(1, 'median')
-      8.67±0.1ms       5.71±0.1ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(0, 'median')
-      7.23±0.2ms       4.33±0.4ms     0.60  stat_ops.SeriesMultiIndexOps.time_op(1, 'prod')
-      7.67±0.2ms       4.48±0.3ms     0.58  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')
-      7.70±0.2ms      4.39±0.06ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')
-      7.13±0.1ms      3.92±0.03ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(0, 'sum')
-      7.35±0.2ms       4.01±0.1ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
-      7.24±0.1ms       3.94±0.2ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'mean')
-      7.18±0.1ms      3.90±0.05ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(0, 'mean')
-     7.62±0.06ms      4.12±0.04ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
-      7.15±0.1ms      3.84±0.08ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(0, 'prod')

Most MultiIndex DataFrame operations are improved as well, but only slightly. Not enough to meet the 10% threshold.

I'll run a full ASV tonight to see if there are any other major affected areas.
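As a reading aid for the table above, the `ratio` column is the `after` timing divided by the `before` timing. The sketch below recomputes it for three rows copied from the table. The factor of 1.1 is an assumption standing in for the "10% threshold", and `classify` is a hypothetical helper, not part of asv.

```python
# Timings in ms, copied from three rows of the table above.
FACTOR = 1.1  # assumed stand-in for the "10% threshold"

rows = {
    "time_op(0, 'skew')": (12.1, 9.71),
    "time_op(1, 'sem')": (8.58, 5.96),
    "time_op(0, 'sum')": (7.13, 3.92),
}


def classify(before, after, factor=FACTOR):
    """Return (ratio, verdict); a ratio below 1 means the new code is faster."""
    ratio = after / before
    if ratio < 1 / factor:
        verdict = "improved"
    elif ratio > factor:
        verdict = "regressed"
    else:
        verdict = "unchanged"
    return round(ratio, 2), verdict


results = {name: classify(b, a) for name, (b, a) in rows.items()}
for name, (ratio, verdict) in results.items():
    print(f"{name}: {ratio:.2f} {verdict}")
```

Under this assumption, a change that moves a benchmark by less than about 10% in either direction is reported as unchanged, which matches the remark about the DataFrame operations.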

codecov · 2019-04-02T02:57:55Z

Codecov Report

Merging #25953 into master will decrease coverage by 0.01%.
The diff coverage is 80%.

@@            Coverage Diff             @@
##           master   #25953      +/-   ##
==========================================
- Coverage   91.82%    91.8%   -0.02%     
==========================================
  Files         175      175              
  Lines       52581    52582       +1     
==========================================
- Hits        48280    48271       -9     
- Misses       4301     4311      +10

Flag	Coverage Δ
#multiple	`90.35% <80%> (-0.02%)`	⬇️
#single	`41.89% <0%> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/grouper.py	`97.44% <80%> (-0.73%)`	⬇️
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/indexes/base.py	`96.57% <0%> (-0.22%)`	⬇️
pandas/core/frame.py	`96.79% <0%> (-0.12%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 00c119c...98abd41.

codecov · 2019-04-02T02:57:56Z

Codecov Report

Merging #25953 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25953      +/-   ##
==========================================
- Coverage   91.98%   91.96%   -0.02%     
==========================================
  Files         175      175              
  Lines       52372    52369       -3     
==========================================
- Hits        48172    48161      -11     
- Misses       4200     4208       +8

Flag	Coverage Δ
#multiple	`90.52% <100%> (-0.01%)`	⬇️
#single	`40.7% <0%> (-0.14%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/grouper.py	`98.52% <100%> (+0.34%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/indexes/base.py	`96.72% <0%> (-0.22%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.61% <0%> (-0.11%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 48ea04f...92858c1.

ArtificialQualia · 2019-04-02T11:41:33Z

whatsnew line has been removed.

Here are the 'full' ASV results. I re-ran all the deviant tests and removed those that weren't consistent. Here are the final results:

       before           after         ratio
     [00c119c5]       [98abd413]
     <master>         <fix-groupby-performance>
-      20.5±0.2ms       18.5±0.3ms     0.90  stat_ops.SeriesMultiIndexOps.time_op(0, 'mad')
-     4.49±0.04ms       3.77±0.1ms     0.84  groupby.Size.time_category_size
-      13.9±0.2ms       9.63±0.1ms     0.69  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
-     8.78±0.06ms       5.84±0.1ms     0.67  stat_ops.SeriesMultiIndexOps.time_op(1, 'sem')
-      8.82±0.3ms       5.80±0.1ms     0.66  stat_ops.SeriesMultiIndexOps.time_op(1, 'median')
-      8.94±0.4ms       5.82±0.1ms     0.65  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
-      8.70±0.2ms       5.60±0.1ms     0.64  stat_ops.SeriesMultiIndexOps.time_op(0, 'median')
-      7.74±0.1ms      4.75±0.07ms     0.61  stat_ops.SeriesMultiIndexOps.time_op(1, 'std')
-      6.57±0.1ms       3.82±0.1ms     0.58  stat_ops.SeriesMultiIndexOps.time_op(0, 'mean')
-      8.62±0.1ms       4.96±0.2ms     0.58  groupby.TransformBools.time_transform_mean
-      7.64±0.1ms      4.37±0.09ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')
-      7.72±0.2ms      4.39±0.09ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')
-      7.31±0.1ms       4.13±0.1ms     0.57  stat_ops.SeriesMultiIndexOps.time_op(0, 'sum')
-      7.41±0.2ms      4.10±0.08ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
-     7.63±0.09ms      4.19±0.09ms     0.55  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
-      7.47±0.2ms       4.06±0.2ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'mean')
-      7.35±0.2ms       3.98±0.1ms     0.54  stat_ops.SeriesMultiIndexOps.time_op(1, 'prod')
-      7.27±0.2ms      3.83±0.07ms     0.53  stat_ops.SeriesMultiIndexOps.time_op(0, 'prod')

Notably, groupby.Size.time_category_size and groupby.TransformBools.time_transform_mean are also consistently faster by more than 10%.

pep8speaks · 2019-04-05T01:09:38Z

Hello @ArtificialQualia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-28 20:04:36 UTC

ArtificialQualia · 2019-04-21T21:59:32Z

Anything more needed on this PR?

jreback

can you show the asv's that improved for this change. I believe the regression is in 0.25, correct? so no whatsnew is then needed. (pls confirm)

jreback · 2019-04-28T18:28:12Z

pandas/core/groupby/grouper.py

@@ -520,21 +520,16 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
    any_arraylike = any(isinstance(g, (list, tuple, Series, Index, np.ndarray))
                        for g in keys)

-    try:
+    if (not any_callable and not any_arraylike and not any_groupers and


can you add some comments here on what is being checked
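To illustrate the pattern in the hunk above, here is a simplified sketch of guarding an expensive check behind cheap boolean flags instead of always attempting it. It is not the pandas source: `needs_tuple_wrap`, `names`, and `stats` are hypothetical, and the real condition continues past the point where the diff is cut off.

```python
def needs_tuple_wrap(keys, names, any_callable=False, any_arraylike=False,
                     any_groupers=False, stats=None):
    """Hypothetical stand-in for the guarded check; not the pandas code."""
    # Cheap boolean flags come first, so `and` short-circuits and the
    # per-key membership scan is skipped whenever it cannot matter.
    if not any_callable and not any_arraylike and not any_groupers:
        if stats is not None:
            stats["scans"] += 1
        # Expensive part: look every key up among the known labels.
        return not all(k in names for k in keys)
    return False


stats = {"scans": 0}
names = {"outer", "inner"}

# Plain label keys: the scan runs and finds a label that is not known.
first = needs_tuple_wrap(["outer", "missing"], names, stats=stats)

# Array-like key: the guard is False, so the (unhashable) list is never
# looked up in `names` and no scan is counted.
second = needs_tuple_wrap([[1, 2, 3]], names, any_arraylike=True, stats=stats)

print(first, second, stats["scans"])
```

The point of the sketch is only that the scan runs once across the two calls; which flags the real code tests is visible only in part in the diff.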

jreback · 2019-04-28T18:29:31Z

merge master as well

ArtificialQualia · 2019-04-28T20:05:13Z

Master has been merged

Correct, the change this is fixing was merged in 0.25. No need for a whatsnew. (see #25743)

The ASV results are unchanged from my previous comment above (the 'full' ASV run), including the consistent improvements of more than 10% in groupby.Size.time_category_size and groupby.TransformBools.time_transform_mean.

jreback · 2019-04-28T20:57:27Z

thanks

ArtificialQualia added 2 commits April 1, 2019 21:16

PERF: Fix performance regression with Series statistical ops (pandas-dev#25952)

b147e87

clarify whatsnew

98abd41

jreback requested changes Apr 2, 2019

WillAyd added the Performance (memory or execution speed performance) label Apr 2, 2019

removed whatsnew

10bc092

jreback requested changes Apr 2, 2019

jreback reviewed Apr 5, 2019

removing unnecessary code

ac2f9f2

ArtificialQualia added 5 commits April 4, 2019 21:09

fix spacing

1957a22

force rebuild

ce89e16

force rebuild

7020852

fore rebuild

d462002

force rebuild

0f42f2e

jreback requested changes Apr 28, 2019

ArtificialQualia added 2 commits April 28, 2019 15:43

Merge branch 'master' of https://github.com/pandas-dev/pandas into fix-groupby-performance

854e784

added comment

92858c1

jreback added this to the 0.25.0 milestone Apr 28, 2019

jreback approved these changes Apr 28, 2019

jreback merged commit f572ec4 into pandas-dev:master Apr 28, 2019
