PERF: improve performance of NDFrame.describe #21274

DataOmbudsman · 2018-05-31T12:11:36Z

A one-line change that lets describe calculate its percentiles more efficiently. The point is that computing all percentiles in a single pass is faster than computing each one separately.

describe (with default percentiles argument) becomes 25-30% faster than before for numerical Series and DataFrames.
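
The idea can be illustrated with a minimal sketch. It assumes the default percentiles (0.25, 0.5, 0.75) and is not the actual diff from this PR; the variable names are made up for the example. It only shows that a single `quantile` call with a list gives the same values as one call per percentile.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(123)
s = pd.Series(rng.randint(0, 100, 1000000))
percentiles = [0.25, 0.5, 0.75]  # assumed: the defaults used by describe()

# Separately: one quantile call per percentile
separately = [s.quantile(p) for p in percentiles]

# One pass: a single quantile call with the whole list
one_pass = s.quantile(percentiles).tolist()

# Both approaches should agree on the values
print(np.allclose(separately, one_pass))
```

Timing the two variants with `timeit` on your own machine is the easiest way to see how much the single call saves.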

Setup

import timeit

setup = '''
import numpy as np
import pandas as pd
np.random.seed(123)
s = pd.Series(np.random.randint(0, 100, 1000000))
'''

Benchmark

min(timeit.Timer('s.describe()', setup=setup).repeat(100, 1))

Results

On master:

0.06349272100487724

With this change:

0.04745814300258644

Results are similar for DataFrames.
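
For reference, a DataFrame variant of the same benchmark might look like the sketch below. The three-column layout and the column names are assumptions for illustration, and the repeat count is reduced to keep the run short; no timings are claimed here.

```python
import timeit

setup = '''
import numpy as np
import pandas as pd
np.random.seed(123)
# Column names 'a', 'b', 'c' are illustrative assumptions
df = pd.DataFrame({c: np.random.randint(0, 100, 1000000) for c in 'abc'})
'''

# Fewer repeats than the Series benchmark above, to keep the run short
best = min(timeit.Timer('df.describe()', setup=setup).repeat(5, 1))
print(best)
```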

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd · 2018-05-31T15:44:32Z

Typically for performance-related changes we look for an ASV to measure and track over time. Can you add one to asv_bench/benchmarks/frame_methods.py and post the results of the benchmark here?

codecov · 2018-05-31T17:01:27Z

Codecov Report

Merging #21274 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #21274      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49549       +3     
==========================================
+ Hits        45509    45512       +3     
  Misses       4037     4037

Flag	Coverage Δ
#multiple	`90.25% <ø> (ø)`	⬆️
#single	`41.87% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/generic.py	`96.12% <ø> (ø)`	⬆️
pandas/io/formats/csvs.py	`98.14% <0%> (+0.01%)`	⬆️
pandas/core/indexes/interval.py	`93.16% <0%> (+0.02%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cbec58e...6dda68e.

DataOmbudsman · 2018-06-01T10:59:54Z

Sure. Thanks for the suggestion. Here are my ASV benchmarks. These also show the improvement.

Setup

class Describe(object):

    goal_time = 0.2

    def setup(self):
        np.random.seed(123)
        self.df = DataFrame({
            'a': np.random.randint(0, 100, int(1e6)),
            'b': np.random.randint(0, 100, int(1e6)),
            'c': np.random.randint(0, 100, int(1e6)),
        })

    def time_series_describe(self):
        self.df['a'].describe()

    def time_dataframe_describe(self):
        self.df.describe()

Results

before	after	ratio	benchmark
689±10ms	495±6ms	0.72	frame_methods.Describe.time_dataframe_describe
234±9ms	166±6ms	0.71	frame_methods.Describe.time_series_describe
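
As a quick sanity check, the ratio column can be recomputed from the reported timings. The sketch below uses only the central values quoted in this thread (ignoring the ± spread) and also includes the timeit numbers from the first comment. The dictionary keys are shorthand labels chosen for this example.

```python
# (before, after) central values as reported in this thread
timings = {
    "asv_dataframe_describe_ms": (689, 495),
    "asv_series_describe_ms": (234, 166),
    "timeit_series_describe_s": (0.06349272100487724, 0.04745814300258644),
}

# ratio = after / before; a value below 1 means the change is faster
ratios = {name: round(after / before, 2)
          for name, (before, after) in timings.items()}

for name, ratio in ratios.items():
    print(name, ratio, "-> about {:.0f}% less time".format((1 - ratio) * 100))
```

The ASV ratios come out at 0.72 and 0.71, matching the table, and the timeit ratio is about 0.75, which is consistent with the 25-30% range stated in the PR description.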

WillAyd · 2018-06-02T20:57:38Z

OK thanks. Can you update your PR to include the benchmark and a whatsnew note for 0.24?

pep8speaks · 2018-06-04T11:33:40Z

Hello @DataOmbudsman! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 05, 2018 at 12:40 Hours UTC

mroeschke · 2018-06-04T22:52:34Z

asv_bench/benchmarks/frame_methods.py

+    goal_time = 0.2
+
+    def setup(self):
+        np.random.seed(123)


You can remove the random seed; this is handled when setup is imported at the top (from .pandas_vb_common import setup)
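
A standalone sketch of the benchmark class with the seed call dropped is shown below. It is an illustration of the suggestion, not necessarily the code as merged. In the real benchmark module the seeding is, per the comment above, handled by the imported `setup` from `.pandas_vb_common`, which this standalone version does not import.

```python
import numpy as np
from pandas import DataFrame


class Describe(object):

    goal_time = 0.2

    def setup(self):
        # No np.random.seed(...) call here: in the asv suite, seeding is
        # said to be handled by the `setup` imported from .pandas_vb_common
        # (that import is omitted in this standalone sketch).
        self.df = DataFrame({
            'a': np.random.randint(0, 100, int(1e6)),
            'b': np.random.randint(0, 100, int(1e6)),
            'c': np.random.randint(0, 100, int(1e6)),
        })

    def time_series_describe(self):
        self.df['a'].describe()

    def time_dataframe_describe(self):
        self.df.describe()
```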

jreback · 2018-06-05T10:39:56Z

doc/source/whatsnew/v0.24.0.txt

@@ -63,8 +63,7 @@ Removal of prior version deprecations/changes
 Performance Improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

-
-
+- Improved performance of :func:`Series.describe` in case of numeric dtypes


jreback: Can you add the issue number? (Use this PR number, as we don't have an issue.)

DataOmbudsman: OK, but I'm unsure what format is expected. Would a link to an external URL (such as this PR) be appropriate? E.g., `pull request #21274 <https://github.com/pandas-dev/pandas/pull/21274/>`_. Or something else?

jreback: Same format as all the others; just use :issue:`number`.

DataOmbudsman: I see now that the URL of the issue is translated to the URL of the PR. That's great.

jorisvandenbossche · 2018-06-05T17:01:22Z

@DataOmbudsman Thanks!

WillAyd added the Performance (Memory or execution speed performance) label May 31, 2018

DataOmbudsman added 3 commits June 4, 2018 12:31

PERF: improve performance of NDFrame.describe

7efefb9

Calculating percentiles in one pass is faster than separately.

Add ASV benchmark

7e3ad12

Add whatsnew entry

70668a1

DataOmbudsman force-pushed the improve-ndframe-describe-performance branch from 724f30e to 70668a1 Compare June 4, 2018 11:33

Add blank line for pep8

b866d0f

jorisvandenbossche approved these changes Jun 4, 2018

jorisvandenbossche added this to the 0.24.0 milestone Jun 4, 2018

WillAyd approved these changes Jun 4, 2018

mroeschke reviewed Jun 4, 2018

remove random seed

31216c8

jreback requested changes Jun 5, 2018

Add issue (PR) number

6dda68e

jorisvandenbossche merged commit 7dc6f70 into pandas-dev:master Jun 5, 2018

DataOmbudsman deleted the improve-ndframe-describe-performance branch June 8, 2018 08:12

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

PERF: improve performance of NDFrame.describe (pandas-dev#21274)

864ee3e
