PERF: improve performance of NDFrame.describe #21274

DataOmbudsman
Contributor

@DataOmbudsman DataOmbudsman commented May 31, 2018

A one-line change that makes describe calculate its percentiles more efficiently. The point is that computing all percentiles in a single pass is faster than computing each one separately.

describe (with default percentiles argument) becomes 25-30% faster than before for numerical Series and DataFrames.
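The diff itself is not shown in this thread, so the following is only a sketch of the idea, not the actual change. It assumes the percentiles are obtained through `Series.quantile`, and compares one call per percentile with a single call that receives the whole list. Both give the same values; the single call avoids repeating the per-call work.

```python
import numpy as np
import pandas as pd

# Same kind of data as in the benchmark: one million random integers.
rng = np.random.RandomState(123)
s = pd.Series(rng.randint(0, 100, 1000000))

# Default percentiles reported by describe().
percentiles = [0.25, 0.5, 0.75]

# Separately: one quantile call per percentile.
separate = pd.Series([s.quantile(q) for q in percentiles], index=percentiles)

# In one pass: a single quantile call with the full list.
together = s.quantile(percentiles)

print(separate.tolist())
print(together.tolist())
```

Wrapping each variant in `timeit` on your own machine shows whether the gap matches the numbers reported here.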

Setup

import timeit

setup = '''
import numpy as np
import pandas as pd
np.random.seed(123)
s = pd.Series(np.random.randint(0, 100, 1000000))
'''

Benchmark

min(timeit.Timer('s.describe()', setup=setup).repeat(100, 1))

Results

On master:

0.06349272100487724

With this change:

0.04745814300258644

Results are similar for DataFrames.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@WillAyd
Member

WillAyd commented May 31, 2018

Typically for performance-related changes we look for an ASV to measure and track over time. Can you add one to asv_bench/benchmarks/frame_methods.py and post the results of the benchmark here?

@WillAyd WillAyd added the Performance Memory or execution speed performance label May 31, 2018
@codecov

codecov bot commented May 31, 2018

Codecov Report

Merging #21274 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #21274      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49549       +3     
==========================================
+ Hits        45509    45512       +3     
  Misses       4037     4037
Flag        Coverage Δ
#multiple   90.25% <ø> (ø) ⬆️
#single     41.87% <ø> (ø) ⬆️

Impacted Files                     Coverage Δ
pandas/core/generic.py             96.12% <ø> (ø) ⬆️
pandas/io/formats/csvs.py          98.14% <0%> (+0.01%) ⬆️
pandas/core/indexes/interval.py    93.16% <0%> (+0.02%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbec58e...6dda68e. Read the comment docs.

@DataOmbudsman
Contributor Author

Sure. Thanks for the suggestion. Here are my ASV benchmarks. These also show the improvement.

Setup

class Describe(object):

    goal_time = 0.2

    def setup(self):
        np.random.seed(123)
        self.df = DataFrame({
            'a': np.random.randint(0, 100, int(1e6)),
            'b': np.random.randint(0, 100, int(1e6)),
            'c': np.random.randint(0, 100, int(1e6)),
        })

    def time_series_describe(self):
        self.df['a'].describe()

    def time_dataframe_describe(self):
        self.df.describe()

Results

before      after      ratio
689±10ms    495±6ms    0.72    frame_methods.Describe.time_dataframe_describe
234±9ms     166±6ms    0.71    frame_methods.Describe.time_series_describe
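As a quick check on how the ratio column relates to the timings, the arithmetic can be redone by hand. This is only a sketch: it uses the mean values from the table and ignores the ± spread.

```python
# Mean timings in milliseconds, copied from the ASV results (before, after).
timings_ms = {
    "time_dataframe_describe": (689, 495),
    "time_series_describe": (234, 166),
}

ratios = {}
for name, (before, after) in timings_ms.items():
    # ratio = after / before; values below 1 mean the change is faster.
    ratios[name] = round(after / before, 2)
    saved_pct = 100 * (1 - after / before)
    print(f"{name}: ratio={ratios[name]:.2f}, time reduced by {saved_pct:.0f}%")
```

Both ratios correspond to a reduction in run time of a little under 30%, in line with the 25-30% figure from the original description.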

@WillAyd
Member

WillAyd commented Jun 2, 2018

OK thanks. Can you update your PR to include the benchmark and a whatsnew note for 0.24?

@DataOmbudsman DataOmbudsman force-pushed the improve-ndframe-describe-performance branch from 724f30e to 70668a1 Compare June 4, 2018 11:33
@pep8speaks

pep8speaks commented Jun 4, 2018

Hello @DataOmbudsman! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 05, 2018 at 12:40 Hours UTC

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jun 4, 2018
    goal_time = 0.2

    def setup(self):
        np.random.seed(123)
Member

You can remove the random seed; this is handled when setup is imported at the top (from .pandas_vb_common import setup)
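A rough, standalone sketch of what the benchmark could look like after that suggestion. The module-level `setup` below is a hypothetical stand-in for the one imported from `.pandas_vb_common`. It is assumed to seed NumPy's global RNG, and the seed value is made up for illustration.

```python
import numpy as np
from pandas import DataFrame


def setup(*args, **kwargs):
    # Hypothetical stand-in for the shared `setup` from .pandas_vb_common.
    # Assumption: it seeds NumPy's global RNG; the value is illustrative.
    np.random.seed(1234)


class Describe(object):

    goal_time = 0.2

    def setup(self):
        # No np.random.seed(...) here: seeding is left to the shared setup.
        self.df = DataFrame({
            'a': np.random.randint(0, 100, int(1e6)),
            'b': np.random.randint(0, 100, int(1e6)),
            'c': np.random.randint(0, 100, int(1e6)),
        })

    def time_series_describe(self):
        self.df['a'].describe()

    def time_dataframe_describe(self):
        self.df.describe()
```

Outside of ASV, calling `setup()` and then `Describe().setup()` by hand reproduces the same data on every run, which is all the removed seed line was providing.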

@@ -63,8 +63,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

-
-
- Improved performance of :func:`Series.describe` in case of numeric dtypes
Contributor

Can you add the issue number? (Use this PR's number, since we don't have an issue.)

Contributor Author

OK but I'm unsure about what format is expected. Do you think a link to an external URL (such as here) would be appropriate? E.g., `pull request #21274 <https://github.com/pandas-dev/pandas/pull/21274/>`_. Or something else?

Contributor

same format as all the others, just use :issue:`number`

Contributor Author

I see now that the URL of the issue is translated to the URL of the PR. That's great.

@jorisvandenbossche jorisvandenbossche merged commit 7dc6f70 into pandas-dev:master Jun 5, 2018
@jorisvandenbossche
Member

@DataOmbudsman Thanks!

@DataOmbudsman DataOmbudsman deleted the improve-ndframe-describe-performance branch June 8, 2018 08:12
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018