-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: improve performance of NDFrame.describe #21274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: improve performance of NDFrame.describe #21274
Conversation
Typically for performance-related changes we look for an ASV to measure and track over time. Can you add one to |
Codecov Report
@@ Coverage Diff @@
## master #21274 +/- ##
==========================================
+ Coverage 91.85% 91.85% +<.01%
==========================================
Files 153 153
Lines 49546 49549 +3
==========================================
+ Hits 45509 45512 +3
Misses 4037 4037
Continue to review full report at Codecov.
|
Sure. Thanks for the suggestion. Here are my ASV benchmarks. These also show the improvement. Setup
Results
|
OK thanks. Can you update your PR to include the benchmark and a whatsnew note for 0.24? |
Calculating percentiles in one pass is faster than separately.
724f30e
to
70668a1
Compare
Hello @DataOmbudsman! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 05, 2018 at 12:40 Hours UTC |
goal_time = 0.2 | ||
|
||
def setup(self): | ||
np.random.seed(123) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove the random seed; this is handled when setup
is imported at the top (from .pandas_vb_common import setup
)
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -63,8 +63,7 @@ Removal of prior version deprecations/changes | |||
Performance Improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- | |||
- | |||
- Improved performance of :func:`Series.describe` in case of numeric dtpyes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number (this pr number as we don't have an issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK but I'm unsure about what format is expected. Do you think a link to an external URL (such as here) would be appropriate? E.g., `pull request #21274 <https://github.com/pandas-dev/pandas/pull/21274/>`_
. Or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same format as all the others, just use :issue:`number`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that the URL of the issue is translated to the URL of the PR. That's great.
@DataOmbudsman Thanks! |
A one-line change that enables to calculate the percentiles in
describe
more efficiently. The point is that calculating percentiles in one pass is faster than separately.describe
(with defaultpercentiles
argument) becomes 25-30% faster than before for numerical Series and DataFrames.Setup
Benchmark
Results
On master:
With this change:
Results are similar for DataFrames.
git diff upstream/master -u -- "*.py" | flake8 --diff