PERF: Define PeriodArray._values_for_argsort #24083

Merged: 1 commit, Dec 4, 2018

qwhelan (Contributor) commented Dec 4, 2018

This PR speeds up .groupby() and .set_index() operations involving a PeriodArray by 25-64x:

asv compare upstream/master HEAD -s --sort ratio

Benchmarks that have improved:

       before           after         ratio
     [08395af4]       [696b40f1]
     <period_array_argsort~1>       <parse_time_string>
-       4.77±0.1s          191±3ms     0.04  period.DataFramePeriodColumn.time_set_index
-       2.23±0.2s         35.6±2ms     0.02  groupby.Datelike.time_sum('period_range')

The underlying issue was that pd.core.algorithms.factorize() calls argsort() on the input arrays. For a PeriodArray, that call sorted raw Period objects one by one, via equality comparisons that also generated Offset objects. Assuming all elements of the array have the same frequency, we can simply sort the underlying integer ordinals and achieve the same result far faster.
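
The idea can be illustrated with a minimal, stdlib-only sketch. It is not the pandas implementation: `ToyPeriod`, `argsort_objects`, and `argsort_ordinals` are hypothetical names invented for this example, and the frequency check merely stands in for the per-comparison overhead described above. It only shows that, when every element shares one frequency, sorting the integer ordinals yields the same ordering as sorting the objects themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPeriod:
    """Hypothetical stand-in for a period: an integer ordinal plus a frequency."""

    ordinal: int
    freq: str

    def __lt__(self, other):
        # Every object-level comparison re-validates the frequency first.
        # This models the per-element overhead; it is not pandas code.
        if self.freq != other.freq:
            raise ValueError("mismatched frequencies")
        return self.ordinal < other.ordinal


def argsort_objects(periods):
    """Slow path: order the indices by comparing ToyPeriod objects."""
    return sorted(range(len(periods)), key=lambda i: periods[i])


def argsort_ordinals(periods):
    """Fast path: order the indices by the plain integer ordinals only."""
    ordinals = [p.ordinal for p in periods]
    return sorted(range(len(ordinals)), key=ordinals.__getitem__)


# All elements share the frequency "M", which is the assumption that makes
# the ordinal-only sort equivalent to the object sort.
periods = [ToyPeriod(o, "M") for o in (5, 2, 9, 2, 0)]
slow = argsort_objects(periods)
fast = argsort_ordinals(periods)
print(slow == fast)
```

Both paths use a stable sort, so ties between equal ordinals are resolved identically and the two index lists match exactly.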

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

pep8speaks commented

Hello @qwhelan! Thanks for submitting the PR.

codecov bot commented Dec 4, 2018

Codecov Report

Merging #24083 into master will increase coverage by 49.74%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master   #24083       +/-   ##
===========================================
+ Coverage   42.51%   92.25%   +49.74%     
===========================================
  Files         161      161               
  Lines       51689    51691        +2     
===========================================
+ Hits        21974    47689    +25715     
+ Misses      29715     4002    -25713

Flag         Coverage Δ
#multiple    90.66% <100%> (?)
#single      42.51% <50%> (ø) ⬆️

Impacted Files                          Coverage Δ
pandas/core/arrays/period.py            98.45% <100%> (+61.56%) ⬆️
pandas/core/computation/pytables.py     92.37% <0%> (+0.3%) ⬆️
pandas/io/pytables.py                   92.3% <0%> (+0.92%) ⬆️
pandas/util/_test_decorators.py         93.24% <0%> (+4.05%) ⬆️
pandas/compat/__init__.py               58.36% <0%> (+8.17%) ⬆️
pandas/core/config_init.py              99.24% <0%> (+9.84%) ⬆️
pandas/core/reshape/util.py             100% <0%> (+11.53%) ⬆️
pandas/compat/numpy/__init__.py         92.85% <0%> (+14.28%) ⬆️
pandas/core/computation/common.py       85.71% <0%> (+14.28%) ⬆️
pandas/core/api.py                      100% <0%> (+14.81%) ⬆️
... and 120 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3fe697f...e7a5744.

jreback added the Performance and Period labels Dec 4, 2018
jreback added this to the 0.24.0 milestone Dec 4, 2018
jreback merged commit 3a609ea into pandas-dev:master Dec 4, 2018

jreback (Contributor) commented Dec 4, 2018

thanks @qwhelan

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019