PERF: improve performance of groupby rank (#21237) #21285
Conversation
what does the asv look like? e.g. the perf improvement
Codecov Report
@@ Coverage Diff @@
## master #21285 +/- ##
==========================================
- Coverage 91.89% 91.89% -0.01%
==========================================
Files 153 153
Lines 49600 49600
==========================================
- Hits 45580 45578 -2
- Misses 4020 4022 +2
Continue to review full report at Codecov.
The following is the output of
When a single value occurs many times, the current function is very slow due to constant writing to the result array.

Before:

df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

In [31]: %%timeit
    ...: t = df.groupby("B").rank()
608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

In [2]: df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

In [3]: %%timeit
   ...: t = df.groupby("B").rank()
   ...:
4.03 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
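The speedup comes from how runs of tied values are written out. As a rough illustration only (a hypothetical pure-Python sketch, not the PR's Cython code), a dense rank that writes each tie run once, at the position where its last duplicate is seen:

```python
import numpy as np

def dense_rank_ties(values):
    # Hypothetical sketch of the write-once strategy: instead of updating
    # the output on every iteration, a whole run of tied values is written
    # in one pass when its last duplicate is encountered.
    order = np.argsort(values, kind='mergesort')  # stable sort
    out = np.empty(len(values), dtype=np.float64)
    rank = 0
    start = 0  # index (in sorted order) where the current tie run began
    for i in range(len(order)):
        if i == len(order) - 1 or values[order[i]] != values[order[i + 1]]:
            # end of a tie run: assign its rank to every member at once
            rank += 1
            for j in range(start, i + 1):
                out[order[j]] = rank
            start = i + 1
    return out
```

With heavily tied input (as in the benchmark above), the inner write loop runs once per distinct value rather than once per row.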
can we add this as an asv?
also pls add a whatsnew for 0.24
pandas/_libs/groupby_helper.pxi.in
Outdated
# could potentially be optimized to only write to the
# result once the last duplicate value is encountered
grp_vals_seen = i + 1
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
can u add some comments here on the optimization
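For context on what such a comment should capture, here is a hypothetical pure-Python sketch (not the PR's Cython) of the same bookkeeping for method='average': tie statistics accumulate across a run, and the result is written once when the run ends:

```python
import numpy as np

def average_rank_ties(values):
    # Hypothetical sketch: `dups` counts members of the current tie run
    # and `sum_ranks` sums their 1-based sorted positions. The average
    # rank `sum_ranks / dups` is written to the whole run only when the
    # run ends (next value differs, or the array is exhausted).
    order = np.argsort(values, kind='mergesort')
    n = len(order)
    out = np.empty(n, dtype=np.float64)
    dups = 0
    sum_ranks = 0.0
    for i in range(n):
        dups += 1
        sum_ranks += i + 1
        if i == n - 1 or values[order[i]] != values[order[i + 1]]:
            for j in range(i - dups + 1, i + 1):
                out[order[j]] = sum_ranks / dups
            dups = 0
            sum_ranks = 0.0
    return out
```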
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -63,7 +63,7 @@ Removal of prior version deprecations/changes
 Performance Improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

-
- Improved performance of :func:`pandas.core.groupby.GroupBy.rank` (:issue:`21237`)
Should just further clarify that this applies "when dealing with tied rankings"
pandas/_libs/groupby_helper.pxi.in
Outdated
grp_vals_seen = i + 1
# when label transition happens, update grp_size to keep track
# of number of nans encountered and increment grp_tie_count
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Does this section pass LINTing? Whitespace between operators and operands looks off
Usually I use flake8 <python-filepath>. (I am using a Windows machine.) But since it is not a standard .py file, there are too many pep8 issues listed, and I have to find the real issue by examining each of them. Any suggestions on a better way to deal with it?
Hmm ok that’s right I suppose these files are not flakeable. Just make sure to have whitespace on both sides of the relational operators
pandas/_libs/groupby_helper.pxi.in
Outdated
grp_vals_seen = i + 1
# when label transition happens, update grp_size to keep track
# of number of nans encountered and increment grp_tie_count
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Given this same condition is evaluated later within the function, is there not a way to consolidate and have it only appear once?
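One way to avoid repeating the test, sketched here hypothetically (names are illustrative, not from the PR), is to evaluate the group-boundary condition once per position and reuse the boolean:

```python
import numpy as np

def group_boundaries(labels, order):
    # Hypothetical helper illustrating the consolidation idea: compute
    # `i == N - 1 or labels[order[i]] != labels[order[i + 1]]` once per
    # position, rather than repeating the same test at several sites in
    # the loop body.
    n = len(order)
    is_last_in_group = np.empty(n, dtype=bool)
    for i in range(n):
        is_last_in_group[i] = (
            i == n - 1 or labels[order[i]] != labels[order[i + 1]]
        )
    return is_last_in_group
```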
Here is the output of the newly added asv benchmark groupby.RankWithTies:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 690fac2f:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running groupby.RankWithTies.time_rank_ties    1.61±0ms;
...
[ 50.00%] · For pandas commit hash 0c65c57a:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running groupby.RankWithTies.time_rank_ties    46.8±0ms;
...
       before           after         ratio
     [0c65c57a]       [690fac2f]
-        31.2±3ms       1.71±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
-        31.2±3ms         1.65±0ms     0.05  groupby.RankWithTies.time_rank_ties('int64', 'dense')
-        31.2±2ms      1.61±0.09ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'max')
-        31.2±2ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'min')
-        33.8±3ms      1.73±0.09ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'max')
-        31.2±3ms       1.59±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'dense')
-        31.2±2ms       1.53±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'dense')
-        31.2±0ms       1.51±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'min')
-        33.8±3ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('int64', 'min')
-        33.8±3ms      1.56±0.03ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
-        35.1±2ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
-        33.8±3ms         1.49±0ms     0.04  groupby.RankWithTies.time_rank_ties('int64', 'max')
-        46.8±0ms       1.76±0.1ms     0.04  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
-        46.8±0ms         1.71±0ms     0.04  groupby.RankWithTies.time_rank_ties('int64', 'first')
-        46.8±0ms         1.61±0ms     0.03  groupby.RankWithTies.time_rank_ties('float64', 'first')
-        46.8±0ms       1.56±0.1ms     0.03  groupby.RankWithTies.time_rank_ties('float32', 'first')
-         203±0ms       1.76±0.1ms     0.01  groupby.RankWithTies.time_rank_ties('int64', 'average')
-         203±6ms         1.61±0ms     0.01  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
-         203±0ms       1.59±0.1ms     0.01  groupby.RankWithTies.time_rank_ties('float64', 'average')
-         203±6ms         1.59±0ms     0.01  groupby.RankWithTies.time_rank_ties('float32', 'average')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -63,7 +63,8 @@ Removal of prior version deprecations/changes
 Performance Improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of :func:`Series.describe` in case of numeric dtpyes (:issue:`21274`)
looks like u removed another entry?
I didn't notice that; it was a mistake I made by accident.
# reset the dups and sum_ranks, knowing that a new value is
# coming up. the conditional also needs to handle nan equality
# and the end of iteration
if (i == N - 1 or
Can this be simplified to just:
if labels[_as[i]] == labels[_as[i+1]]:
Since all of the other conditions are already accounted for higher up in the scope?
# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# sure to reset any of the items helping to calculate dups
if i == N - 1 or labels[_as[i]] != labels[_as[i+1]]:
If the above is true, could you maybe use else more effectively here to reduce code?
How about the i == N - 1 scenario? I can rewrite it to the following:

if labels[_as[i]] == labels[_as[i+1]]: (increment temp values)
else: (set grp_size)

but this is not equivalent to my current fix, since those two clauses are mutually exclusive. Then I would have to do:

if labels[_as[i]] == labels[_as[i+1]] and i != N - 1: (increment temp values)
elif i == N - 1: (increment temp values and set grp_size)
else: (set grp_size)

which is less readable.
Hmm OK thanks for talking me through - I think you are right that that wouldn't improve readability.
tiny comment and 1 comment left from @WillAyd
asv_bench/benchmarks/groupby.py
Outdated
else:
    data = np.array([1] * N, dtype=dtype)
self.df = DataFrame({'values': data, 'key': ['foo'] * N})
self.method = tie_method
you don't actually need the method here
asv_bench/benchmarks/groupby.py
Outdated
self.method = tie_method

def time_rank_ties(self, dtype, tie_method):
    self.df.groupby('key').rank(method=self.method)
use tie_method directly
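Putting both review points together, here is a hypothetical sketch of how the benchmark class could look (parameter lists and N are illustrative, not taken from the PR):

```python
import numpy as np
from pandas import DataFrame

class RankWithTies:
    # Hypothetical sketch of the asv benchmark discussed above: the
    # tie_method parameter is passed straight to rank(), so no
    # self.method attribute needs to be stored in setup().
    params = [['int64', 'float64'],
              ['average', 'min', 'max', 'first', 'dense']]
    param_names = ['dtype', 'tie_method']

    def setup(self, dtype, tie_method):
        N = 10 ** 4  # illustrative size
        data = np.array([1] * N, dtype=dtype)  # all ties: worst case
        self.df = DataFrame({'values': data, 'key': ['foo'] * N})

    def time_rank_ties(self, dtype, tie_method):
        self.df.groupby('key').rank(method=tie_method)
```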
Force-pushed from 1baae6d to fbb05d4.
thanks @peterpanmj
git diff upstream/master -u -- "*.py" | flake8 --diff