
PERF: improve performance of groupby rank (#21237) #21285


Merged: 1 commit merged into pandas-dev:master on Jun 14, 2018

Conversation

@peterpanmj (Contributor) commented Jun 1, 2018

@jreback (Contributor) commented Jun 1, 2018

what does the asv look like? eg the perf improvement

@codecov (bot) commented Jun 1, 2018

Codecov Report

Merging #21285 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #21285      +/-   ##
==========================================
- Coverage   91.89%   91.89%   -0.01%     
==========================================
  Files         153      153              
  Lines       49600    49600              
==========================================
- Hits        45580    45578       -2     
- Misses       4020     4022       +2
Flag       Coverage Δ
#multiple  90.29% <ø> (-0.01%) ⬇️
#single    41.86% <ø> (ø) ⬆️

Impacted Files          Coverage Δ
pandas/util/testing.py  84.6% <0%> (-0.21%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ab668b0...fbb05d4.

@jschendel added the Groupby and Performance (Memory or execution speed) labels on Jun 1, 2018
@peterpanmj (Contributor, Author) commented:

The following is the output of asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods

· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For pandas commit hash cb3f778:
[ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running groupby.GroupByMethods.time_dtype_as_field 53.5±4μs;...
[ 50.00%] ··· Running groupby.GroupByMethods.time_dtype_as_group 56.5±0μs;...
[ 50.00%] · For pandas commit hash 4274b84:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running groupby.GroupByMethods.time_dtype_as_field 55.8±0μs;...
[100.00%] ··· Running groupby.GroupByMethods.time_dtype_as_group 55.8±0μs;...
       before           after         ratio
     [4274b84]       [cb3f778]
         1.16s            1.00s     0.86  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@peterpanmj (Contributor, Author) commented:

When a single value occurs many times, the current function is very slow due to constant writing to the output array out. (A sketch of the fix follows the timings below.)
Master:

df = pd.DataFrame({"A":[1,2,3]*10000 ,"B":[1]*30000})

In [31]: %%timeit
    ...: t = df.groupby("B").rank()

608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

In [2]: df = pd.DataFrame({"A":[1,2,3]*10000 ,"B":[1]*30000})

In [3]: %%timeit
   ...: t = df.groupby("B").rank()
   ...:
4.03 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
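
A minimal pure-Python sketch of the idea behind the fix (illustrative only, not the PR's actual Cython code; rank_average_sketch is a hypothetical helper covering just the 'average' tie method): instead of updating out on every element of a run of tied values, accumulate the run and write each output slot exactly once when the run ends.

import numpy as np

def rank_average_sketch(values):
    # Rank a 1-D array with method='average', writing each out slot once.
    N = len(values)
    _as = np.argsort(values, kind='mergesort')  # stable sort order
    out = np.empty(N, dtype=np.float64)
    dups = sum_ranks = 0
    for i in range(N):
        dups += 1
        sum_ranks += i + 1
        # flush only when the run of equal values ends (or at array end)
        if i == N - 1 or values[_as[i]] != values[_as[i + 1]]:
            avg = sum_ranks / dups
            for j in range(i - dups + 1, i + 1):
                out[_as[j]] = avg
            dups = sum_ranks = 0
    return out

For example, rank_average_sketch(np.array([1, 1, 1, 2])) returns [2., 2., 2., 4.], matching Series.rank(method='average'). The flush loop touches each position once in total, so the ranking pass stays linear even when every value in a group is tied.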

@jreback (Contributor) commented Jun 2, 2018

can we add this as an asv?

@jreback (Contributor) left a comment:

also pls add a whatsnew for 0.24

# could potentially be optimized to only write to the
# result once the last duplicate value is encountered
grp_vals_seen = i + 1
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Review comment (Contributor):

can u add some comments here on the optimization

@jreback added this to the 0.24.0 milestone on Jun 3, 2018
@@ -63,7 +63,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

-
- Improved performance of :func:`pandas.core.groupby.GroupBy.rank` (:issue:`21237`)
Review comment (Member):

Should just further clarify that this applies "when dealing with tied rankings"

grp_vals_seen = i + 1
# when label transition happens, update grp_size to keep track
# of number of nans encountered and increment grp_tie_count
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Review comment (Member):

Does this section pass LINTing? Whitespace between operators and operands looks off

Reply (Contributor, Author):

Usually I use flake8 <python-filepath> (I am using a Windows machine). But since this is not a standard .py file, there are too many PEP8 issues listed, and I have to find the real issues by examining each of them. Any suggestions on a better way to deal with it?

@WillAyd (Member) commented Jun 6, 2018:

Hmm ok that’s right I suppose these files are not flakeable. Just make sure to have whitespace on both sides of the relational operators
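
One possible way to cut the noise, offered here as a suggestion rather than something from the thread: restrict flake8 to just the whitespace checks in question, e.g.

    flake8 --select=E225,E226,E227,E228 <path-to-pyx-file>

E225-E228 are pycodestyle's missing-whitespace-around-operator codes; the file path is a placeholder, and Cython-only syntax can still produce spurious reports that need manual filtering.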

grp_vals_seen = i + 1
# when label transition happens, update grp_size to keep track
# of number of nans encountered and increment grp_tie_count
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Review comment (Member):

Given this same condition is evaluated later within the function, is there not a way to consolidate and have it only appear once?

@peterpanmj (Contributor, Author) commented Jun 11, 2018

Here is the output of the newly added asv benchmark groupby.RankWithTies:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 690fac2f:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running groupby.RankWithTies.time_rank_ties          1.61±0ms;...
[ 50.00%] · For pandas commit hash 0c65c57a:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running groupby.RankWithTies.time_rank_ties          46.8±0ms;...

       before           after         ratio
     [0c65c57a]       [690fac2f]
-        31.2±3ms       1.71±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
-        31.2±3ms         1.65±0ms     0.05  groupby.RankWithTies.time_rank_ties('int64', 'dense')
-        31.2±2ms      1.61±0.09ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'max')
-        31.2±2ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'min')
-        33.8±3ms      1.73±0.09ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'max')
-        31.2±3ms       1.59±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'dense')
-        31.2±2ms       1.53±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'dense')
-        31.2±0ms       1.51±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'min')
-        33.8±3ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('int64', 'min')
-        33.8±3ms      1.56±0.03ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
-        35.1±2ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
-        33.8±3ms         1.49±0ms     0.04  groupby.RankWithTies.time_rank_ties('int64', 'max')
-        46.8±0ms       1.76±0.1ms     0.04  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
-        46.8±0ms         1.71±0ms     0.04  groupby.RankWithTies.time_rank_ties('int64', 'first')
-        46.8±0ms         1.61±0ms     0.03  groupby.RankWithTies.time_rank_ties('float64', 'first')
-        46.8±0ms       1.56±0.1ms     0.03  groupby.RankWithTies.time_rank_ties('float32', 'first')
-         203±0ms       1.76±0.1ms     0.01  groupby.RankWithTies.time_rank_ties('int64', 'average')
-         203±6ms         1.61±0ms     0.01  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
-         203±0ms       1.59±0.1ms     0.01  groupby.RankWithTies.time_rank_ties('float64', 'average')
-         203±6ms         1.59±0ms     0.01  groupby.RankWithTies.time_rank_ties('float32', 'average')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@@ -63,7 +63,8 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of :func:`Series.describe` in case of numeric dtypes (:issue:`21274`)
Review comment (Contributor):

looks like u removed another entry?

Reply (Contributor, Author):

I didn't notice that. It was a mistake I made by accident.

# reset the dups and sum_ranks, knowing that a new value is
# coming up. the conditional also needs to handle nan equality
# and the end of iteration
if (i == N - 1 or
Review comment (Member):

Can this be simplified to just:

if labels[_as[i]] == labels[_as[i+1]]:

Since all of the other conditions are already accounted for higher up in the scope?

# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# sure to reset any of the items helping to calculate dups
if i == N - 1 or labels[_as[i]] != labels[_as[i+1]]:
Review comment (Member):

If the above is true, maybe an else could be used more effectively here to reduce code.

@peterpanmj (Contributor, Author) commented Jun 12, 2018:

How about the i == N - 1 scenario? I can rewrite it to the following:

if labels[_as[i]] == labels[_as[i+1]]:
    # increment temp values
else:
    # set grp_size

This is not equivalent to my current fix, because those two clauses are mutually exclusive. To keep the behavior I would have to do:

if labels[_as[i]] == labels[_as[i+1]] and i != N - 1:
    # increment temp values
elif i == N - 1:
    # increment temp values and set grp_size
else:
    # set grp_size

which is less readable.

Reply (Member):

Hmm OK thanks for talking me through - I think you are right that that wouldn't improve readability.

@jreback (Contributor) left a comment:

tiny comment and 1 comment left from @WillAyd

        else:
            data = np.array([1] * N, dtype=dtype)
        self.df = DataFrame({'values': data, 'key': ['foo'] * N})
        self.method = tie_method
Review comment (Contributor):

you don't actually need the method here

        self.method = tie_method

    def time_rank_ties(self, dtype, tie_method):
        self.df.groupby('key').rank(method=self.method)
Review comment (Contributor):

use tie_method directly
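
Putting the two review suggestions together, the benchmark class would look roughly like the sketch below. It is reconstructed from the snippets above, so the exact parameter lists and N in the merged PR may differ, and the datetime64 setup branch hinted at by the else: above is elided.

import numpy as np
from pandas import DataFrame

class RankWithTies(object):
    # dtype and tie_method values taken from the asv results posted
    # earlier; 'datetime64' is omitted because its setup needs the
    # special branch not shown in the snippets.
    params = [['int64', 'float32', 'float64'],
              ['first', 'average', 'dense', 'min', 'max']]
    param_names = ['dtype', 'tie_method']

    def setup(self, dtype, tie_method):
        N = 10**4  # assumed size; the PR's actual N may differ
        data = np.array([1] * N, dtype=dtype)
        self.df = DataFrame({'values': data, 'key': ['foo'] * N})

    def time_rank_ties(self, dtype, tie_method):
        # per the review: use tie_method directly, no self.method needed
        self.df.groupby('key').rank(method=tie_method)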

@jreback merged commit 2a33926 into pandas-dev:master on Jun 14, 2018
@jreback (Contributor) commented Jun 14, 2018

thanks @peterpanmj

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels: Groupby, Performance (Memory or execution speed)
Development

Successfully merging this pull request may close these issues.

PERF: groupby rank is slow when tie count is big
4 participants