
PERF: improve performance of groupby rank (#21237) #21285


Merged: 1 commit merged into pandas-dev:master on Jun 14, 2018

Conversation

@peterpanmj (Contributor) commented Jun 1, 2018

@jreback (Contributor) commented Jun 1, 2018

what does the asv look like? eg the perf improvement

@codecov (bot) commented Jun 1, 2018

Codecov Report

Merging #21285 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #21285      +/-   ##
==========================================
- Coverage   91.89%   91.89%   -0.01%     
==========================================
  Files         153      153              
  Lines       49600    49600              
==========================================
- Hits        45580    45578       -2     
- Misses       4020     4022       +2
Flag       Coverage Δ
#multiple  90.29% <ø> (-0.01%) ⬇️
#single    41.86% <ø> (ø) ⬆️

Impacted Files          Coverage Δ
pandas/util/testing.py  84.6% <0%> (-0.21%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ab668b0...fbb05d4.

@jschendel added the Groupby and Performance (Memory or execution speed) labels on Jun 1, 2018
@peterpanmj (Contributor, Author) commented:

The following is the output of asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods

· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For pandas commit hash cb3f778:
[ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running groupby.GroupByMethods.time_dtype_as_field 53.5±4μs;...
[ 50.00%] ··· Running groupby.GroupByMethods.time_dtype_as_group 56.5±0μs;...
[ 50.00%] · For pandas commit hash 4274b84:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running groupby.GroupByMethods.time_dtype_as_field 55.8±0μs;...
[100.00%] ··· Running groupby.GroupByMethods.time_dtype_as_group 55.8±0μs;...
       before           after         ratio
     [4274b84]       [cb3f778]
         1.16s            1.00s     0.86  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@peterpanmj (Contributor, Author) commented:

When a single value occurs many times, the current function is very slow due to constant writing to the output array out. (A sketch of the fix follows the timings below.)
Master:

df = pd.DataFrame({"A":[1,2,3]*10000 ,"B":[1]*30000})

In [31]: %%timeit
    ...: t = df.groupby("B").rank()

608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

In [2]: df = pd.DataFrame({"A":[1,2,3]*10000 ,"B":[1]*30000})

In [3]: %%timeit
   ...: t = df.groupby("B").rank()
   ...:
4.03 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
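
A minimal pure-Python sketch of the idea behind the fix (illustrative only, not the PR's actual Cython code; rank_average_sketch is a hypothetical helper covering just the 'average' tie method): instead of updating out on every element of a run of tied values, accumulate the run and write each output slot exactly once when the run ends.

import numpy as np

def rank_average_sketch(values):
    # Rank a 1-D array with method='average', writing each out slot once.
    N = len(values)
    _as = np.argsort(values, kind='mergesort')  # stable sort order
    out = np.empty(N, dtype=np.float64)
    dups = sum_ranks = 0
    for i in range(N):
        dups += 1
        sum_ranks += i + 1
        # flush only when the run of equal values ends (or at array end)
        if i == N - 1 or values[_as[i]] != values[_as[i + 1]]:
            avg = sum_ranks / dups
            for j in range(i - dups + 1, i + 1):
                out[_as[j]] = avg
            dups = sum_ranks = 0
    return out

For example, rank_average_sketch(np.array([1, 1, 1, 2])) returns [2., 2., 2., 4.], matching Series.rank(method='average'). The flush loop touches each position once in total, so the ranking pass stays linear even when every value in a group is tied.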

@jreback (Contributor) commented Jun 2, 2018

can we add this as an asv?

@jreback (Contributor) left a comment:

also pls add a whatsnew for 0.24

# could potentially be optimized to only write to the
# result once the last duplicate value is encountered
grp_vals_seen = i + 1
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Review comment (Contributor):

can u add some comments here on the optimization

@jreback added this to the 0.24.0 milestone on Jun 3, 2018
@@ -63,7 +63,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

-
- Improved performance of :func:`pandas.core.groupby.GroupBy.rank` (:issue:`21237`)
Review comment (Member):

Should just further clarify that this applies "when dealing with tied rankings"

grp_vals_seen = i + 1
# when label transition happens, update grp_size to keep track
# of number of nans encountered and increment grp_tie_count
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Review comment (Member):

Does this section pass LINTing? Whitespace between operators and operands looks off

Reply (Contributor, Author):

Usually I use flake8 <python-filepath> (I am using a Windows machine). But since this is not a standard .py file, there are too many PEP8 issues listed, and I have to find the real issues by examining each of them. Any suggestions on a better way to deal with it?

@WillAyd (Member) commented Jun 6, 2018:

Hmm ok that’s right I suppose these files are not flakeable. Just make sure to have whitespace on both sides of the relational operators
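
One possible way to cut the noise, offered here as a suggestion rather than something from the thread: restrict flake8 to just the whitespace checks in question, e.g.

    flake8 --select=E225,E226,E227,E228 <path-to-pyx-file>

E225-E228 are pycodestyle's missing-whitespace-around-operator codes; the file path is a placeholder, and Cython-only syntax can still produce spurious reports that need manual filtering.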

grp_vals_seen = i + 1
# when label transition happens, update grp_size to keep track
# of number of nans encountered and increment grp_tie_count
if (i== N-1) or (labels[_as[i]] != labels[_as[i+1]]):
Review comment (Member):

Given this same condition is evaluated later within the function, is there not a way to consolidate and have it only appear once?

@peterpanmj (Contributor, Author) commented Jun 11, 2018

Here is the output of the newly added asv benchmark groupby.RankWithTies:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 690fac2f:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running groupby.RankWithTies.time_rank_ties          1.61±0ms;...
[ 50.00%] · For pandas commit hash 0c65c57a:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running groupby.RankWithTies.time_rank_ties          46.8±0ms;...

       before           after         ratio
     [0c65c57a]       [690fac2f]
-        31.2±3ms       1.71±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
-        31.2±3ms         1.65±0ms     0.05  groupby.RankWithTies.time_rank_ties('int64', 'dense')
-        31.2±2ms      1.61±0.09ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'max')
-        31.2±2ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'min')
-        33.8±3ms      1.73±0.09ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'max')
-        31.2±3ms       1.59±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'dense')
-        31.2±2ms       1.53±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float64', 'dense')
-        31.2±0ms       1.51±0.1ms     0.05  groupby.RankWithTies.time_rank_ties('float32', 'min')
-        33.8±3ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('int64', 'min')
-        33.8±3ms      1.56±0.03ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
-        35.1±2ms         1.61±0ms     0.05  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
-        33.8±3ms         1.49±0ms     0.04  groupby.RankWithTies.time_rank_ties('int64', 'max')
-        46.8±0ms       1.76±0.1ms     0.04  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
-        46.8±0ms         1.71±0ms     0.04  groupby.RankWithTies.time_rank_ties('int64', 'first')
-        46.8±0ms         1.61±0ms     0.03  groupby.RankWithTies.time_rank_ties('float64', 'first')
-        46.8±0ms       1.56±0.1ms     0.03  groupby.RankWithTies.time_rank_ties('float32', 'first')
-         203±0ms       1.76±0.1ms     0.01  groupby.RankWithTies.time_rank_ties('int64', 'average')
-         203±6ms         1.61±0ms     0.01  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
-         203±0ms       1.59±0.1ms     0.01  groupby.RankWithTies.time_rank_ties('float64', 'average')
-         203±6ms         1.59±0ms     0.01  groupby.RankWithTies.time_rank_ties('float32', 'average')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@@ -63,7 +63,8 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of :func:`Series.describe` in case of numeric dtypes (:issue:`21274`)
Review comment (Contributor):

looks like u removed another entry?

Reply (Contributor, Author):

I didn't notice that. It was a mistake I made by accident.

# reset the dups and sum_ranks, knowing that a new value is
# coming up. the conditional also needs to handle nan equality
# and the end of iteration
if (i == N - 1 or
Review comment (Member):

Can this be simplified to just:

if labels[_as[i]] == labels[_as[i+1]]:

Since all of the other conditions are already accounted for higher up in the scope?

# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# sure to reset any of the items helping to calculate dups
if i == N - 1 or labels[_as[i]] != labels[_as[i+1]]:
Review comment (Member):

If the above is true, maybe an else could be used more effectively here to reduce code.

@peterpanmj (Contributor, Author) commented Jun 12, 2018:

How about the i == N - 1 scenario? I can rewrite it to the following:

if labels[_as[i]] == labels[_as[i+1]]:
    # increment temp values
else:
    # set grp_size

This is not equivalent to my current fix, because those two clauses are mutually exclusive. To keep the behavior I would have to do:

if labels[_as[i]] == labels[_as[i+1]] and i != N - 1:
    # increment temp values
elif i == N - 1:
    # increment temp values and set grp_size
else:
    # set grp_size

which is less readable.

Reply (Member):

Hmm OK thanks for talking me through - I think you are right that that wouldn't improve readability.

@jreback (Contributor) left a comment:

tiny comment and 1 comment left from @WillAyd

        else:
            data = np.array([1] * N, dtype=dtype)
        self.df = DataFrame({'values': data, 'key': ['foo'] * N})
        self.method = tie_method
Review comment (Contributor):

you don't actually need the method here

        self.method = tie_method

    def time_rank_ties(self, dtype, tie_method):
        self.df.groupby('key').rank(method=self.method)
Review comment (Contributor):

use tie_method directly
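
Putting the two review suggestions together, the benchmark class would look roughly like the sketch below. It is reconstructed from the snippets above, so the exact parameter lists and N in the merged PR may differ, and the datetime64 setup branch hinted at by the else: above is elided.

import numpy as np
from pandas import DataFrame

class RankWithTies(object):
    # dtype and tie_method values taken from the asv results posted
    # earlier; 'datetime64' is omitted because its setup needs the
    # special branch not shown in the snippets.
    params = [['int64', 'float32', 'float64'],
              ['first', 'average', 'dense', 'min', 'max']]
    param_names = ['dtype', 'tie_method']

    def setup(self, dtype, tie_method):
        N = 10**4  # assumed size; the PR's actual N may differ
        data = np.array([1] * N, dtype=dtype)
        self.df = DataFrame({'values': data, 'key': ['foo'] * N})

    def time_rank_ties(self, dtype, tie_method):
        # per the review: use tie_method directly, no self.method needed
        self.df.groupby('key').rank(method=tie_method)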

@jreback merged commit 2a33926 into pandas-dev:master on Jun 14, 2018
@jreback (Contributor) commented Jun 14, 2018

thanks @peterpanmj

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels: Groupby, Performance (Memory or execution speed)
Development

Successfully merging this pull request may close these issues.

PERF: groupby rank is slow when tie count is big
4 participants