BUG/REF: use sorted_rank_1d for rank_2d #41931

Merged (27 commits) on Jun 25, 2021

mzeitlin11
Member

@mzeitlin11 mzeitlin11 commented Jun 10, 2021

Built on #41916

There is a slight slowdown in 3 benchmarks because the sort now uses lexsort instead of argsort, sorting on both the data and the mask so that na_option is handled properly. I am not sure this can be avoided. On the plus side, the ranking now runs in a nogil block, so users running code in parallel may see a perf improvement.
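
For readers unfamiliar with the lexsort point, here is a minimal NumPy sketch of sorting on both the data and a missing-value mask. It is not the Cython code from this PR; the names `values`, `mask`, and `filled` and the `0.0` placeholder are assumptions made up for illustration.

```python
import numpy as np

# Illustrative data only; these names are hypothetical, not taken from the PR.
values = np.array([3.0, np.nan, 1.0, 2.0, np.nan])
mask = np.isnan(values)

# Fill missing slots with a placeholder so the value key never compares NaN.
filled = np.where(mask, 0.0, values)

# np.lexsort uses the LAST key as the primary key, so the mask decides
# where missing entries go and `filled` orders the remaining entries.
na_last = np.lexsort((filled, mask))    # missing values sorted to the end
na_first = np.lexsort((filled, ~mask))  # missing values sorted to the front

print(na_last.tolist())   # [2, 3, 0, 1, 4]
print(na_first.tolist())  # [1, 4, 2, 3, 0]
```

A plain argsort on the data alone has only one key, so it cannot be told to place missing entries first or last; the extra mask key is what makes that placement controllable, at the cost of a two-key sort.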

Benchmarks:

           before            after    ratio
       [499ef8c0]       [d678bbf0]
         <master>  <ref/rank_2d_dedup>
       7.14±0.1ms       7.04±0.4ms     0.99  categoricals.Rank.time_rank_int
       7.85±0.3ms       7.34±0.2ms     0.93  categoricals.Rank.time_rank_int_cat
       7.30±0.2ms       6.95±0.6ms     0.95  categoricals.Rank.time_rank_int_cat_ordered
          121±3ms         122±10ms     1.01  categoricals.Rank.time_rank_string
         8.66±1ms         8.13±1ms     0.94  categoricals.Rank.time_rank_string_cat
         6.37±1ms       6.11±0.4ms     0.96  categoricals.Rank.time_rank_string_cat_ordered
         9.34±2ms         10.1±1ms     1.08  frame_methods.Rank.time_rank('float')
         3.81±1ms       4.16±0.4ms     1.09  frame_methods.Rank.time_rank('int')
        59.0±20ms       48.0±0.9ms    ~0.81  frame_methods.Rank.time_rank('object')
       2.75±0.4ms       3.53±0.7ms    ~1.28  frame_methods.Rank.time_rank('uint')
      1.06±0.04ms      1.01±0.04ms     0.95  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
      1.13±0.06ms      1.02±0.04ms    ~0.90  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
       1.19±0.1ms      1.04±0.03ms    ~0.87  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
      1.12±0.04ms      1.04±0.06ms     0.93  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
      1.03±0.05ms      1.03±0.04ms     1.00  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
       1.22±0.1ms      1.10±0.07ms    ~0.90  groupby.RankWithTies.time_rank_ties('float32', 'average')
      1.13±0.09ms      1.12±0.04ms     0.99  groupby.RankWithTies.time_rank_ties('float32', 'dense')
       1.15±0.1ms      1.05±0.05ms     0.91  groupby.RankWithTies.time_rank_ties('float32', 'first')
-     1.19±0.05ms      1.00±0.03ms     0.84  groupby.RankWithTies.time_rank_ties('float32', 'max')
      1.18±0.04ms      1.13±0.08ms     0.96  groupby.RankWithTies.time_rank_ties('float32', 'min')
       1.25±0.1ms      1.17±0.03ms     0.94  groupby.RankWithTies.time_rank_ties('float64', 'average')
       1.45±0.5ms         990±10μs    ~0.68  groupby.RankWithTies.time_rank_ties('float64', 'dense')
       1.21±0.2ms      1.11±0.08ms     0.92  groupby.RankWithTies.time_rank_ties('float64', 'first')
      1.18±0.05ms      1.15±0.07ms     0.97  groupby.RankWithTies.time_rank_ties('float64', 'max')
       1.32±0.3ms         963±40μs    ~0.73  groupby.RankWithTies.time_rank_ties('float64', 'min')
       1.03±0.1ms         971±80μs     0.94  groupby.RankWithTies.time_rank_ties('int64', 'average')
      1.02±0.06ms      1.02±0.06ms     1.01  groupby.RankWithTies.time_rank_ties('int64', 'dense')
      1.13±0.08ms      1.01±0.04ms    ~0.89  groupby.RankWithTies.time_rank_ties('int64', 'first')
       1.24±0.1ms         959±20μs    ~0.77  groupby.RankWithTies.time_rank_ties('int64', 'max')
      1.14±0.06ms         979±50μs    ~0.86  groupby.RankWithTies.time_rank_ties('int64', 'min')
       10.9±0.6ms       9.80±0.4ms    ~0.90  series_methods.Rank.time_rank('float')
       7.18±0.3ms       7.15±0.6ms     1.00  series_methods.Rank.time_rank('int')
         53.7±1ms         49.6±4ms     0.92  series_methods.Rank.time_rank('object')
       7.64±0.2ms       7.78±0.6ms     1.02  series_methods.Rank.time_rank('uint')
       9.96±0.2ms       11.7±0.3ms    ~1.17  stat_ops.Rank.time_average_old('DataFrame', False)
+      9.89±0.2ms       11.9±0.6ms     1.20  stat_ops.Rank.time_average_old('DataFrame', True)
       13.5±0.9ms       12.6±0.8ms     0.93  stat_ops.Rank.time_average_old('Series', False)
         12.8±1ms       11.9±0.6ms     0.93  stat_ops.Rank.time_average_old('Series', True)
+      9.84±0.3ms       11.7±0.4ms     1.19  stat_ops.Rank.time_rank('DataFrame', False)
+      8.53±0.3ms       11.7±0.8ms     1.38  stat_ops.Rank.time_rank('DataFrame', True)
         14.0±3ms         12.2±1ms    ~0.87  stat_ops.Rank.time_rank('Series', False)
       13.0±0.9ms         12.1±1ms     0.93  stat_ops.Rank.time_rank('Series', True)

@mzeitlin11 mzeitlin11 added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, and Refactor (internal refactoring of code) labels Jun 10, 2021
@jreback jreback added this to the 1.4 milestone Jun 17, 2021
@jreback
Contributor

jreback commented Jun 17, 2021

lgtm. @jbrockmendel

@jbrockmendel
Member

LGTM; caveat: I'm not that familiar with the cases in #19560

@jreback jreback merged commit 7a38d63 into pandas-dev:master Jun 25, 2021
@jreback
Contributor

jreback commented Jun 25, 2021

thanks @mzeitlin11

@mzeitlin11 mzeitlin11 deleted the ref/rank_2d_dedup branch June 25, 2021 17:41
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Development

Successfully merging this pull request may close these issues.

Raise ValueError When Attempting to Rank Object Dtypes
3 participants