Skip to content

PERF: faster groupby diff #45575

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 28, 2022
Merged

Conversation

lukemanley
Copy link
Member

As @jreback suggests in the original issue, there is a large perf improvement to groupby.diff if implemented directly as a groupby method.

The int8/int16 type coercion feels a bit odd, but was needed to retain existing behavior as well as pass existing unit tests that explicitly test those dtypes. Separate issue to discuss that: #45562

       before           after         ratio
     [2db3b0a0]       [6edfb839]
                      <faster-groupby-diff>
-     1.98±0.02ms      1.35±0.02ms     0.68  groupby.GroupByMethods.time_dtype_as_group('object', 'diff', 'transformation', 5)
-         799±9μs          472±6μs     0.59  groupby.GroupByMethods.time_dtype_as_group('object', 'diff', 'transformation', 1)
-         461±5μs        166±0.8μs     0.36  groupby.GroupByMethods.time_dtype_as_group('object', 'diff', 'direct', 1)
-        94.0±4ms       2.22±0.9ms     0.02  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 5)
-        69.1±1ms      1.51±0.02ms     0.02  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'transformation', 5)
-        69.3±2ms      1.48±0.01ms     0.02  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'transformation', 5)
-        78.5±4ms      1.45±0.01ms     0.02  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 5)
-        94.0±6ms      1.46±0.01ms     0.02  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'transformation', 5)
-         122±3ms      1.55±0.01ms     0.01  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'transformation', 5)
-         140±2ms         1.45±0ms     0.01  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'transformation', 5)
-         140±2ms      1.41±0.02ms     0.01  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 5)
-      66.0±0.7ms          502±5μs     0.01  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'transformation', 1)
-      64.3±0.5ms         488±10μs     0.01  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'transformation', 1)
-         106±8ms        767±200μs     0.01  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 1)
-        75.9±2ms         497±20μs     0.01  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 1)
-        89.8±3ms        587±200μs     0.01  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 5)
-        88.5±1ms          526±5μs     0.01  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'direct', 5)
-         102±2ms         552±10μs     0.01  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'transformation', 1)
-         120±2ms          614±4μs     0.01  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'transformation', 1)
-       161±0.6ms          604±7μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'transformation', 1)
-         138±3ms          500±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'direct', 5)
-         139±3ms          498±5μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'direct', 5)
-         165±1ms          502±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 1)
-        66.5±1ms          192±3μs     0.00  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'direct', 1)
-        71.2±1ms          201±3μs     0.00  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'direct', 5)
-         124±3ms          326±2μs     0.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 5)
-         119±2ms          306±3μs     0.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 1)
-        78.6±1ms          196±5μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 5)
-      70.1±0.8ms          170±5μs     0.00  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'direct', 5)
-        67.4±4ms          159±2μs     0.00  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'direct', 1)
-        78.4±1ms          183±2μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 1)
-         108±8ms         249±80μs     0.00  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 1)
-         102±3ms          182±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'direct', 1)
-         160±2ms          170±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'direct', 1)
-         165±3ms          171±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'direct', 1)

# int8 and int16 are coerced to float32 rather than float64.
dtypes_to_f32 = ["int8", "int16"]
if obj.ndim == 1:
if obj.dtype in dtypes_to_f32:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will only work on the series case. why is this necessary? i think upcasting is fine here (likey what we do now)

if obj.dtype in dtypes_to_f32:
shifted = shifted.astype("float32")
else:
mask = obj.dtypes.astype(str).isin(dtypes_to_f32).values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm what is this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the existing groupby diff behavior upcast int8 and int16 to float32, not float64. However, groupby shift upcasts to float64. See #45562 for more discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its fine to break it here (and change the tests) upcatsing to float32 is weird

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do. That means groupby.diff will upcast differently vs Series.diff, but that can be discussed in #45562.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no you would fix that too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see we did that because of some cython issue. ok then let's preserve it (you will need an explict test for this ) and i don't think what you are doing is actually correct for frames

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed the type coercion and added an explicit test.

@jreback jreback added Groupby Performance Memory or execution speed performance labels Jan 23, 2022
@lukemanley
Copy link
Member Author

Might be worth deciding on #45609 before proceeding here.

@jreback jreback added this to the 1.5 milestone Jan 27, 2022
@jreback
Copy link
Contributor

jreback commented Feb 26, 2022

ok let's use this PR. can you rebase. we have sufficient asv's here (for the small dtypes as well)?

@final
@Substitution(name="groupby")
@Appender(_common_see_also)
def diff(self, periods=1, axis=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type these args & return

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typing added

@lukemanley
Copy link
Member Author

ok let's use this PR. can you rebase. we have sufficient asv's here (for the small dtypes as well)?

Added int16 to the GroupByMethods asv. Shows a similar improvement as other dtypes:

       before           after         ratio
     [19aa5ab5]       [c4459b4e]
     <main>           <faster-groupby-diff>
-      77.7±0.5ms      1.60±0.02ms     0.02  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'transformation', 5)
-        88.3±1ms      1.33±0.07ms     0.02  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 5)
-        74.8±1ms          601±3μs     0.01  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'transformation', 1)
-         109±1ms         577±10μs     0.01  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 1)
-      79.2±0.9ms          308±3μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'direct', 5)
-      75.6±0.9ms          288±6μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'direct', 1)
-        89.6±2ms          322±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 5)
-         109±2ms          192±7μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 1)

@jreback
Copy link
Contributor

jreback commented Feb 27, 2022

https://github.com/pandas-dev/pandas/runs/5346468863?check_suite_focus=true the asv benches are failing, can you take a look

@lukemanley
Copy link
Member Author

@jreback - green, asv issues fixed

@jreback jreback merged commit e70d310 into pandas-dev:main Feb 28, 2022
@jreback
Copy link
Contributor

jreback commented Feb 28, 2022

thanks @lukemanley very nice!

@lukemanley lukemanley deleted the faster-groupby-diff branch March 2, 2022 01:12
@jbrockmendel
Copy link
Member

@lukemanley running asvs comparing main (as of yesterday) to v.1.4.0 i'm seeing some pretty big slowdowns in groupby.diff. For whatever reason I don't see them here (https://pandas.pydata.org/speed/pandas/index.html#regressions?sort=3&dir=desc). Any idea what might have happened?

@lukemanley
Copy link
Member Author

hmmm... I still see a significant speedup when comparing main to the latest 1.4.x commit. I just ran the following:

asv continuous -f 1.1 upstream/1.4.x upstream/main -b groupby.GroupByMethods
       before           after         ratio
     [21de060f]       [d86e2006]
                      <main>    
-      75.4±0.8ms      1.52±0.01ms     0.02  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'transformation', 5)
-        78.6±4ms      1.51±0.02ms     0.02  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'transformation', 5)
-       89.1±10ms      1.65±0.01ms     0.02  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'transformation', 5)
-        93.4±1ms      1.35±0.01ms     0.01  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 5)
-        97.0±1ms      1.38±0.05ms     0.01  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'transformation', 5)
-        115±40ms      1.55±0.05ms     0.01  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 5)
-        100±10ms      1.31±0.01ms     0.01  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 5)
-         133±2ms      1.67±0.03ms     0.01  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'transformation', 5)
-         148±2ms      1.37±0.01ms     0.01  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'transformation', 5)
-         144±1ms      1.27±0.02ms     0.01  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 5)
-        70.7±1ms          550±6μs     0.01  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'transformation', 1)
-       88.1±20ms         623±10μs     0.01  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'transformation', 1)
-        76.5±5ms         510±10μs     0.01  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'transformation', 1)
-         109±2ms         582±80μs     0.01  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 1)
-         113±3ms          589±9μs     0.01  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 1)
-         131±1ms         675±20μs     0.01  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'transformation', 1)
-         117±7ms          586±6μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'transformation', 1)
-        112±40ms          522±5μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 1)
-         172±2ms          639±7μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'transformation', 1)
-        95.2±1ms         349±10μs     0.00  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 5)
-      94.4±0.8ms          343±1μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'direct', 5)
-       90.0±20ms          317±3μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'direct', 5)
-        99.4±7ms          328±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 5)
-         172±2ms          533±9μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 1)
-       94.4±20ms          285±7μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'direct', 1)
-      72.9±0.7ms          207±1μs     0.00  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'direct', 5)
-      68.9±0.5ms          191±4μs     0.00  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'direct', 1)
-         134±1ms          367±7μs     0.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 5)
-         133±3ms          329±4μs     0.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 1)
-         145±1ms          325±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'direct', 5)
-         145±5ms          322±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'direct', 5)
-        82.2±3ms          177±5μs     0.00  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'direct', 5)
-        76.4±5ms        163±0.9μs     0.00  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'direct', 1)
-        102±40ms          202±1μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 5)
-         107±2ms         203±20μs     0.00  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 1)
-       107±0.7ms          188±1μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'direct', 1)
-         113±5ms          186±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 1)
-        127±60ms         190±20μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 1)
-         176±9ms          183±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'direct', 1)
-         171±2ms          176±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'direct', 1)

@jbrockmendel
Copy link
Member

Can you compare against v1.4.0 instead of 1.4.x? If the slowdown is between 1.4.0 and 1.4.x then it must be for a bugfix, in which case we can ignore it.

@lukemanley
Copy link
Member Author

@jbrockmendel - seeing a similar speedup vs. 1.4.0 as well:

       before           after         ratio
     [bb1f6515]       [d86e2006]
     <v1.4.0^0>       <main>    
-        75.3±2ms       2.18±0.7ms     0.03  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'transformation', 5)
-        80.4±1ms       1.98±0.5ms     0.02  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'transformation', 5)
-        79.6±2ms       1.73±0.2ms     0.02  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'transformation', 5)
-        81.5±3ms       1.75±0.2ms     0.02  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 5)
-        94.7±3ms       1.43±0.1ms     0.02  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'transformation', 5)
-        94.2±2ms      1.31±0.02ms     0.01  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 5)
-         133±2ms       1.79±0.1ms     0.01  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'transformation', 5)
-        95.2±5ms      1.28±0.02ms     0.01  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 5)
-         149±8ms      1.41±0.03ms     0.01  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'transformation', 5)
-      76.7±0.8ms        705±100μs     0.01  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'transformation', 1)
-         148±8ms      1.24±0.01ms     0.01  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 5)
-        74.6±2ms        627±200μs     0.01  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 1)
-        73.7±1ms         548±40μs     0.01  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'transformation', 1)
-        73.5±2ms          543±8μs     0.01  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'transformation', 1)
-       111±0.7ms         641±70μs     0.01  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'transformation', 1)
-         129±3ms         698±40μs     0.01  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'transformation', 1)
-         110±4ms         588±20μs     0.01  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 1)
-         112±2ms          571±5μs     0.01  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 1)
-        76.1±3ms         334±70μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'direct', 1)
-        79.3±3ms         343±30μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int16', 'diff', 'direct', 5)
-        92.7±2ms         379±50μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'direct', 5)
-         173±2ms         641±20μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'transformation', 1)
-        92.2±3ms          341±5μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 5)
-      92.6±0.7ms         342±20μs     0.00  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 5)
-         135±4ms         405±40μs     0.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 5)
-         181±2ms          537±8μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 1)
-        75.1±1ms         219±30μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 1)
-        80.5±3ms         226±30μs     0.00  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 5)
-        77.2±5ms          216±3μs     0.00  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'direct', 5)
-         130±4ms         346±40μs     0.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 1)
-        74.6±3ms          192±2μs     0.00  groupby.GroupByMethods.time_dtype_as_field('uint', 'diff', 'direct', 1)
-        80.4±5ms         206±20μs     0.00  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'direct', 5)
-        73.3±1ms         185±20μs     0.00  groupby.GroupByMethods.time_dtype_as_field('float', 'diff', 'direct', 1)
-         148±5ms          323±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'direct', 5)
-         149±1ms          315±4μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'direct', 5)
-         112±6ms         200±20μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int', 'diff', 'direct', 1)
-         107±3ms          190±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 1)
-         112±2ms          185±1μs     0.00  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 1)
-         174±8ms          179±3μs     0.00  groupby.GroupByMethods.time_dtype_as_group('float', 'diff', 'direct', 1)
-         179±9ms          175±2μs     0.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'direct', 1)

@jbrockmendel
Copy link
Member

thanks. i found a slowdown on my linux machine, will try again on mac and see if i can figure out whats going on

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: groupby-diff
3 participants