-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: faster groupby diff #45575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: faster groupby diff #45575
Conversation
# int8 and int16 are coerced to float32 rather than float64. | ||
dtypes_to_f32 = ["int8", "int16"] | ||
if obj.ndim == 1: | ||
if obj.dtype in dtypes_to_f32: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will only work on the series case. why is this necessary? i think upcasting is fine here (likey what we do now)
pandas/core/groupby/groupby.py
Outdated
if obj.dtype in dtypes_to_f32: | ||
shifted = shifted.astype("float32") | ||
else: | ||
mask = obj.dtypes.astype(str).isin(dtypes_to_f32).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm what is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the existing groupby diff behavior upcast int8 and int16 to float32, not float64. However, groupby shift upcasts to float64. See #45562 for more discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its fine to break it here (and change the tests) upcatsing to float32 is weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do. That means groupby.diff will upcast differently vs Series.diff, but that can be discussed in #45562.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no you would fix that too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see we did that because of some cython issue. ok then let's preserve it (you will need an explict test for this ) and i don't think what you are doing is actually correct for frames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed the type coercion and added an explicit test.
Might be worth deciding on #45609 before proceeding here. |
ok let's use this PR. can you rebase. we have sufficient asv's here (for the small dtypes as well)? |
pandas/core/groupby/groupby.py
Outdated
@final | ||
@Substitution(name="groupby") | ||
@Appender(_common_see_also) | ||
def diff(self, periods=1, axis=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type these args & return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typing added
Added int16 to the GroupByMethods asv. Shows a similar improvement as other dtypes:
|
https://github.com/pandas-dev/pandas/runs/5346468863?check_suite_focus=true the asv benches are failing, can you take a look |
@jreback - green, asv issues fixed |
thanks @lukemanley very nice! |
@lukemanley running asvs comparing main (as of yesterday) to v.1.4.0 i'm seeing some pretty big slowdowns in groupby.diff. For whatever reason I don't see them here (https://pandas.pydata.org/speed/pandas/index.html#regressions?sort=3&dir=desc). Any idea what might have happened? |
hmmm... I still see a significant speedup when comparing main to the latest 1.4.x commit. I just ran the following:
|
Can you compare against v1.4.0 instead of 1.4.x? If the slowdown is between 1.4.0 and 1.4.x then it must be for a bugfix, in which case we can ignore it. |
@jbrockmendel - seeing a similar speedup vs. 1.4.0 as well:
|
thanks. i found a slowdown on my linux machine, will try again on mac and see if i can figure out whats going on |
As @jreback suggests in the original issue, there is a large perf improvement to groupby.diff if implemented directly as a groupby method.
The int8/int16 type coercion feels a bit odd, but was needed to retain existing behavior as well as pass existing unit tests that explicitly test those dtypes. Separate issue to discuss that: #45562