-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF (CoW): optimize equality check in groupby is_in_obj #50730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF (CoW): optimize equality check in groupby is_in_obj #50730
Conversation
cc @jbrockmendel are you aware of something similar that already exists in the code base? |
nope, will be nice to have |
if isinstance(arr1, np.ndarray): | ||
# slice first element -> if first element shares memory and they are | ||
# equal length -> share exactly memory for the full length (if not | ||
# checking first element, two arrays might overlap partially) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are they necessarily contiguous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question (and we should cover that in tests). No, not necessarily I suppose. But the question is if that matters?
In theory they could have a first same memory point, but if not contiguous and if they have different strides, they might still have different values.
But can you get that in practice with a Series from a DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can of course easily check that the strides are equal. But is that enough guarantee?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get there, probably also in the groupby case:
df = DataFrame({"a": list(range(100)), "b": 1})
ser = df.loc[slice(0, 50), "a"]
ser2 = df.loc[slice(0, 100, 2), "a"]
np.shares_memory(ser._values, ser2._values)
Edit: checking the strides fixes this of course
IIRC I discussed something similar with @seberg a while back. I don't remember what he said, other than suggesting an approach that didn't involve checking memory. |
I remember a brief discussion at some point, but not really what I suggested or the details... If all you want is to check identity, then you can check the starting pointer and the stride(s). Maybe I noted that overlap checking is difficult (not sure how hard it is exactly in 1-D), but since you are interested in identity checking that doesn't matter. |
That's exactly what we do in our CoW implementation in the internals, but, here it's some special case in a groupby code path where we previously relied on actual (python object level) identity, which no longer works with CoW enabled (getting a column twice no longer gives python-identical objects). So I was trying to find some way to see if two objects had exactly the same memory as alternative check (although not exactly equivalent, as an actual view would share the same memory but not be identical). But now you say this, I actually didn't think of that we could use this base/refs information ... Thanks for the insight! :) |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Closing in favor of #51442, which essentially does the last suggestion here (using the ref information to know if they are views of each other) |
See discussion in #49450
TODO: