Skip to content

PERF (CoW): optimize equality check in groupby is_in_obj #50730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

See discussion in #49450

TODO:

  • Probably need to add more tests (ensure full helper is covered)

@jorisvandenbossche
Copy link
Member Author

cc @jbrockmendel are you aware of something similar that already exists in the code base?

@jbrockmendel
Copy link
Member

are you aware of something similar that already exists in the code base?

nope, will be nice to have

if isinstance(arr1, np.ndarray):
# slice first element -> if first element shares memory and they are
# equal length -> share exactly memory for the full length (if not
# checking first element, two arrays might overlap partially)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are they necessarily contiguous?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question (and we should cover that in tests). No, not necessarily I suppose. But the question is if that matters?

In theory they could have a first same memory point, but if not contiguous and if they have different strides, they might still have different values.
But can you get that in practice with a Series from a DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can of course easily check that the strides are equal. But is that enough guarantee?

Copy link
Member

@phofl phofl Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get there, probably also in the groupby case:

df = DataFrame({"a": list(range(100)), "b": 1})

ser = df.loc[slice(0, 50), "a"]
ser2 = df.loc[slice(0, 100, 2), "a"]
np.shares_memory(ser._values, ser2._values)

Edit: checking the strides fixes this of course

@jbrockmendel
Copy link
Member

Can of course easily check that the strides are equal. But is that enough guarantee?

IIRC I discussed something similar with @seberg a while back. I don't remember what he said, other than suggesting an approach that didn't involve checking memory.

@seberg
Copy link
Contributor

seberg commented Jan 13, 2023

I remember a brief discussion at some point, but not really what I suggested or the details...

If all you want is to check identity, then you can check the starting pointer and the stride(s). np.shares_memory(arr1[:1], arr2[:1]) is hard to confuse (possible in NumPy, but you have to construct it with will).

Maybe I noted that overlap checking is difficult (not sure how hard it is exactly in 1-D), but since you are interested in identity checking that doesn't matter.
I guess I always imagined that CoW implementations would have a view.base and weakref's back to keep track of what is shared (so the parent knows all its views), but I don't actually know anything about CoW implementations, so :)...

@jorisvandenbossche
Copy link
Member Author

I guess I always imagined that CoW implementations would have a view.base and weakref's back to keep track of what is shared (so the parent knows all its views),

That's exactly what we do in our CoW implementation in the internals, but, here it's some special case in a groupby code path where we previously relied on actual (python object level) identity, which no longer works with CoW enabled (getting a column twice no longer gives python-identical objects). So I was trying to find some way to see if two objects had exactly the same memory as alternative check (although not exactly equivalent, as an actual view would share the same memory but not be identical).

But now you say this, I actually didn't think of that we could use this base/refs information ... Thanks for the insight! :)

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 13, 2023
@jorisvandenbossche
Copy link
Member Author

Closing in favor of #51442, which essentially does the last suggestion here (using the ref information to know if they are views of each other)

@jorisvandenbossche jorisvandenbossche deleted the cow-groupby-optimize-equality-check branch February 17, 2023 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants