PERF (CoW): optimize equality check in groupby is_in_obj #50730

jorisvandenbossche · 2023-01-13T13:16:38Z

See discussion in #49450

TODO:

Probably need to add more tests (ensure full helper is covered)

jorisvandenbossche · 2023-01-13T13:18:07Z

cc @jbrockmendel are you aware of something similar that already exists in the code base?

jbrockmendel · 2023-01-13T15:49:13Z

> are you aware of something similar that already exists in the code base?

nope, will be nice to have

jbrockmendel · 2023-01-13T15:49:39Z

pandas/core/groupby/grouper.py

+    if isinstance(arr1, np.ndarray):
+        # slice first element -> if first element shares memory and they are
+        # equal length -> share exactly memory for the full length (if not
+        # checking first element, two arrays might overlap partially)


are they necessarily contiguous?

That's a good question (and we should cover that in tests). No, not necessarily I suppose. But the question is if that matters?

In theory they could share the same starting memory address, but if they are not contiguous and have different strides, they might still hold different values.
But can you get that in practice with a Series from a DataFrame?

Can of course easily check that the strides are equal. But is that enough guarantee?

You can get there, probably also in the groupby case:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame({"a": list(range(100)), "b": 1})
    ser = df.loc[slice(0, 50), "a"]
    ser2 = df.loc[slice(0, 100, 2), "a"]
    np.shares_memory(ser._values, ser2._values)

Edit: checking the strides fixes this of course
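To make the pitfall concrete, here is a minimal NumPy-only sketch (it does not go through pandas, and the variable names are purely illustrative). It assumes two equal-length slices of one buffer that start at the same element but step differently, which is the situation described above:

```python
import numpy as np

# One underlying buffer; int64 is chosen explicitly so strides are predictable.
base = np.arange(100, dtype=np.int64)

a = base[0:50]       # contiguous slice of the first 50 elements
b = base[0:100:2]    # every second element; also 50 elements long

# The first elements live at the same address and the lengths match ...
first_element_shared = np.shares_memory(a[:1], b[:1])
same_length = len(a) == len(b)

# ... but the strides differ, so the two arrays do not cover the same data.
same_strides = a.strides == b.strides
values_equal = bool(np.array_equal(a, b))
```

In this sketch the first-element check and the length check both pass while the values differ, and comparing `strides` is what tells the two arrays apart.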

jbrockmendel · 2023-01-13T16:36:08Z

> Can of course easily check that the strides are equal. But is that enough guarantee?

IIRC I discussed something similar with @seberg a while back. I don't remember what he said, other than suggesting an approach that didn't involve checking memory.

seberg · 2023-01-13T19:53:58Z

I remember a brief discussion at some point, but not really what I suggested or the details...

If all you want is to check identity, then you can check the starting pointer and the stride(s). np.shares_memory(arr1[:1], arr2[:1]) is hard to confuse (it is possible in NumPy, but you have to construct it deliberately).

Maybe I noted that overlap checking is difficult (not sure how hard it is exactly in 1-D), but since you are interested in identity checking that doesn't matter.
I guess I always imagined that CoW implementations would have a view.base and weakref's back to keep track of what is shared (so the parent knows all its views), but I don't actually know anything about CoW implementations, so :)...
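As a rough illustration of the "starting pointer and stride(s)" idea, a check along these lines might look as follows. The helper name `is_same_memory` is hypothetical and is not the helper from this PR; comparing dtype and shape in addition to pointer and strides is an assumption added here so that the sketch only returns True when both arrays describe exactly the same elements:

```python
import numpy as np


def is_same_memory(arr1: np.ndarray, arr2: np.ndarray) -> bool:
    """Hypothetical helper: True if both arrays describe the same memory layout.

    Compares the start address, strides, shape and dtype instead of
    doing a general overlap check.
    """
    # __array_interface__["data"] is a (pointer, read_only) tuple.
    ptr1 = arr1.__array_interface__["data"][0]
    ptr2 = arr2.__array_interface__["data"][0]
    return (
        arr1.dtype == arr2.dtype
        and arr1.shape == arr2.shape
        and arr1.strides == arr2.strides
        and ptr1 == ptr2
    )


base = np.arange(100, dtype=np.int64)
same = is_same_memory(base[:50], base[:50])               # two views, same layout
strided = is_same_memory(base[:50], base[0:100:2])        # same start, other stride
copied = is_same_memory(base[:50], base[:50].copy())      # equal values, new buffer
```

Note that two distinct view objects over the same data compare as "the same" here, so this is a memory-identity check rather than a Python-object identity check.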

jorisvandenbossche · 2023-01-13T21:00:47Z

> I guess I always imagined that CoW implementations would have a view.base and weakref's back to keep track of what is shared (so the parent knows all its views),

That's exactly what we do in our CoW implementation in the internals, but here it's a special case in a groupby code path where we previously relied on actual (Python object level) identity, which no longer works with CoW enabled (getting a column twice no longer gives Python-identical objects). So I was trying to find a way to check whether two objects have exactly the same memory as an alternative (although that is not exactly equivalent, since an actual view would share the same memory but not be identical).

But now that you say this, I realize I hadn't thought of using this base/refs information ... Thanks for the insight! :)
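For readers unfamiliar with the base/refs idea, the following toy sketch shows how shared weak references can answer "are these two objects views of the same data?" without inspecting memory. `TrackedArray` and its methods are hypothetical and only loosely inspired by the description above; this is not how the pandas internals are written:

```python
import weakref

import numpy as np


class TrackedArray:
    """Toy wrapper (hypothetical): every view shares one set of weak references."""

    def __init__(self, values, refs=None):
        self.values = values
        # All wrappers over the same data share a single WeakSet, so the
        # parent and each view can see which other objects are still alive.
        self.refs = refs if refs is not None else weakref.WeakSet()
        self.refs.add(self)

    def view(self):
        # A new Python object each time, as with column access under CoW.
        return TrackedArray(self.values[:], refs=self.refs)

    def is_view_of(self, other):
        # Sharing the same tracker means both wrap the same underlying data.
        return self.refs is other.refs


parent = TrackedArray(np.arange(5))
col1 = parent.view()
col2 = parent.view()
unrelated = TrackedArray(np.arange(5))
```

Here `col1` and `col2` are distinct Python objects, yet `col1.is_view_of(col2)` can still recognize them as the same column, which is the property the old identity check relied on.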

github-actions · 2023-02-13T00:05:27Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jorisvandenbossche · 2023-02-17T09:07:18Z

Closing in favor of #51442, which essentially does the last suggestion here (using the ref information to know if they are views of each other)

PERF (CoW): optimize equality check in groupby is_in_obj

2bfcb96

jorisvandenbossche mentioned this pull request Jan 13, 2023

API / CoW: always return new objects for column access (don't use item_cache) #49450

Merged

4 tasks

jbrockmendel reviewed Jan 13, 2023


jorisvandenbossche mentioned this pull request Feb 8, 2023

DOC: df.groupby('A') is just syntactic sugar for df.groupby(df['A']) #51063

Closed

1 task

github-actions bot added the Stale label Feb 13, 2023

jorisvandenbossche closed this Feb 17, 2023

jorisvandenbossche deleted the cow-groupby-optimize-equality-check branch February 17, 2023 09:07

jorisvandenbossche mentioned this pull request Feb 17, 2023

ENH: Improve ref-tracking for group keys #51442

Merged

5 tasks
