Skip to content

ENH: Improve ref-tracking for group keys #51442

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 20, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 16, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

cc @jorisvandenbossche

Follow-up on #49450

The equals check causes failures in Dask. I think we are better off moving to reference checking anyway. If you are ok with the general logic here, I'd refactor this check down to the Manager level.

@rhshadrach
Copy link
Member

This would make df.groupy(df['a']) behave differently from df.groupby(df['a'].copy()), is that right? In general I'd like to move away from "in-axis" vs "out-axis" groupers behaving differently (ref: #49519), so I agree this is the right direction, but it might cause some odd edge cases to behave differently until we resolve those issues.

Still - I'm +0 here; checking equality gives some odd behaviors as well.

@phofl
Copy link
Member Author

phofl commented Feb 16, 2023

Yes those 2 behave different, but this pr establishes the same behavior as in the non-cow mode for this case. So that's actually a plus imo

except (AttributeError, KeyError, IndexError, InvalidIndexError):
return False
if isinstance(gpr, Series) and isinstance(obj_gpr_column, Series):
ref = weakref.ref(obj_gpr_column._mgr.blocks[0])
return ref in gpr._mgr.blocks[0].refs.referenced_blocks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you create multiple times a weakref to the same object, this is always the same (i.e. identical) weakref object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's how it is documented (https://docs.python.org/3/library/weakref.html#weakref.ref). Did some basic checks with out blocks and it works as I would have expected

@jorisvandenbossche
Copy link
Member

Yes, I think this is a good solution for now (essentially the same as what was suggested in #50730). And it indeed aligns with the behaviour we have for non-CoW.

Longer term, I agree with @rhshadrach that the current semantics are not really great. We should probably at some point move to not doing such check at all, at which point this discussion about identity vs equality also just doesn't matter anymore?

@phofl
Copy link
Member Author

phofl commented Feb 17, 2023

Yep this sounds sensible, but I think this needs a deprecation cycle?

I’ll move the check to the BlockManager then

@phofl
Copy link
Member Author

phofl commented Feb 17, 2023

Implemented it on the manager level

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two small comments to clean-up

@@ -947,9 +947,19 @@ def is_in_obj(gpr) -> bool:
# For the CoW case, we need an equality check as the identity check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -947,9 +947,19 @@ def is_in_obj(gpr) -> bool:
# For the CoW case, we need an equality check as the identity check
# no longer works (each Series from column access is a new object)
try:
return gpr.equals(obj[gpr.name])
obj_gpr_column = obj[gpr.name]
except (AttributeError, KeyError, IndexError, InvalidIndexError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the AttributeError now, I think (.name is already checked above)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, thx

@jorisvandenbossche jorisvandenbossche merged commit 8275e88 into pandas-dev:main Feb 20, 2023
@jorisvandenbossche
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants