Skip to content

CoW: Avoid warning in apply for mixed dtype frame #56212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Dec 4, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 27, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Yikes this is ugly

xref #56019

Comment on lines 1273 to 1275
if not is_view:
# You really shouldn't do this...
mgr.blocks[0].refs = BlockValuesRefs(mgr.blocks[0]) # type: ignore[union-attr]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we want to remove again after CoW is enforced? (i.e. it's not needed for CoW itself after 3.0)
In that case I would maybe add a TODO that this can be removed later, as from just looking at the code it is not obvious that it's only for the warning mode (like it's not behind an if check)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not only for the warning mode. Apply currently trigger a copy for every row, which is something that this helps us avoid

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes of course.

Comment on lines 1270 to 1272
ser._mgr = mgr
mgr.set_values(arr)
object.__setattr__(ser, "_name", name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't understand how this part adds an extra ref after the first row (i.e. the reason you need to reset refs to have one ref to only itself)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I assume that's because we create a shallow copy of the result:

pandas/pandas/core/apply.py

Lines 1078 to 1084 in 7b528c9

for i, v in enumerate(series_gen):
# ignore SettingWithCopy here in case the user mutates
results[i] = self.func(v, *self.args, **self.kwargs)
if isinstance(results[i], ABCSeries):
# If we have a view on v, we need to make a copy because
# series_generator will swap out the underlying data
results[i] = results[i].copy(deep=False)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep exactly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner to create a shallow "copy" of the Series that does not keep a reference? (I don't think we have any helper function to do that right now? except for Series(results[i]._values, copy=False, index=.., name=..), like what you did in the eval PR to create Series objects without a reference)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was that this is optimized for performance and I didn’t want to screw with that. And keeping the ugly parts in one method seemed like a fine compromise

that said, the udf could also set up references that could screw us over

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t want to screw with that

It's already taking a shallow copy always (even when not needed), so having a specialized shallow copy without tracking refs should at best slightly improve things.

The logic about requiring this shallow copy also already lives where the series_generator is used, so I think I find it slightly clearer to consolidate that logic there

Now, actually implementing a Series._shallow_copy_without_refs() is not fully trivial because AFAIK we can't just reuse the existing mgr and block copy methods and reset the refs after the fact, as that then already appended the refs object which is shared with the series generator series with a new ref.
So let's consider that for a follow-up PR if I want to pursue that ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just going to update some comments here, then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'd rather not do too many helpers that make this easy since you really shouldn't be doing this

@jorisvandenbossche jorisvandenbossche merged commit 3395ca3 into pandas-dev:main Dec 4, 2023
@phofl phofl deleted the apply_cow_warning branch December 5, 2023 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants