-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CoW: Avoid warning in apply for mixed dtype frame #56212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/apply.py
Outdated
if not is_view: | ||
# You really shouldn't do this... | ||
mgr.blocks[0].refs = BlockValuesRefs(mgr.blocks[0]) # type: ignore[union-attr] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we want to remove again after CoW is enforced? (i.e. it's not needed for CoW itself after 3.0)
In that case I would maybe add a TODO that this can be removed later, as from just looking at the code it is not obvious that it's only for the warning mode (like it's not behind an if check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not only for the warning mode. Apply currently trigger a copy for every row, which is something that this helps us avoid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes of course.
ser._mgr = mgr | ||
mgr.set_values(arr) | ||
object.__setattr__(ser, "_name", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't understand how this part adds an extra ref after the first row (i.e. the reason you need to reset refs
to have one ref to only itself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I assume that's because we create a shallow copy of the result:
Lines 1078 to 1084 in 7b528c9
for i, v in enumerate(series_gen): | |
# ignore SettingWithCopy here in case the user mutates | |
results[i] = self.func(v, *self.args, **self.kwargs) | |
if isinstance(results[i], ABCSeries): | |
# If we have a view on v, we need to make a copy because | |
# series_generator will swap out the underlying data | |
results[i] = results[i].copy(deep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep exactly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be cleaner to create a shallow "copy" of the Series that does not keep a reference? (I don't think we have any helper function to do that right now? except for Series(results[i]._values, copy=False, index=.., name=..)
, like what you did in the eval
PR to create Series objects without a reference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption was that this is optimized for performance and I didn’t want to screw with that. And keeping the ugly parts in one method seemed like a fine compromise
that said, the udf could also set up references that could screw us over
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn’t want to screw with that
It's already taking a shallow copy always (even when not needed), so having a specialized shallow copy without tracking refs should at best slightly improve things.
The logic about requiring this shallow copy also already lives where the series_generator is used, so I think I find it slightly clearer to consolidate that logic there
Now, actually implementing a Series._shallow_copy_without_refs()
is not fully trivial because AFAIK we can't just reuse the existing mgr and block copy
methods and reset the refs after the fact, as that then already appended the refs object which is shared with the series generator series with a new ref.
So let's consider that for a follow-up PR if I want to pursue that ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just going to update some comments here, then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'd rather not do too many helpers that make this easy since you really shouldn't be doing this
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Yikes this is ugly
xref #56019