-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API / CoW: DataFrame(<dict of Series>, copy=False) constructor now gives lazy copy #50777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API / CoW: DataFrame(<dict of Series>, copy=False) constructor now gives lazy copy #50777
Conversation
if hval is val: | ||
hval = val.copy(deep=False) | ||
homogenized.append(hval._values) | ||
parents.append(hval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this only relevant if reindex did not make a copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in create_block_manager_from_column_arrays
(in managers.py) I will only actually use this parent if it has a ref itself, and so if reindex
did a copy here, this Series will not have any refs, and it will not be tracked as parent in the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True this would work, but the following does not work:
val = Series([1, 2, 3.0])
df = pd.DataFrame({"a": val}, dtype="int64", index=[0, 1, 2])
assert not np.shares_memory(get_array(df, "a"), val._values)
x = get_array(df, "a")
df.iloc[0, 0] = 100
assert np.shares_memory(get_array(df, "a"), x)
In this scenario astype makes a copy while reindex does not. Hence, hval has a ref to the object created from astype which is inherited by the constructed element that is our result. Consequentially, iloc triggers an unnecessary copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yes, if you have both a reindex and astype, that makes things a bit more complicated ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated for this by checking in each step if an actual copy was made, and then if any of the steps did make a copy, we don't add the Series to the parents (thus, also if the first astype step copied, and then the next reindex step did not copy, we wouldn't add the parent, and not trigger a CoW unnecessarily)
And added a test that covers your example above (that was failing before the fix)
@phofl this should be ready now (the current CI failures are unrelated; the arm one also fails on main) |
This is a bit annoying, simply checking for new_values is old_values is not sufficient (same problem I am having in astype).
The is comparison is False in this case but the data aren't actually copied, hence val and df is updated with the last statement. I guess we have to create a helper function for astype doing a more sophisticated check... Edit: Can you add this as a test case as well? |
Added an xfailed test case for constructing with dtype="Int64", assuming that #50802 will fix this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of small comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments for the tests
) | ||
|
||
# the shallow copy still shares memory | ||
assert np.shares_memory(get_array(result, "a"), s1.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use get_array here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, lgtm. Good to merge on greenish
xref #48998 (bullet point about constructors)
When constructing a DataFrame from Series objects, and when specifying
copy=False
with CoW enabled, we need to follow the lazy copy with CoW approach to ensure proper copy/view behaviour with the original Series objects.We could in theory change the default of
copy=True
(when data passed to the constructor is a dict) tocopy=False
when CoW is enabled, and this would retain the behaviour as-if the resulting DataFrame is a copy. But, I assume that the major reason for this default is to ensure the resulting DataFrame is consolidated and not consisting of all 1-column blocks. So for that reason I left the default as is for now.doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.