Skip to content

API / CoW: DataFrame(<dict of Series>, copy=False) constructor now gives lazy copy #50777

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

xref #48998 (bullet point about constructors)

When constructing a DataFrame from Series objects, and when specifying copy=False with CoW enabled, we need to follow the lazy copy with CoW approach to ensure proper copy/view behaviour with the original Series objects.

We could in theory change the default of copy=True (when data passed to the constructor is a dict) to copy=False when CoW is enabled, and this would retain the behaviour as-if the resulting DataFrame is a copy. But, I assume that the major reason for this default is to ensure the resulting DataFrame is consolidated and not consisting of all 1-column blocks. So for that reason I left the default as is for now.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review January 17, 2023 10:08
if hval is val:
hval = val.copy(deep=False)
homogenized.append(hval._values)
parents.append(hval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this only relevant if reindex did not make a copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but in create_block_manager_from_column_arrays (in managers.py) I will only actually use this parent if it has a ref itself, and so if reindex did a copy here, this Series will not have any refs, and it will not be tracked as parent in the end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True this would work, but the following does not work:

val = Series([1, 2, 3.0])
df = pd.DataFrame({"a": val}, dtype="int64", index=[0, 1, 2])


assert not np.shares_memory(get_array(df, "a"), val._values)

x = get_array(df, "a")
df.iloc[0, 0] = 100
assert np.shares_memory(get_array(df, "a"), x)

In this scenario astype makes a copy while reindex does not. Hence, hval has a ref to the object created from astype which is inherited by the constructed element that is our result. Consequentially, iloc triggers an unnecessary copy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, if you have both a reindex and astype, that makes things a bit more complicated ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated for this by checking in each step if an actual copy was made, and then if any of the steps did make a copy, we don't add the Series to the parents (thus, also if the first astype step copied, and then the next reindex step did not copy, we wouldn't add the parent, and not trigger a CoW unnecessarily)

And added a test that covers your example above (that was failing before the fix)

@jorisvandenbossche
Copy link
Member Author

@phofl this should be ready now

(the current CI failures are unrelated; the arm one also fails on main)

@phofl
Copy link
Member

phofl commented Jan 24, 2023

This is a bit annoying, simply checking for new_values is old_values is not sufficient (same problem I am having in astype).

val = Series([1, 2, 3])
df = pd.DataFrame({"a": val}, dtype="Int64", index=[0, 1, 2])

x = get_array(df, "a")
df.iloc[0, 0] = 100

The is comparison is False in this case but the data aren't actually copied, hence val and df is updated with the last statement. I guess we have to create a helper function for astype doing a more sophisticated check...

Edit: Can you add this as a test case as well?

@jorisvandenbossche
Copy link
Member Author

Added an xfailed test case for constructing with dtype="Int64", assuming that #50802 will fix this.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small comments

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments for the tests

)

# the shallow copy still shares memory
assert np.shares_memory(get_array(result, "a"), s1.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use get_array here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also below

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, lgtm. Good to merge on greenish

@jorisvandenbossche jorisvandenbossche merged commit d458291 into pandas-dev:main Feb 10, 2023
@jorisvandenbossche jorisvandenbossche deleted the cow-dataframe-constructor-from-series branch February 10, 2023 17:56
phofl pushed a commit to phofl/pandas that referenced this pull request Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Copy / view semantics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants