API: how to create a "shallow copy" of a DataFrame? #29309
Comments
So a question could be: should we make this always create a new block, to achieve consistent behaviour? (but of course, this will give a performance degradation in certain cases, as a new consolidation will happen) Another observation: for extension dtypes, even if it is the same dtype, it always seems to replace the block and not update the values in place:
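A minimal sketch of the extension-dtype observation, assuming the nullable `Int64` dtype as a stand-in for extension dtypes in general (the frame and the names `df` and `shallow` are illustrative, not taken from the thread):

```python
import pandas as pd

# Hypothetical frame with a single extension-dtype column.
df = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="Int64")})
shallow = df.copy(deep=False)

# Overwrite the column with new values of the *same* extension dtype.
shallow["a"] = pd.array([10, 20, 30], dtype="Int64")

# As described above, the block is replaced rather than updated in place,
# so the original frame should keep its old values.
print(df["a"].tolist())
print(shallow["a"].tolist())
```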
I don't know about performance implications, but it seems reasonable that you should create a new block every time you assign to an existing column. I think it's best to have a consistent behavior even if that behavior might be slightly less expected.
I agree that consistent behaviour would be nice. But I am not sure offhand what backward compatibility problems this might cause. It's already good that the extension dtypes follow the desired behaviour (maybe we should add an explicit test for that).
Adding a test sounds like a good idea :)
How can you create a copy of the DataFrame without copying the actual data, but having a new DataFrame that when updated (not in place) does not modify the original ("shallow copy")? And how is this expected to behave?
I suppose that in technical terms, this would be a new BlockManager that references the same arrays?
I ran into the above questions and didn't know a clear answer. The context was: I wanted to replace one column of a DataFrame without modifying the original one. So I was wondering if I could do that without making a full copy of the DataFrame (in theory a full copy is not needed, as I just wanted to update one object column before serializing).
So you can do something like this with `copy(deep=False)`. Let's explore this somewhat:

Making a normal (deep) and shallow copy:
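A minimal sketch of such a pair of copies, assuming a small all-integer frame (the names `df`, `deep` and `shallow` are illustrative). `np.shares_memory` is used here only to check whether a copy still points at the same underlying array as the original:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with two integer columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

deep = df.copy()               # deep=True is the default: data is copied
shallow = df.copy(deep=False)  # new DataFrame object, same underlying data

# Compare the memory behind column "a" in each copy with the original.
deep_shares = np.shares_memory(df["a"].to_numpy(), deep["a"].to_numpy())
shallow_shares = np.shares_memory(df["a"].to_numpy(), shallow["a"].to_numpy())

print("deep copy shares memory:", deep_shares)
print("shallow copy shares memory:", shallow_shares)
```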
Modifying values in place works as expected: for the copy it does not change the original df, for the shallow copy it does:
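A sketch of that in-place modification, assuming positional assignment through `.iloc` on a hypothetical integer frame. Whether the change made through the shallow copy shows up in the original depends on the pandas version: it does in the behaviour described here, but likely not when Copy-on-Write is enabled, so the code prints the result instead of assuming it:

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
deep = df.copy()
shallow = df.copy(deep=False)

# Modify one value through the deep copy: the original must stay untouched.
deep.iloc[0, 0] = 100
deep_leaked = bool(df.iloc[0, 0] == 100)

# Modify one value through the shallow copy: with shared data and no
# Copy-on-Write, the original is expected to change as well.
shallow.iloc[0, 0] = 200
shallow_leaked = bool(df.iloc[0, 0] == 200)

print("change via deep copy visible in df:", deep_leaked)
print("change via shallow copy visible in df:", shallow_leaked)
```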
Overwriting a full column, however, becomes more tricky (due to our BlockManager ...):
This is of course somewhat expected if you know the internals: if the new column is of the same dtype, it seems to modify the array of the block in place, while if it needs to create a new block (because the dtype changed on assignment), the reference with the old data is broken and it doesn't modify the original dataframe.
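The two cases can be sketched as follows, assuming a hypothetical integer frame where column `a` is overwritten with integers (same dtype) and column `b` with strings (different dtype). The outcome for `a` depends on the pandas version, so the frames are printed for inspection. Only the dtype-changing case is certain to leave the original alone, since it cannot reuse the existing integer array:

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
shallow = df.copy(deep=False)

# Same dtype: in the behaviour described above, the shared array is
# updated in place, so df["a"] may change too.
shallow["a"] = [10, 20, 30]

# Different dtype: a new block has to be created, so the link to the
# original data is broken and df["b"] keeps its values.
shallow["b"] = ["x", "y", "z"]

print(df)
print(shallow)
```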
While writing this down, I am realizing that my question is maybe more: should assigning a column (`df['a'] = ..`) be seen as an in-place modification of your dataframe that has impact through shallow copies?

Because in reality, assigning to `df['a']` cannot always happen in place (if you are overwriting with a different dtype), this gives rather inconsistent and surprising behaviour depending on the dtypes.