API: how to create a "shallow copy" of a DataFrame? #29309

Open
jorisvandenbossche opened this issue Oct 31, 2019 · 4 comments

Comments

@jorisvandenbossche
Member

How can you create a copy of a DataFrame without copying the actual data, so that you get a new DataFrame which, when updated (not in place), does not modify the original (a "shallow copy")? And how is this expected to behave?
I suppose that in technical terms, this would be a new BlockManager that references the same arrays?

I ran into the above questions and actually didn't know a clear answer. The context was: I wanted to replace one column of a DataFrame, but without modifying the original one. And so I was wondering if I could do that without making a full copy of the DataFrame (as in theory this is not needed, and I just wanted to update one object column before serializing).
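As an aside, for this narrow use case (swapping out one column without mutating the original), a minimal sketch using DataFrame.assign works: assign always returns a new DataFrame and leaves the original untouched (whether the untouched columns are physically copied internally depends on the pandas version and Copy-on-Write mode):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

# assign returns a new DataFrame with column 'a' replaced;
# the original df is left untouched.
df2 = df.assign(a=['x', 'y', 'z'])

print(df['a'].tolist())   # [1, 2, 3]
print(df2['a'].tolist())  # ['x', 'y', 'z']
```

Note that assign does not answer the shallow-copy question itself, since it may still copy the remaining columns internally.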


So you can do something like this with copy(deep=False). Let's explore this somewhat:

Making a normal (deep) and shallow copy:

In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]}) 

In [2]: df_copy = df.copy() 

In [3]: df_shallow = df.copy(deep=False)

Modifying values in place works as expected: for the copy it does not change the original df, for the shallow copy it does:

In [4]: df_copy.iloc[0,0] = 10  

In [5]: df_shallow.iloc[1,0] = 20  

In [6]: df    
Out[6]: 
    a    b
0   1  0.1
1  20  0.2
2   3  0.3
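One way to make the sharing visible is to probe memory overlap of the underlying arrays with np.shares_memory (a standalone sketch starting from a fresh frame; the exact semantics of deep=False may vary with Copy-on-Write in later pandas versions, but right after the copy, before any writes, the arrays are shared):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df_copy = df.copy()               # deep copy: new arrays
df_shallow = df.copy(deep=False)  # shallow copy: same arrays

# The deep copy owns its own data; the shallow copy references df's arrays.
print(np.shares_memory(df['a'].to_numpy(), df_copy['a'].to_numpy()))     # False
print(np.shares_memory(df['a'].to_numpy(), df_shallow['a'].to_numpy()))  # True
```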

Overwriting a full column, however, is trickier (due to our BlockManager ...):

# this updates the original df
In [7]: df_shallow['a'] = [10, 20, 30] 

In [8]: df
Out[8]: 
    a    b
0  10  0.1
1  20  0.2
2  30  0.3

# this does not update the original
In [9]: df_shallow['b'] = [100, 200, 300]  

In [10]: df_shallow  
Out[10]: 
    a    b
0  10  100
1  20  200
2  30  300

In [11]: df  
Out[11]: 
    a    b
0  10  0.1
1  20  0.2
2  30  0.3

This is of course somewhat expected if you know the internals: if the new column has the same dtype, the assignment seems to modify the block's array in place, while if a new block needs to be created (because the dtype changed on assignment), the reference to the old data is broken and the original DataFrame is not modified.

While writing this down, I am realizing that my question is maybe more: should assigning a column (df['a'] = ..) be seen as an in-place modification of your dataframe that has an impact through shallow copies?
Because in reality, df['a'] = .. cannot always happen in place (if you are overwriting with a different dtype), so this gives rather inconsistent and surprising behaviour depending on the dtypes.
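The dtype dependence can also be probed directly (a sketch; whether the same-dtype assignment writes through to the original is version dependent, since Copy-on-Write changes those semantics, so this only checks the dtype-changing case, which always has to allocate a new array):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df_shallow = df.copy(deep=False)

# Overwriting float column 'b' with integers changes the dtype, so a new
# block has to be allocated and the link to df's array is broken.
df_shallow['b'] = [100, 200, 300]

print(df_shallow['b'].dtype.kind)  # 'i' -- integer now, no longer float
print(np.shares_memory(df['b'].to_numpy(), df_shallow['b'].to_numpy()))  # False
print(df['b'].tolist())  # [0.1, 0.2, 0.3] -- original untouched
```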

@jorisvandenbossche
Member Author

Because in reality, df['a'] cannot always happen in place (if you are overwriting with a different dtype), this gives rather inconsistent and surprising behaviour depending on the dtypes.

So a question could be: should we make this always create a new block, to achieve consistent behaviour? (but of course, this would give a performance degradation in certain cases, as a new consolidation would happen)

Another observation: for extension dtypes, even if it is the same dtype, it always seems to replace the block and not update the values in place:

In [50]: df = pd.DataFrame({'a': pd.array([1, 2, 3], dtype='Int64'), 'b': [.1, .2, .3]})  

In [51]: df_shallow = df.copy(deep=False)  

In [52]: df_shallow['a'] = pd.array([10, 20, 30], dtype='Int64')   

In [53]: df   
Out[53]: 
   a    b
0  1  0.1
1  2  0.2
2  3  0.3

@jsignell
Contributor

jsignell commented Feb 4, 2020

I don't know about performance implications, but it seems reasonable that you should create a new block every time you assign to an existing column. I think it's best to have a consistent behavior even if that behavior might be slightly less expected.

@jorisvandenbossche
Member Author

I agree that consistent behaviour would be nice. But I am not directly sure what possible backward compatibility problems there might be.

It's already good that the extension dtypes follow the desired behaviour (maybe we should add an explicit test for that).
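Such a test could look roughly like this (a sketch; the test name is hypothetical, and the assertion holds both for the block-replacing behaviour shown above and under Copy-on-Write):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_setitem_extension_dtype_keeps_shallow_copy_parent_intact():
    # Assigning a same-dtype extension array to a shallow copy
    # must not modify the original DataFrame.
    df = pd.DataFrame({'a': pd.array([1, 2, 3], dtype='Int64')})
    df_shallow = df.copy(deep=False)

    df_shallow['a'] = pd.array([10, 20, 30], dtype='Int64')

    expected = pd.DataFrame({'a': pd.array([1, 2, 3], dtype='Int64')})
    assert_frame_equal(df, expected)
    assert df_shallow['a'].tolist() == [10, 20, 30]

test_setitem_extension_dtype_keeps_shallow_copy_parent_intact()
```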

@jsignell
Contributor

jsignell commented Feb 4, 2020

Adding a test sounds like a good idea :)
