API: how to create a "shallow copy" of a DataFrame? #29309

Open
jorisvandenbossche opened this issue Oct 31, 2019 · 4 comments

Comments

@jorisvandenbossche
Member

How can you create a copy of a DataFrame without copying the actual data, so that you get a new DataFrame which, when updated (not in place), does not modify the original (a "shallow copy")? And how is this expected to behave?
I suppose that in technical terms, this would be a new BlockManager that references the same arrays?

I ran into the above questions and actually didn't know a clear answer. The context was: I wanted to replace one column of a DataFrame, but without modifying the original one. And so I was wondering if I could do that without making a full copy of the DataFrame (as in theory this is not needed, and I just wanted to update one object column before serializing).
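As an aside, for this narrow use case (swapping out one column without mutating the original), a minimal sketch using DataFrame.assign works: assign always returns a new DataFrame and leaves the original untouched (whether the untouched columns are physically copied internally depends on the pandas version and Copy-on-Write mode):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

# assign returns a new DataFrame with column 'a' replaced;
# the original df is left untouched.
df2 = df.assign(a=['x', 'y', 'z'])

print(df['a'].tolist())   # [1, 2, 3]
print(df2['a'].tolist())  # ['x', 'y', 'z']
```

Note that assign does not answer the shallow-copy question itself, since it may still copy the remaining columns internally.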


So you can do something like this with copy(deep=False). Let's explore this somewhat:

Making a normal (deep) and shallow copy:

In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]}) 

In [2]: df_copy = df.copy() 

In [3]: df_shallow = df.copy(deep=False)

Modifying values in place works as expected: for the copy it does not change the original df, for the shallow copy it does:

In [4]: df_copy.iloc[0,0] = 10  

In [5]: df_shallow.iloc[1,0] = 20  

In [6]: df    
Out[6]: 
    a    b
0   1  0.1
1  20  0.2
2   3  0.3
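One way to make the sharing visible is to probe memory overlap of the underlying arrays with np.shares_memory (a standalone sketch starting from a fresh frame; the exact semantics of deep=False may vary with Copy-on-Write in later pandas versions, but right after the copy, before any writes, the arrays are shared):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df_copy = df.copy()               # deep copy: new arrays
df_shallow = df.copy(deep=False)  # shallow copy: same arrays

# The deep copy owns its own data; the shallow copy references df's arrays.
print(np.shares_memory(df['a'].to_numpy(), df_copy['a'].to_numpy()))     # False
print(np.shares_memory(df['a'].to_numpy(), df_shallow['a'].to_numpy()))  # True
```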

Overwriting a full column, however, is trickier (due to our BlockManager ...):

# this updates the original df
In [7]: df_shallow['a'] = [10, 20, 30] 

In [8]: df
Out[8]: 
    a    b
0  10  0.1
1  20  0.2
2  30  0.3

# this does not update the original
In [9]: df_shallow['b'] = [100, 200, 300]  

In [10]: df_shallow  
Out[10]: 
    a    b
0  10  100
1  20  200
2  30  300

In [11]: df  
Out[11]: 
    a    b
0  10  0.1
1  20  0.2
2  30  0.3

This is of course somewhat expected if you know the internals: if the new column has the same dtype, the assignment seems to modify the block's array in place, while if a new block needs to be created (because the dtype changed on assignment), the reference to the old data is broken and the original DataFrame is not modified.

While writing this down, I am realizing that my question is maybe more: should assigning a column (df['a'] = ..) be seen as an in-place modification of your dataframe that has an impact through shallow copies?
Because in reality, df['a'] = .. cannot always happen in place (if you are overwriting with a different dtype), so this gives rather inconsistent and surprising behaviour depending on the dtypes.
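The dtype dependence can also be probed directly (a sketch; whether the same-dtype assignment writes through to the original is version dependent, since Copy-on-Write changes those semantics, so this only checks the dtype-changing case, which always has to allocate a new array):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df_shallow = df.copy(deep=False)

# Overwriting float column 'b' with integers changes the dtype, so a new
# block has to be allocated and the link to df's array is broken.
df_shallow['b'] = [100, 200, 300]

print(df_shallow['b'].dtype.kind)  # 'i' -- integer now, no longer float
print(np.shares_memory(df['b'].to_numpy(), df_shallow['b'].to_numpy()))  # False
print(df['b'].tolist())  # [0.1, 0.2, 0.3] -- original untouched
```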

@jorisvandenbossche
Member Author

Because in reality, df['a'] cannot always happen in place (if you are overwriting with a different dtype), this gives rather inconsistent and surprising behaviour depending on the dtypes.

So a question could be: should we make this always create a new block, to achieve consistent behaviour? (but of course, this would give a performance degradation in certain cases, as a new consolidation would happen)

Another observation: for extension dtypes, even if it is the same dtype, it always seems to replace the block and not update the values in place:

In [50]: df = pd.DataFrame({'a': pd.array([1, 2, 3], dtype='Int64'), 'b': [.1, .2, .3]})  

In [51]: df_shallow = df.copy(deep=False)  

In [52]: df_shallow['a'] = pd.array([10, 20, 30], dtype='Int64')   

In [53]: df   
Out[53]: 
   a    b
0  1  0.1
1  2  0.2
2  3  0.3

@jsignell
Contributor

jsignell commented Feb 4, 2020

I don't know about performance implications, but it seems reasonable that you should create a new block every time you assign to an existing column. I think it's best to have a consistent behavior even if that behavior might be slightly less expected.

@jorisvandenbossche
Member Author

I agree that consistent behaviour would be nice. But I am not directly sure what possible backward compatibility problems there might be.

It's already good that the extension dtypes follow the desired behaviour (maybe we should add an explicit test for that).
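Such a test could look roughly like this (a sketch; the test name is hypothetical, and the assertion holds both for the block-replacing behaviour shown above and under Copy-on-Write):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_setitem_extension_dtype_keeps_shallow_copy_parent_intact():
    # Assigning a same-dtype extension array to a shallow copy
    # must not modify the original DataFrame.
    df = pd.DataFrame({'a': pd.array([1, 2, 3], dtype='Int64')})
    df_shallow = df.copy(deep=False)

    df_shallow['a'] = pd.array([10, 20, 30], dtype='Int64')

    expected = pd.DataFrame({'a': pd.array([1, 2, 3], dtype='Int64')})
    assert_frame_equal(df, expected)
    assert df_shallow['a'].tolist() == [10, 20, 30]

test_setitem_extension_dtype_keeps_shallow_copy_parent_intact()
```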

@jsignell
Contributor

jsignell commented Feb 4, 2020

Adding a test sounds like a good idea :)
