Copy method does not make truly deep copies of dtype object arrays #12663

Closed
agartland opened this issue Mar 17, 2016 · 5 comments
Labels
Compat pandas objects compatability with Numpy or Python functions Docs Usage Question

@agartland
Contributor

The copy method of pd.Series and pd.DataFrame has a parameter, deep, which the docstring describes as "Make a deep copy, i.e. also copy data". The example below seems to show that this isn't a truly deep copy (in the sense of from copy import deepcopy). I can't seem to find the implementation in the source. Is the behavior below expected? Should something different be done for dtype=object to make it a truly deep copy? Or could we at least add a note to the documentation describing this behavior?

Thanks!

import copy

import numpy as np
import pandas as pd

# Create a Series from a NumPy array of Python objects.
# (dtype=object replaces np.object, which was removed in NumPy 1.24.)
a = pd.Series(np.array([1, 'series of objects', {'first': 3, 'second': 5}], dtype=object))
b = a.copy(deep=True)
c = copy.deepcopy(a)

# The dict has length 2
print(len(a.loc[2]), len(b.loc[2]), len(c.loc[2]))

# Remove one key from c (the copy.deepcopy)
c.loc[2].pop('first')

# Only changes c
print(len(a.loc[2]), len(b.loc[2]), len(c.loc[2]))

# Remove one key from b (the pandas deep copy)
b.loc[2].pop('first')

# Changes a and b?
print(len(a.loc[2]), len(b.loc[2]), len(c.loc[2]))
@jreback
Contributor

jreback commented Mar 17, 2016

This is not supported. pandas objects are (generally) stored in NumPy arrays, which, if they have object dtype, simply hold pointers to Python objects. Deep-copying a pandas object does not imply that the underlying NumPy array is itself deep-copied element by element (I don't even know if that is actually supported). "Deep" refers to the indexes themselves being copied.

It would be expensive to do this. I'm not really even sure of a use case for it; generally the actual data are scalar types (e.g. float, int, string), not arbitrary Python objects.

Storing Python objects here is an anti-pattern. I suppose you could add a note to the docstring.
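
The point about object-dtype arrays holding pointers can be sketched with plain NumPy. This is a minimal illustration, not code from the thread; the variable names are made up for the example. It assumes the behavior described in NumPy's documentation, where ndarray.copy() duplicates the array of references while copy.deepcopy also copies the referenced objects.

```python
import copy

import numpy as np

# An object-dtype array stores references to Python objects.
inner = {'first': 3, 'second': 5}
arr = np.array([1, 'series of objects', inner], dtype=object)

shallow = arr.copy()        # new array, same referenced objects
deep = copy.deepcopy(arr)   # new array, referenced objects copied too

print(shallow is arr)       # False: the array itself is new
print(shallow[2] is inner)  # True: the dict is shared
print(deep[2] is inner)     # False: the dict was copied
```

A pandas copy(deep=True) on object data behaves like the first case here: the container is new, but the nested Python objects are shared.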

@jreback jreback added Usage Question Compat pandas objects compatability with Numpy or Python functions labels Mar 17, 2016
@agartland
Contributor Author

Thanks, that's helpful; I understand why the copy method behaves as it does. I'm wondering if it could be made clearer in the documentation, since the word "deep" seems at least slightly ambiguous here.

I typically use a DataFrame for numbers, but sometimes I like to have a column that holds some other kind of metadata in an object. This way I get all the benefits of indexing and merging and can keep the metadata associated with the data (even if this effectively disables many of NumPy's nice features and efficiencies).

@jreback
Contributor

jreback commented Mar 17, 2016

@agartland Sure, a docstring update would be fine. You can even put your example there to make it clear (maybe slim it down a bit).

@jreback jreback added this to the Next Major Release milestone Mar 17, 2016
@agartland
Contributor Author

I added a note to the copy method docstring that should help. I didn't know where or how to add an example, but here's a slimmed-down version if you think it's useful:

import copy

import pandas as pd

a = pd.Series([1, 'a', [4, 5, 6]])
b = a.copy(deep=True)
c = copy.deepcopy(a)

print(a)

# Changes to the copy.deepcopy don't affect the original.
c.loc[2].append(0)
print(a.loc[2], b.loc[2], c.loc[2])

# Changes to the a.copy(deep=True) are reflected in the original.
b.loc[2].append(-9)
print(a.loc[2], b.loc[2], c.loc[2])

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 18, 2016
@bergkvist

@agartland For me the original is affected in both cases.
Code:

import copy
import pandas as pd

a = pd.Series([1, 'a', [4,5,6]])
b = a.copy(deep=True)
c = copy.deepcopy(a)

print(a)

"""Changes to the copy.deepcopy don't affect the original."""
c.loc[2].append(0)
print(a.loc[2], b.loc[2], c.loc[2])

"""Changes to the a.copy(deep=True) are reflected in the original."""
b.loc[2].append(-9)
print(a.loc[2], b.loc[2], c.loc[2])

Output:

0            1
1            a
2    [4, 5, 6]
dtype: object
[4, 5, 6, 0] [4, 5, 6, 0] [4, 5, 6, 0]
[4, 5, 6, 0, -9] [4, 5, 6, 0, -9] [4, 5, 6, 0, -9]

Versions:

  • pandas: 0.25.0
  • Python: 3.7.3
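
For readers who do need independent copies of the nested objects, one possible workaround is to deep-copy each element and build a new Series from the results. This is only a sketch and not something proposed in this thread: the helper name deepcopy_elements is hypothetical, and it assumes every element can be handled by copy.deepcopy.

```python
import copy

import pandas as pd


def deepcopy_elements(series):
    """Hypothetical helper: return a new Series whose elements are deep copies."""
    values = [copy.deepcopy(value) for value in series]
    return pd.Series(values, index=series.index.copy(),
                     name=series.name, dtype=series.dtype)


a = pd.Series([1, 'a', [4, 5, 6]])
b = a.copy(deep=True)       # new Series, but the list object is shared with a
d = deepcopy_elements(a)    # new Series holding its own copy of the list

d.loc[2].append(0)          # mutates only d's list
print(a.loc[2], d.loc[2])

b.loc[2].append(-9)         # mutates the list shared by a and b
print(a.loc[2], b.loc[2], d.loc[2])
```

The element-wise copy runs in Python rather than in NumPy, so it is slow on large Series, which matches the cost concern raised earlier in the thread.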
