Skip to content

Dataframe change alters original array used in creation #32960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
martinfleis opened this issue Mar 23, 2020 · 11 comments · Fixed by #38939
Closed

Dataframe change alters original array used in creation #32960

martinfleis opened this issue Mar 23, 2020 · 11 comments · Fixed by #38939
Labels
Blocker Blocking issue or pull request for an upcoming release Copy / view semantics
Milestone

Comments

@martinfleis
Copy link
Contributor

martinfleis commented Mar 23, 2020

Code Sample, a copy-pastable example if possible

a = pd.array([0, 1, 2])
df1 = pd.DataFrame({"col1": ['a', 'b', 'c'], "col2": a})
df1.loc[0, 'col2'] = 3

print(a)
<IntegerArray>
[3, 1, 2]
Length: 3, dtype: Int64

Problem description

Pandas master has this strange change in behaviour when df created from array changes the original array if the value in the related column is changed. Pandas 1.0.3 does not alter the original array. It seems that it does a copy somewhere while master does not do that anymore.

I don't think the original array should be altered. Doing the same with list or np.array does not alter the original list/array either.

Expected Output

<IntegerArray>
[0, 1, 2]
Length: 3, dtype: Int64

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d3e5485
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.3.0
Version : Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+944.gd3e54851b
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 41.6.0
Cython : 0.29.15
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@martinfleis martinfleis changed the title Dataframe change alter original array used in creation Dataframe change alters original array used in creation Mar 23, 2020
@jbrockmendel
Copy link
Member

I expect this was caused by #32831 (for which I still need to add a whatsnew note)

I think we do want a view on the original data, but you're right that being consistent is the priority. I'll look at this before long, but you're welcome to beat me to it.

@TomAugspurger
Copy link
Contributor

In the DataFrame docstring, we have

copy : bool, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.

It's a bit unclear whether having copy start applying to dict-like input is an API change or not, but it seems like getting copy to apply to all cases would be a good thing to have.

@jorisvandenbossche
Copy link
Member

We should certainly try to clear up the situation about what is the expected behaviour and test this properly.

I agree it is a good thing that we can create a DataFrame from a dict of EAs without requiring a copy.
But the question is maybe if this should be the default? We could also keep the copy behaviour by default, and allow copy=False to not copy (which means the actual default needs to change to None, as it will depend on the input).

Slightly related to #29309 as well (at least it's also about when to copy, and about a difference between ndarray and EA)

@TomAugspurger
Copy link
Contributor

We could also keep the copy behaviour by default, and allow copy=False to not copy (which means the actual default needs to change to None, as it will depend on the input).

Long-term, we want copy=False right? That's what we do for Series, and having the meaning of copy depend on the type of data seems strange. And I don't think we can have a warning here, it would be too noisy.

If we deem this is an API change, rather than a bugfix or new feature, we can manually insert a copy for dict-like input, and then remove that in 2.0.

@jorisvandenbossche
Copy link
Member

So trying to sum up the situation, at least for a subset of the cases.

With master:

EA ndarray
Series(..) view view
DataFrame({'a': ...}) view copy

With 0.25.3 (suppose the same with 1.0.3):

EA ndarray
Series(..) view view
DataFrame({'a': ...}) copy copy

Are there other cases relevant for checking?

Long-term, we want copy=False right?

It depends, we certainly want to be able to do copy=False, but we could still have a "safe" default of taking a copy (like, np.array also creates a copy from an array).
But, it's true that that would differ from Series' default.

@jorisvandenbossche
Copy link
Member

The original case where this was catched by @martinfleis is where we modified the array object (not the data).
So maybe at least the DataFrame constructor should take a "view" of the array? (same data, different array object)

@TomAugspurger
Copy link
Contributor

So trying to sum up the situation, at least for a subset of the cases.

That table isn't quite right. The behavior change isn't in __init__, it's that we no longer (accidentally) copy on write when modifying the array.

In [2]: pd.__version__
Out[2]: '1.0.3'

In [3]: a = pd.array([0, 1, 2])
   ...: df1 = pd.DataFrame({"col1": ['a', 'b', 'c'], "col2": a})

In [4]: df1.col2.array is a
Out[4]: True

In [5]: df1.loc[0, 'col2'] = 3

In [6]: df1.col2.array is a
Out[6]: False

One more note in favor of calling this a bugfix, with pandas 1.0.3 we didn't copy on write for homogenous dataframes with a single block.

In [8]: a = pd.array([0, 1, 2])
   ...: df1 = pd.DataFrame({"A": a})

In [9]: df1.A.array is a
Out[9]: True

In [10]: df1.iloc[0, 0] = 10

In [11]: a
Out[11]:
<IntegerArray>
[10, 1, 2]
Length: 3, dtype: Int64

Given the inconsistent behavior, I might be comfortable calling this a "bugfix with possible API-breaking consequences".

If we go that route, to ease with the transition do we want to update DataFrame.__intit__ to have copy apply to a dict of arrays as well?

@jorisvandenbossche
Copy link
Member

Ah, that's a good point .. Thanks for finding that nuance!

Now, to make it more complicated: that behaviour of the "no copy on write for a homogeneous dataframe" is only something new in pandas 1.0. Before that, we (consistently?) copied:

In [11]: pd.__version__ 
Out[11]: '0.25.3'

In [12]: a = pd.array([0, 1, 2])

In [13]: df1 = pd.DataFrame({"A": a})  

In [14]: df1.A.array is a   
Out[14]: False

In [15]: df1.iloc[0, 0] = 10  

In [16]: a    
Out[16]: 
<PandasArray>
[0, 1, 2]
Length: 3, dtype: int64

and it seems here it is actually not even a "copy-on-write", but a copy on initialization.

Do we know if this was a deliberate change between 0.25 -> 1.0?
(not that it needs to have been deliberate to be a potentially good change, just wondering if there would have been some related discussion on a PR or so).

@TomAugspurger
Copy link
Contributor

I don't see anything in the 1.0 changelog discussing that change.

@TomAugspurger
Copy link
Contributor

IMO the best path forward is to make DataFrame({"A": a, "B": b}, copy=True) copy all the values in the dictionary. I don't see a straightforward way to restore the 1.0.4 behavior, and it's not compelling enough to jump through hoops to restore it.

@TomAugspurger
Copy link
Contributor

I think #34872 is not happening for 1.1. So pushing to 1.2.

@TomAugspurger TomAugspurger modified the milestones: 1.1, 1.2 Jul 16, 2020
@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 25, 2020
@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, 1.3 Nov 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Copy / view semantics
Projects
None yet
5 participants