API: Honor copy for dict-input in DataFrame #34872
```diff
@@ -390,7 +390,7 @@ def sanitize_array(
     data,
     index: Optional["Index"],
     dtype: Optional[DtypeObj] = None,
-    copy: bool = False,
+    copy: Optional[bool] = False,
     raise_cast_failure: bool = False,
 ) -> ArrayLike:
     """
```
```diff
@@ -412,6 +412,9 @@ def sanitize_array(

     # GH#846
     if isinstance(data, np.ndarray):
+        if copy is None:
+            # copy by default for DataFrame({"A": ndarray})
+            copy = True
```

Review comment: So if we ever wanted to reconcile the behavior between ndarray and extension array, we'd change this or L434. IMO that's not worth the noise.
Review comment: So you are changing the default here? This certainly needs a deprecation.

Review comment: #34872 (comment) has the best summary. There are no API-breaking changes here.

Review comment: OK, umm, why are we copying ndarrays, though? I mean, the default is False; happy to take a view for ndarrays, but it seems strange to do opposite things.

Review comment: I'm not sure. I don't think we should change it here. If we want to deprecate the default behavior for ndarray or EA, that can be discussed as a followup. It should be relatively simple compared to this PR.

Review comment: I do not think we should do this; it is horribly divergent behavior. We are deprecating tiny little things; this is a major change, and I am -1 unless this has the same behavior for EA and ndarrays.

Review comment: We can't make API-breaking changes in a minor release. And I don't see a cleaner way to solve #32960 than making … I'm fine with aligning the default behavior between ndarray and EA, but that will have to wait for 2.0. And I think that the PR implementing the deprecation should be a followup, since it'll introduce a bunch of uninformative changes in the tests, making the PR harder to review.

Review comment: If you make this False I would be OK with this; True is a breaking change.

Review comment: On master, we copy for dict of ndarrays:

```python
In [4]: import numpy as np

In [5]: import pandas as pd

In [6]: a = np.array([1, 2])

In [7]: df = pd.DataFrame({"A": a})

In [8]: df.iloc[0, 0] = 10

In [9]: a
Out[9]: array([1, 2])
```

Perhaps it's not clear from the diff, but this section is hit only for dict inputs. We continue to not copy by default for ndarray input:

```python
In [4]: a = np.array([1, 2])

In [5]: df = pd.DataFrame({"A": a})

In [6]: df.iloc[0, 0] = 10

In [7]: a
Out[7]: array([1, 2])

In [8]: b = np.ones((4, 4))

In [9]: df = pd.DataFrame(b)

In [10]: df.iloc[0, 0] = 10

In [11]: b
Out[11]:
array([[10.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
```

Review comment: Although, now that I see your objection, other uses of …
```diff

         if dtype is not None and is_float_dtype(data.dtype) and is_integer_dtype(dtype):
             # possibility of nan -> garbage
```
```diff
@@ -428,15 +431,20 @@ def sanitize_array(

     elif isinstance(data, ABCExtensionArray):
         # it is already ensured above this is not a PandasArray
+        # no copy by default for DataFrame({"A": extension_array})
+        if copy is None:
+            copy = False
         subarr = data

         if dtype is not None:
             subarr = subarr.astype(dtype, copy=copy)
         elif copy:
+            # no copy by default from DataFrame.__init__
             subarr = subarr.copy()
         return subarr

     elif isinstance(data, (list, tuple)) and len(data) > 0:
+        copy = bool(copy)  # None -> False
         if dtype is not None:
             subarr = _try_cast(data, dtype, copy, raise_cast_failure)
         else:
```
```diff
@@ -446,16 +454,19 @@ def sanitize_array(

     elif isinstance(data, range):
         # GH#16804
+        copy = bool(copy)  # None -> False
         arr = np.arange(data.start, data.stop, data.step, dtype="int64")
         subarr = _try_cast(arr, dtype, copy, raise_cast_failure)
     elif isinstance(data, abc.Set):
         raise TypeError("Set type is unordered")
     elif lib.is_scalar(data) and index is not None and dtype is not None:
+        copy = bool(copy)  # None -> False
         data = maybe_cast_to_datetime(data, dtype)
         if not lib.is_scalar(data):
             data = data[0]
         subarr = construct_1d_arraylike_from_scalar(data, len(index), dtype)
     else:
+        copy = bool(copy)  # None -> False
         subarr = _try_cast(data, dtype, copy, raise_cast_failure)

     # scalar like, GH
```
```diff
@@ -1822,10 +1822,13 @@ def _shape_compat(x):

     first = arrays[0]
     shape = (len(arrays),) + _shape_compat(first)

-    stacked = np.empty(shape, dtype=dtype)
-    for i, arr in enumerate(arrays):
-        stacked[i] = _asarray_compat(arr)
+    if len(arrays) == 1:
+        # allow for 0-copy construction from dict
+        stacked = _asarray_compat(first).reshape(shape)
+    else:
+        stacked = np.empty(shape, dtype=dtype)
+        for i, arr in enumerate(arrays):
```

Review comment: Why can't you use the original loop? I think that will also allow 0-copy construction, no?

Review comment: Sorry, I don't follow what you mean by original loop. The original loop I see is allocating memory and assigning into it, which doesn't allow 0-copy.
```diff
+            stacked[i] = _asarray_compat(arr)

     return stacked, placement
```
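To illustrate the exchange above: for a single array, `reshape` can return a view over the same memory, whereas the pre-existing `np.empty` plus assignment path always writes into a freshly allocated buffer. A standalone NumPy sketch (the names here are illustrative, not the pandas internals):

```python
import numpy as np

arr = np.arange(4)
shape = (1, 4)

# 0-copy path: reshaping a single contiguous array returns a view.
stacked_view = arr.reshape(shape)
print(np.shares_memory(arr, stacked_view))   # True

# Original loop style: allocate, then assign, which always copies the data.
stacked_copy = np.empty(shape, dtype=arr.dtype)
stacked_copy[0] = arr
print(np.shares_memory(arr, stacked_copy))   # False
```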
```diff
@@ -1941,11 +1941,15 @@ def test_constructor_ndarray_copy(self, float_frame):
     def test_constructor_series_copy(self, float_frame):
         series = float_frame._series

-        df = DataFrame({"A": series["A"]})
+        df = DataFrame({"A": series["A"]}, copy=True)
         df["A"][:] = 5

         assert not (series["A"] == 5).all()

+        df = DataFrame({"A": series["A"]})  # no copy by default
+        df["A"][:] = 5
+        assert (series["A"] == 5).all()
+
     def test_constructor_with_nas(self):
         # GH 5016
         # na's in indices
```
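An equivalent way to express what this test asserts, using `np.shares_memory` instead of mutation (a sketch, not part of the PR; the expected prints assume the defaults described above for Series input):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name="A")

df_copy = pd.DataFrame({"A": s}, copy=True)
df_nocopy = pd.DataFrame({"A": s})  # no copy by default for Series input

print(np.shares_memory(s.to_numpy(), df_copy["A"].to_numpy()))    # expected: False
print(np.shares_memory(s.to_numpy(), df_nocopy["A"].to_numpy()))  # expected: True
```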
```diff
@@ -2746,3 +2750,23 @@ def test_construction_from_set_raises(self):
         msg = "Set type is unordered"
         with pytest.raises(TypeError, match=msg):
             pd.DataFrame({"a": {1, 2, 3}})
+
+    @pytest.mark.parametrize("copy", [None, False, True])
+    def test_dict_nocopy(self, copy):
+        a = np.array([1, 2])
+        b = pd.array([1, 2])
+        df = pd.DataFrame({"a": a, "b": b}, copy=copy)
+        df.iloc[0, 0] = 0
+        df.iloc[0, 1] = 0
+
+        if copy is None:
+            # copy for ndarray, no copy for EA
+            assert a[0] == 1
+            assert b[0] == 0
+
+        elif copy:
+            assert a[0] == 1
+            assert b[0] == 1
+        else:
+            assert a[0] == 0
+            assert b[0] == 0
```
Review comment: @TomAugspurger I've just about gotten this working, but this last assertion is still failing. The trouble is that the …

Review comment: Looks like this also breaks down if we add a second column with the same dtype as …
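A hedged illustration of the issue raised in the last comment: when a second ndarray column has the same dtype, the block manager consolidates both columns into a single 2D block, and filling that block copies the inputs, so even `copy=False` cannot keep sharing memory with them. The snippet below is an assumption about the behavior under discussion, not code from the PR.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2])
b = np.array([3, 4])  # same dtype as ``a``

# Both int64 columns are stacked into one consolidated block, which
# requires copying the inputs even though copy=False was requested.
df = pd.DataFrame({"a": a, "b": b}, copy=False)

print(np.shares_memory(a, df["a"].to_numpy()))  # likely False once consolidated
```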
Review comment: Can you move to 1.2?