-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Honor copy for dict-input in DataFrame #34872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good - minor comments
So it seems that, at least for sparse, we had a test asserting that we did not copy So my recommendation is to just always honor |
Ahh going back and forth on this. Right now this is API breaking for a dict of ndarrays... Will need to think on it more. I don't like it but I think we'll need to treat EAs and ndarrays differently, at least by default. |
OK, in the latest version we should be backwards compatible. By default we will
In [6]: a = pd.array([1, 2])
In [7]: b = np.array([1, 2])
In [8]: df = pd.DataFrame({"A": a, "B": b})
In [9]: df.iloc[0, 0] = 0
In [10]: df.iloc[0, 1] = 0
In [11]: a
Out[11]:
<IntegerArray>
[0, 2]
Length: 2, dtype: Int64
In [12]: b
Out[12]: array([1, 2]) specifying |
@@ -412,6 +412,9 @@ def sanitize_array( | |||
|
|||
# GH#846 | |||
if isinstance(data, np.ndarray): | |||
if copy is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we ever wanted to reconcile the behavior between ndarray and extension array, we'd change this or L434.
IMO that's not worth the noise.
Are people OK with #34872 (comment)? https://github.com/pandas-dev/pandas/pull/34872/files#diff-1e79abbbdd150d4771b91ea60a4e1cc7R361 has a description of the proposed behavior. |
What is the reason for excepting EAs contained in a dict? I think would be nice for them to behave the same was as numpy arrays |
Excepting from a copy by default? That's the existing behavior for dict of EAs. This PR is making it so that an explicit |
Maybe misunderstanding something but I was asking about the default situation, i.e. when a user doesn't provide copy explicitly. What's the reason for the behavior to diverge between EAs / ndarrays in that case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appears to be changing a long standing default, which would need a deprecation cycle.
@@ -412,6 +412,9 @@ def sanitize_array( | |||
|
|||
# GH#846 | |||
if isinstance(data, np.ndarray): | |||
if copy is None: | |||
# copy by default for DataFrame({"A": ndarray}) | |||
copy = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you are changing the default here? this certainly needs a deprecation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#34872 (comment) has the best summary. There's no API-breaking changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, umm why aren't we not copying ndarrays though? I mean the default is false, happy to take a view for ndarrays' but this seems strange to do opposite things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I don't think we should change it here.
If we want to deprecate the default behavior for ndarray or EA then that can be discussed as a followup. It should be relatively simple compared to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should do this, it is a horribly divergent behavior. We are deprecating tiny little things, this is a major change and am -1 unless this has the same behaviror for EA and ndarrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't make API breaking changes in a minor release. And I don't see a cleaner way to solve #32960 than making copy
apply to dict inputs too.
I'm fine with aligning the default behavior between ndarray and EA, but that will have to wait for 2.0. And I think that the PR implementing the deprecation should be a followup since it'll introduce a bunch of uninformative changes in the tests making the PR harder to review.
stacked = _asarray_compat(first).reshape(shape) | ||
else: | ||
stacked = np.empty(shape, dtype=dtype) | ||
for i, arr in enumerate(arrays): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't you use the original loop i think that will also allow 0-copy construction , no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't follow what you mean by original loop.
The original loop I see is allocating memory and assigning into it, which doesn't allow 0-copy.
I don't know the original reason (perhaps consolidation?), but that's the existing behavior. IMO it's not worth changing at this point, but we can consider it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -261,6 +261,7 @@ Other enhancements | |||
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`) | |||
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`) | |||
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`). | |||
- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made (:issue:`32960`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add that we were previously ignoring this (expliciting stating what you are implying)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update this
@@ -412,6 +412,9 @@ def sanitize_array( | |||
|
|||
# GH#846 | |||
if isinstance(data, np.ndarray): | |||
if copy is None: | |||
# copy by default for DataFrame({"A": ndarray}) | |||
copy = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, umm why aren't we not copying ndarrays though? I mean the default is false, happy to take a view for ndarrays' but this seems strange to do opposite things.
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -261,6 +261,7 @@ Other enhancements | |||
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`) | |||
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`) | |||
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`). | |||
- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made (:issue:`32960`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update this
@@ -412,6 +412,9 @@ def sanitize_array( | |||
|
|||
# GH#846 | |||
if isinstance(data, np.ndarray): | |||
if copy is None: | |||
# copy by default for DataFrame({"A": ndarray}) | |||
copy = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should do this, it is a horribly divergent behavior. We are deprecating tiny little things, this is a major change and am -1 unless this has the same behaviror for EA and ndarrays.
same here with this PR, this should wait for 1.2 |
I think this is OK holding for 1.2 if necessary (though I think it's also fine to do now). There won't be a great way to avoid the issue reported at #32960. Users will need to copy the values prior to passing them to pandas. |
I still -1 on this. I think this introduces very odd behavior. I would be ok doing this for 2.0, but this is breaking a lot. |
Can you check again? It's not making any breaking changes. |
@@ -412,6 +412,9 @@ def sanitize_array( | |||
|
|||
# GH#846 | |||
if isinstance(data, np.ndarray): | |||
if copy is None: | |||
# copy by default for DataFrame({"A": ndarray}) | |||
copy = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you make this False i would be ok with this; True is a breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On master, we copy for dict of ndarrays
In [4]: import numpy as np
In [5]: import pandas as pd
a
In [6]: a = np.array([1, 2])
In [7]: df = pd.DataFrame({"A": a})
In [8]: df.iloc[0, 0] = 10
In [9]: a
Out[9]: array([1, 2])
Perhaps it's not clear from the diff, but this section is hit only for dict inputs. We continue to not copy by default for ndarray data
(on this branch):
In [4]: a = np.array([1, 2])
In [5]: df = pd.DataFrame({"A": a})
In [6]: df.iloc[0, 0] = 10
In [7]: a
Out[7]: array([1, 2])
In [8]: b = np.ones((4, 4))
In [9]: df = pd.DataFrame(b)
In [10]: df.iloc[0, 0] = 10
In [11]: b
Out[11]:
array([[10., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, now that I see your objection, other uses of sanitize_array
will need to be checked. I've only updated DataFrame.__init__
, but I'll need to ensure other places use copy=False if they were relying on the default.
That consolidation happens in create_block_manager_from_arrays. i think we need the copy keyword to propagate down to there and not consolidate when copy=False. |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
i think this is worth doing, and that something like #36894 should be incorporated |
@TomAugspurger can you resolve merge conflicts and move release note to 1.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't like the fact that we are changing the default for ndarray for EA even if only for dict-like inputs, but I guess its ok, because in practice these would be consolidated anyhow right? so practially there is another copy but its generally nbd.
pls rebase and confirm if the above is true (and do we need more testing to ensure not breaking other uses of sanitize_array)?
@@ -287,6 +287,7 @@ Other enhancements | |||
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`) | |||
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`) | |||
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`). | |||
- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made, rather than ignoring it (:issue:`32960`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move to 1.2
@TomAugspurger if you dont plan on finishing this anytime soon, mind if i take it up? i think itd be nice to fix and have opinions about how it should be implemented |
Yeah, that'd be great, thanks. I'll leave this branch around in case you want to take the tests. |
assert b[0] == 1 | ||
else: | ||
assert a[0] == 0 | ||
assert b[0] == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger ive just about gotten this working, but this last assertion is still failing. The trouble is that the df.iloc[0, 1] = 0
line ends up calling ExtensionBlock.set_inplace
, which incorrectly sets an entire new array on the block (i.e. #35417)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this also breaks down if we add a second column with the same dtype as a
, since it consolidates on the iloc.__setitem__
Closes #32960.
Still need a whatsnew. The tl/dr isDataFrame({"A": np.array([1, 2]), "B": np.array([1, 2])})
as long as we have consolidation.2. Currently, this is an API breaking change for something likepd.DataFrame({"A": np.array([1, 2])})
. To resolve that, I think we'll need something like a default ofcopy=None
, and then requirecopy=False
to enable no-copy for dict inputs (but keep no copy by default for 2d ndarrays and data frames).Unsure how we're going to reconcile that with the behavior of EAs.
EDIT: I've changed the default
copy
toNone
, which meanscopy=True
to copy)copy=False
to not copy).