API: Honor copy for dict-input in DataFrame #34872

TomAugspurger · 2020-06-19T15:39:16Z

Closes #32960.

~~Still need a whatsnew. The tl/dr is~~

1. We can't honor no-copy for a dict with multiple values of the same dtype: DataFrame({"A": np.array([1, 2]), "B": np.array([1, 2])}) as long as we have consolidation.
2. Currently, this is an API breaking change for something like pd.DataFrame({"A": np.array([1, 2])}). To resolve that, I think we'll need something like a default of copy=None, and then require copy=False to enable no-copy for dict inputs (but keep no copy by default for 2d ndarrays and data frames).

Unsure how we're going to reconcile that with the behavior of EAs.

EDIT: I've changed the default copy to None, which means

- no copy for ndarray / dataframe input (copy=True to copy)
- copy for dict input (copy=False to not copy).
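
The copy=None default described above can be read as a small dispatch on the input type. A minimal sketch, assuming a hypothetical helper `resolve_copy` (not part of pandas) that only looks at whether the input is a dict:

```python
import numpy as np


def resolve_copy(data, copy=None):
    """Hypothetical helper: turn the tri-state ``copy`` into a bool.

    An explicit True/False always wins; None means "copy dict input,
    do not copy anything else" (ndarray / DataFrame input).
    """
    if copy is not None:
        return bool(copy)
    return isinstance(data, dict)


# Default: dict input is copied, ndarray input is not.
dict_default = resolve_copy({"A": np.array([1, 2])})
array_default = resolve_copy(np.ones((2, 2)))
# Explicit values override the default in both directions.
dict_no_copy = resolve_copy({"A": np.array([1, 2])}, copy=False)
array_copy = resolve_copy(np.ones((2, 2)), copy=True)
```

Using None as the sentinel is what lets the two input kinds keep different defaults while still sharing a single keyword.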

Closes pandas-dev#32960

WillAyd

looks good - minor comments

pandas/tests/frame/test_constructors.py

pandas/core/internals/construction.py

TomAugspurger · 2020-06-19T19:52:40Z

So it seems that, at least for sparse, we had a test asserting that we did not copy DataFrame({"A": sparse_array}) by default. So I don't think we can restore the pre-1.0 behavior of copying.

So my recommendation is to just always honor copy for dict-inputs when we can.

TomAugspurger · 2020-06-19T20:09:02Z

Ahh going back and forth on this. Right now this is API breaking for a dict of ndarrays... Will need to think on it more. I don't like it but I think we'll need to treat EAs and ndarrays differently, at least by default.

TomAugspurger · 2020-06-19T20:33:36Z

OK, in the latest version we should be backwards compatible. By default we will

- copy ndarrays
- not copy EAs

In [6]: a = pd.array([1, 2])

In [7]: b = np.array([1, 2])

In [8]: df = pd.DataFrame({"A": a, "B": b})

In [9]: df.iloc[0, 0] = 0

In [10]: df.iloc[0, 1] = 0

In [11]: a
Out[11]:
<IntegerArray>
[0, 2]
Length: 2, dtype: Int64

In [12]: b
Out[12]: array([1, 2])

Specifying copy=True/False will make things always copy or not, unless you have multiple values consolidated into a single block, in which case there's an unavoidable copy.
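
An explicit copy=True can be checked from the outside by mutating the frame and then inspecting the input. A minimal sketch, assuming a pandas version in which copy is honored for dict input; the default behavior without copy may differ between versions, so nothing is claimed about it here:

```python
import numpy as np
import pandas as pd

a = np.array([1, 2])

# Explicitly request a copy of the dict values.
df = pd.DataFrame({"A": a}, copy=True)
df.iloc[0, 0] = 10

# The frame sees the new value; the original ndarray should not.
frame_value = int(df.iloc[0, 0])
input_untouched = a.tolist() == [1, 2]
print(frame_value, input_untouched)
```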

TomAugspurger · 2020-06-22T16:24:28Z

pandas/core/construction.py

@@ -412,6 +412,9 @@ def sanitize_array(

    # GH#846
    if isinstance(data, np.ndarray):
+        if copy is None:


So if we ever wanted to reconcile the behavior between ndarray and extension array, we'd change this or L434.

IMO that's not worth the noise.

TomAugspurger · 2020-06-29T21:18:06Z

Are people OK with #34872 (comment)? https://github.com/pandas-dev/pandas/pull/34872/files#diff-1e79abbbdd150d4771b91ea60a4e1cc7R361 has a description of the proposed behavior.

WillAyd · 2020-06-29T21:25:39Z

What is the reason for excepting EAs contained in a dict? I think it would be nice for them to behave the same way as numpy arrays.

TomAugspurger · 2020-06-29T21:43:21Z

> What is the reason for excepting EAs contained in a dict?

Excepting from a copy by default? That's the existing behavior for dict of EAs. This PR is making it so that an explicit copy=True/False will / won't copy.

WillAyd · 2020-06-29T22:00:53Z

Maybe misunderstanding something but I was asking about the default situation, i.e. when a user doesn't provide copy explicitly. What's the reason for the behavior to diverge between EAs / ndarrays in that case?

jreback

this appears to be changing a long standing default, which would need a deprecation cycle.

jreback · 2020-06-29T22:53:26Z

pandas/core/construction.py

@@ -412,6 +412,9 @@ def sanitize_array(

    # GH#846
    if isinstance(data, np.ndarray):
+        if copy is None:
+            # copy by default for DataFrame({"A": ndarray})
+            copy = True


so you are changing the default here? this certainly needs a deprecation

#34872 (comment) has the best summary. There's no API-breaking changes here.

ok, umm why aren't we not copying ndarrays though? I mean the default is false, happy to take a view for ndarrays, but this seems strange to do opposite things.

I'm not sure. I don't think we should change it here.

If we want to deprecate the default behavior for ndarray or EA then that can be discussed as a followup. It should be relatively simple compared to this PR.

I do not think we should do this, it is a horribly divergent behavior. We are deprecating tiny little things, this is a major change and am -1 unless this has the same behavior for EA and ndarrays.

We can't make API breaking changes in a minor release. And I don't see a cleaner way to solve #32960 than making copy apply to dict inputs too.

I'm fine with aligning the default behavior between ndarray and EA, but that will have to wait for 2.0. And I think that the PR implementing the deprecation should be a followup since it'll introduce a bunch of uninformative changes in the tests making the PR harder to review.
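
For reference, the deprecation cycle mentioned in this thread would usually mean warning whenever the default is relied upon, before changing it in a later release. A rough sketch, where `default_copy_for_ndarray` is a hypothetical helper and the warning text is made up for illustration; nothing like it is part of this PR:

```python
import warnings


def default_copy_for_ndarray(copy=None):
    """Hypothetical: resolve ``copy`` for an ndarray found in a dict."""
    if copy is None:
        # Keep the current behavior (copy), but tell users it may change.
        warnings.warn(
            "The default of copying ndarrays in dict input may change; "
            "pass copy=True or copy=False explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return bool(copy)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = default_copy_for_ndarray()        # relies on the default
    explicit = default_copy_for_ndarray(False)   # explicit, no warning
```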

jreback · 2020-06-29T22:54:42Z

pandas/core/internals/managers.py

+        stacked = _asarray_compat(first).reshape(shape)
+    else:
+        stacked = np.empty(shape, dtype=dtype)
+        for i, arr in enumerate(arrays):


why can't you use the original loop? i think that will also allow 0-copy construction, no?

Sorry, I don't follow what you mean by original loop.

The original loop I see is allocating memory and assigning into it, which doesn't allow 0-copy.
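
The difference between the two branches in the diff can be seen with plain NumPy. A minimal sketch (the helper `stack_arrays` is hypothetical and simplified, and assumes equal-length arrays of one dtype; it is not the pandas implementation): a single array can be reshaped into a 2D block as a view, while several arrays have to be written into freshly allocated memory.

```python
import numpy as np


def stack_arrays(arrays):
    """Hypothetical, simplified stacking of 1D arrays into a 2D block."""
    shape = (len(arrays), len(arrays[0]))
    if len(arrays) == 1:
        # One array: reshaping a contiguous array returns a view (no copy).
        return arrays[0].reshape(shape)
    # Several arrays: allocate a new block and copy each row into it.
    stacked = np.empty(shape, dtype=arrays[0].dtype)
    for i, arr in enumerate(arrays):
        stacked[i] = arr
    return stacked


a = np.array([1, 2])
b = np.array([3, 4])

single = stack_arrays([a])
double = stack_arrays([a, b])

single_shares = np.shares_memory(single, a)
double_shares = np.shares_memory(double, a)
```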

TomAugspurger · 2020-06-30T11:01:53Z

> What's the reason for the behavior to diverge between EAs / ndarrays in that case?

I don't know the original reason (perhaps consolidation?), but that's the existing behavior. IMO it's not worth changing at this point, but we can consider it.

jreback

see my comments

jreback · 2020-07-06T23:32:38Z

doc/source/whatsnew/v1.1.0.rst

@@ -261,6 +261,7 @@ Other enhancements
 - :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
 - :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
 - :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`).
+- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made (:issue:`32960`)


can you add that we were previously ignoring this (explicitly stating what you are implying)

can you update this

jreback · 2020-07-06T23:33:56Z

pandas/core/construction.py

@@ -412,6 +412,9 @@ def sanitize_array(

    # GH#846
    if isinstance(data, np.ndarray):
+        if copy is None:
+            # copy by default for DataFrame({"A": ndarray})
+            copy = True


ok, umm why aren't we not copying ndarrays though? I mean the default is false, happy to take a view for ndarrays, but this seems strange to do opposite things.

jreback · 2020-07-09T23:24:58Z

doc/source/whatsnew/v1.1.0.rst

@@ -261,6 +261,7 @@ Other enhancements
 - :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
 - :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
 - :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`).
+- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made (:issue:`32960`)


can you update this

jreback · 2020-07-09T23:25:53Z

pandas/core/construction.py

@@ -412,6 +412,9 @@ def sanitize_array(

    # GH#846
    if isinstance(data, np.ndarray):
+        if copy is None:
+            # copy by default for DataFrame({"A": ndarray})
+            copy = True


I do not think we should do this, it is a horribly divergent behavior. We are deprecating tiny little things, this is a major change and am -1 unless this has the same behavior for EA and ndarrays.

jreback · 2020-07-13T13:30:16Z

same here with this PR, this should wait for 1.2

TomAugspurger · 2020-07-13T14:22:32Z

Then we'll need to figure out if there are any API breaking changes reported in #32960. At a glance, I don't think there are any. I think #32831 was identified as the reason for the new behavior (which is a bugfix) so we should be OK leaving this till 1.2, but would appreciate a confirmation.

TomAugspurger · 2020-07-15T19:14:19Z

I think this is OK holding for 1.2 if necessary (though I think it's also fine to do now). There won't be a great way to avoid the issue reported at #32960. Users will need to copy the values prior to passing them to pandas.
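
The workaround mentioned here is a defensive copy on the caller's side. A minimal sketch using an extension array, since that is the case described in this thread as not copied by default:

```python
import pandas as pd

a = pd.array([1, 2])

# Copy before handing the array to pandas, so the frame owns its data.
df = pd.DataFrame({"A": a.copy()})
df.iloc[0, 0] = 0

original_first = int(a[0])
print(original_first)
```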

jreback · 2020-07-15T22:43:55Z

I'm still -1 on this. I think this introduces very odd behavior. I would be ok doing this for 2.0, but this is breaking a lot.

TomAugspurger · 2020-07-16T01:57:09Z

Can you check again? It's not making any breaking changes.

jreback · 2020-07-16T11:51:34Z

pandas/core/construction.py

@@ -412,6 +412,9 @@ def sanitize_array(

    # GH#846
    if isinstance(data, np.ndarray):
+        if copy is None:
+            # copy by default for DataFrame({"A": ndarray})
+            copy = True


if you make this False i would be ok with this; True is a breaking change

On master, we copy for dict of ndarrays

In [4]: import numpy as np

In [5]: import pandas as pd

In [6]: a = np.array([1, 2])

In [7]: df = pd.DataFrame({"A": a})

In [8]: df.iloc[0, 0] = 10

In [9]: a
Out[9]: array([1, 2])

Perhaps it's not clear from the diff, but this section is hit only for dict inputs. We continue to not copy by default for ndarray data (on this branch):

In [4]: a = np.array([1, 2])

In [5]: df = pd.DataFrame({"A": a})

In [6]: df.iloc[0, 0] = 10

In [7]: a
Out[7]: array([1, 2])

In [8]: b = np.ones((4, 4))

In [9]: df = pd.DataFrame(b)

In [10]: df.iloc[0, 0] = 10

In [11]: b
Out[11]:
array([[10.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

Although, now that I see your objection, other uses of sanitize_array will need to be checked. I've only updated DataFrame.__init__, but I'll need to ensure other places use copy=False if they were relying on the default.

jbrockmendel · 2020-09-11T19:52:54Z

> We can't honor no-copy for a dict with multiple values of the same dtype: DataFrame({"A": np.array([1, 2]), "B": np.array([1, 2])}) as long as we have consolidation.

That consolidation happens in create_block_manager_from_arrays. i think we need the copy keyword to propagate down to there and not consolidate when copy=False.
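
One way to picture this suggestion: skip the group-by-dtype step when copy=False and keep one block per input array. A rough sketch with a hypothetical `make_blocks` function over plain NumPy arrays; the real create_block_manager_from_arrays is more involved and its signature is not reproduced here:

```python
import numpy as np


def make_blocks(arrays, copy):
    """Hypothetical: return a list of 2D blocks for equal-length 1D arrays."""
    if not copy:
        # No consolidation: each array becomes its own (1, n) view.
        return [arr.reshape(1, -1) for arr in arrays]
    # Consolidate: stack arrays of the same dtype into one new block.
    by_dtype = {}
    for arr in arrays:
        by_dtype.setdefault(arr.dtype, []).append(arr)
    return [np.vstack(group) for group in by_dtype.values()]


a = np.array([1, 2])
b = np.array([3, 4])

views = make_blocks([a, b], copy=False)
consolidated = make_blocks([a, b], copy=True)
```

The trade-off in this sketch is the usual one: unconsolidated blocks preserve the caller's memory, while consolidated blocks give one contiguous array per dtype at the cost of a copy.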

github-actions · 2020-10-12T00:15:49Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jbrockmendel · 2020-10-12T00:24:12Z

i think this is worth doing, and that something like #36894 should be incorporated

simonjayhawkins · 2020-11-12T10:59:52Z

@TomAugspurger can you resolve merge conflicts and move release note to 1.2

jreback

I still don't like the fact that we are changing the default for ndarray for EA even if only for dict-like inputs, but I guess it's ok, because in practice these would be consolidated anyhow, right? so practically there is another copy but it's generally nbd.

pls rebase and confirm if the above is true (and do we need more testing to ensure not breaking other uses of sanitize_array)?

jreback · 2020-11-18T18:26:01Z

doc/source/whatsnew/v1.1.0.rst

@@ -287,6 +287,7 @@ Other enhancements
 - :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
 - :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
 - :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`).
+- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made, rather than ignoring it (:issue:`32960`)


can you move to 1.2

jbrockmendel · 2020-11-19T02:19:41Z

@TomAugspurger if you dont plan on finishing this anytime soon, mind if i take it up? i think itd be nice to fix and have opinions about how it should be implemented

TomAugspurger · 2020-11-19T12:57:52Z

Yeah, that'd be great, thanks.

I'll leave this branch around in case you want to take the tests.

jbrockmendel · 2020-12-09T02:21:32Z

pandas/tests/frame/test_constructors.py

+            assert b[0] == 1
+        else:
+            assert a[0] == 0
+            assert b[0] == 0


@TomAugspurger ive just about gotten this working, but this last assertion is still failing. The trouble is that the df.iloc[0, 1] = 0 line ends up calling ExtensionBlock.set_inplace, which incorrectly sets an entire new array on the block (i.e. #35417)

looks like this also breaks down if we add a second column with the same dtype as a, since it consolidates on the iloc.__setitem__

API: Honor copy for dict-input in DataFrame

7b892bd

Closes pandas-dev#32960

WillAyd reviewed Jun 19, 2020

TomAugspurger added 3 commits June 19, 2020 14:07

Fixups

acf99dd

copy

499080b

fixup

20c87ce

simplify

b0b125d

optional

f9b3f16

gfyoung added API Design DataFrame DataFrame data structure Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 20, 2020

TomAugspurger added 2 commits June 22, 2020 11:10

Merge remote-tracking branch 'upstream/master' into 32960-copy-dict

8bf9f73

Fixup

306d015

TomAugspurger commented Jun 22, 2020

jreback requested changes Jun 29, 2020

TomAugspurger added this to the 1.1 milestone Jul 6, 2020

TomAugspurger added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 6, 2020

jreback requested changes Jul 6, 2020

TomAugspurger removed the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 7, 2020

jreback requested changes Jul 9, 2020

jreback removed this from the 1.1 milestone Jul 9, 2020

TomAugspurger added this to the 1.1 milestone Jul 12, 2020

Merge remote-tracking branch 'upstream/master' into 32960-copy-dict

bb730cc

fix comment

9f716c8

Merge remote-tracking branch 'upstream/master' into 32960-copy-dict

a9db888

jreback requested changes Jul 16, 2020

TomAugspurger mentioned this pull request Jul 16, 2020

Dataframe change alters original array used in creation #32960

Closed

TomAugspurger modified the milestones: 1.1, 1.2 Jul 27, 2020

jreback mentioned this pull request Sep 23, 2020

BUG: df1.values is df1_shallow_copy.values returns false #36571

Open

3 tasks

jbrockmendel mentioned this pull request Oct 5, 2020

ENH: allow non-consolidation in constructors #36894

Closed

5 tasks

github-actions bot added the Stale label Oct 12, 2020

jreback requested changes Nov 18, 2020

TomAugspurger closed this Nov 19, 2020

jbrockmendel reviewed Dec 9, 2020

jbrockmendel mentioned this pull request Jan 4, 2021

API: honor copy=True when passing dict to DataFrame #38939

Merged

4 tasks
