Skip to content

API: Honor copy for dict-input in DataFrame #34872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jun 19, 2020

Closes #32960.

Still need a whatsnew. The tl/dr is

  1. We can't honor no-copy for a dict with multiple values of the same dtype: DataFrame({"A": np.array([1, 2]), "B": np.array([1, 2])}) as long as we have consolidation.
    2. Currently, this is an API breaking change for something like pd.DataFrame({"A": np.array([1, 2])}). To resolve that, I think we'll need something like a default of copy=None, and then require copy=False to enable no-copy for dict inputs (but keep no copy by default for 2d ndarrays and data frames).

Unsure how we're going to reconcile that with the behavior of EAs.

EDIT: I've changed the default copy to None, which means

  1. no copy for ndarray / dataframe input (copy=True to copy)
  2. copy for dict input (copy=False to not copy).

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - minor comments

@TomAugspurger
Copy link
Contributor Author

So it seems that, at least for sparse, we had a test asserting that we did not copy DataFrame({"A": sparse_array}) by default. So I don't think we can restore the pre-1.0 behavior of copying.

So my recommendation is to just always honor copy for dict-inputs when we can.

@TomAugspurger
Copy link
Contributor Author

Ahh going back and forth on this. Right now this is API breaking for a dict of ndarrays... Will need to think on it more. I don't like it but I think we'll need to treat EAs and ndarrays differently, at least by default.

@TomAugspurger
Copy link
Contributor Author

OK, in the latest version we should be backwards compatible. By default we will

  1. copy ndarrays
  2. not copy EAs
In [6]: a = pd.array([1, 2])

In [7]: b = np.array([1, 2])

In [8]: df = pd.DataFrame({"A": a, "B": b})

In [9]: df.iloc[0, 0] = 0

In [10]: df.iloc[0, 1] = 0

In [11]: a
Out[11]:
<IntegerArray>
[0, 2]
Length: 2, dtype: Int64

In [12]: b
Out[12]: array([1, 2])

specifying copy=True/False will make things always copy or not, unless you have multiple values consolidated into a single block, in which chase there's an unavoidable copy.

@gfyoung gfyoung added API Design DataFrame DataFrame data structure Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 20, 2020
@@ -412,6 +412,9 @@ def sanitize_array(

# GH#846
if isinstance(data, np.ndarray):
if copy is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we ever wanted to reconcile the behavior between ndarray and extension array, we'd change this or L434.

IMO that's not worth the noise.

@TomAugspurger
Copy link
Contributor Author

Are people OK with #34872 (comment)? https://github.com/pandas-dev/pandas/pull/34872/files#diff-1e79abbbdd150d4771b91ea60a4e1cc7R361 has a description of the proposed behavior.

@WillAyd
Copy link
Member

WillAyd commented Jun 29, 2020

What is the reason for excepting EAs contained in a dict? I think would be nice for them to behave the same was as numpy arrays

@TomAugspurger
Copy link
Contributor Author

What is the reason for excepting EAs contained in a dict?

Excepting from a copy by default? That's the existing behavior for dict of EAs. This PR is making it so that an explicit copy=True/False will / won't copy.

@WillAyd
Copy link
Member

WillAyd commented Jun 29, 2020

Maybe misunderstanding something but I was asking about the default situation, i.e. when a user doesn't provide copy explicitly. What's the reason for the behavior to diverge between EAs / ndarrays in that case?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears to be changing a long standing default, which would need a deprecation cycle.

@@ -412,6 +412,9 @@ def sanitize_array(

# GH#846
if isinstance(data, np.ndarray):
if copy is None:
# copy by default for DataFrame({"A": ndarray})
copy = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you are changing the default here? this certainly needs a deprecation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#34872 (comment) has the best summary. There's no API-breaking changes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, umm why aren't we not copying ndarrays though? I mean the default is false, happy to take a view for ndarrays' but this seems strange to do opposite things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I don't think we should change it here.

If we want to deprecate the default behavior for ndarray or EA then that can be discussed as a followup. It should be relatively simple compared to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should do this, it is a horribly divergent behavior. We are deprecating tiny little things, this is a major change and am -1 unless this has the same behaviror for EA and ndarrays.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Jul 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't make API breaking changes in a minor release. And I don't see a cleaner way to solve #32960 than making copy apply to dict inputs too.

I'm fine with aligning the default behavior between ndarray and EA, but that will have to wait for 2.0. And I think that the PR implementing the deprecation should be a followup since it'll introduce a bunch of uninformative changes in the tests making the PR harder to review.

stacked = _asarray_compat(first).reshape(shape)
else:
stacked = np.empty(shape, dtype=dtype)
for i, arr in enumerate(arrays):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you use the original loop i think that will also allow 0-copy construction , no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't follow what you mean by original loop.

The original loop I see is allocating memory and assigning into it, which doesn't allow 0-copy.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 30, 2020

What's the reason for the behavior to diverge between EAs / ndarrays in that case?

I don't know the original reason (perhaps consolidation?), but that's the existing behavior. IMO it's not worth changing at this point, but we can consider it.

@TomAugspurger TomAugspurger added this to the 1.1 milestone Jul 6, 2020
@TomAugspurger TomAugspurger added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 6, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments

@@ -261,6 +261,7 @@ Other enhancements
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`).
- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made (:issue:`32960`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add that we were previously ignoring this (expliciting stating what you are implying)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update this

@@ -412,6 +412,9 @@ def sanitize_array(

# GH#846
if isinstance(data, np.ndarray):
if copy is None:
# copy by default for DataFrame({"A": ndarray})
copy = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, umm why aren't we not copying ndarrays though? I mean the default is false, happy to take a view for ndarrays' but this seems strange to do opposite things.

@TomAugspurger TomAugspurger removed the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 7, 2020
@@ -261,6 +261,7 @@ Other enhancements
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`).
- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made (:issue:`32960`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update this

@@ -412,6 +412,9 @@ def sanitize_array(

# GH#846
if isinstance(data, np.ndarray):
if copy is None:
# copy by default for DataFrame({"A": ndarray})
copy = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should do this, it is a horribly divergent behavior. We are deprecating tiny little things, this is a major change and am -1 unless this has the same behaviror for EA and ndarrays.

@jreback jreback removed this from the 1.1 milestone Jul 9, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone Jul 12, 2020
@jreback
Copy link
Contributor

jreback commented Jul 13, 2020

same here with this PR, this should wait for 1.2

@TomAugspurger
Copy link
Contributor Author

Then we'll need to figure out if there are any API breaking changes reported in #32960. At a glance, I don't think there are any. I think #32831 was identified as the reason for the new behavior (which is a bugfix) so we should be OK leaving this till 1.2, but would appreciate a confirmation.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 15, 2020

I think this is OK holding for 1.2 if necessary (though I think it's also fine to do now). There won't be a great way to avoid the issue reported at #32960. Users will need to copy the values prior to passing them to pandas.

@jreback
Copy link
Contributor

jreback commented Jul 15, 2020

I still -1 on this. I think this introduces very odd behavior. I would be ok doing this for 2.0, but this is breaking a lot.

@TomAugspurger
Copy link
Contributor Author

Can you check again? It's not making any breaking changes.

@@ -412,6 +412,9 @@ def sanitize_array(

# GH#846
if isinstance(data, np.ndarray):
if copy is None:
# copy by default for DataFrame({"A": ndarray})
copy = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you make this False i would be ok with this; True is a breaking change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master, we copy for dict of ndarrays

In [4]: import numpy as np

In [5]: import pandas as pd
a
In [6]: a = np.array([1, 2])

In [7]: df = pd.DataFrame({"A": a})

In [8]: df.iloc[0, 0] = 10

In [9]: a
Out[9]: array([1, 2])

Perhaps it's not clear from the diff, but this section is hit only for dict inputs. We continue to not copy by default for ndarray data (on this branch):

In [4]: a = np.array([1, 2])

In [5]: df = pd.DataFrame({"A": a})

In [6]: df.iloc[0, 0] = 10

In [7]: a
Out[7]: array([1, 2])

In [8]: b = np.ones((4, 4))

In [9]: df = pd.DataFrame(b)

In [10]: df.iloc[0, 0] = 10

In [11]: b
Out[11]:
array([[10.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, now that I see your objection, other uses of sanitize_array will need to be checked. I've only updated DataFrame.__init__, but I'll need to ensure other places use copy=False if they were relying on the default.

@jbrockmendel
Copy link
Member

We can't honor no-copy for a dict with multiple values of the same dtype: DataFrame({"A": np.array([1, 2]), "B": np.array([1, 2])}) as long as we have consolidation.

That consolidation happens in create_block_manager_from_arrays. i think we need the copy keyword to propagate down to there and not consolidate when copy=False.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 12, 2020
@jbrockmendel
Copy link
Member

i think this is worth doing, and that something like #36894 should be incorporated

@simonjayhawkins
Copy link
Member

@TomAugspurger can you resolve merge conflicts and move release note to 1.2

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like the fact that we are changing the default for ndarray for EA even if only for dict-like inputs, but I guess its ok, because in practice these would be consolidated anyhow right? so practially there is another copy but its generally nbd.

pls rebase and confirm if the above is true (and do we need more testing to ensure not breaking other uses of sanitize_array)?

@@ -287,6 +287,7 @@ Other enhancements
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` method that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`, :issue:`34392`).
- The :class:`DataFrame` constructor now uses ``copy`` for dict-inputs to control whether copies of the arrays are made, rather than ignoring it (:issue:`32960`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 1.2

@jbrockmendel
Copy link
Member

@TomAugspurger if you dont plan on finishing this anytime soon, mind if i take it up? i think itd be nice to fix and have opinions about how it should be implemented

@TomAugspurger
Copy link
Contributor Author

Yeah, that'd be great, thanks.

I'll leave this branch around in case you want to take the tests.

assert b[0] == 1
else:
assert a[0] == 0
assert b[0] == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger ive just about gotten this working, but this last assertion is still failing. The trouble is that the df.iloc[0, 1] = 0 line ends up calling ExtensionBlock.set_inplace, which incorrectly sets an entire new array on the block (i.e. #35417)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this also breaks down if we add a second column with the same dtype as a, since it consolidates on the iloc.__setitem__

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataframe change alters original array used in creation
6 participants