Skip to content

DEPR: behavior of copy argument in df.reindex is confusing #34663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
fujiaxiang opened this issue Jun 9, 2020 · 4 comments
Open

DEPR: behavior of copy argument in df.reindex is confusing #34663

fujiaxiang opened this issue Jun 9, 2020 · 4 comments
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action

Comments

@fujiaxiang
Copy link
Member

This was inspired by #33888 and #34584

Problem description

The behavior of copy argument in df.reindex is confusing. Current docstring does it explain it sufficiently clear. It also seems to me copy is unnecessary.

Currently the docstring says

...

A new object is produced unless the new index is equivalent to the current one and ``copy=False``.

...

copy : bool, default True
       Return a new object, even if the passed indexes are the same.

It is hard to clarify what is considered an "equivalent" index. See below for more details.

Further, I believe users rarely purposefully tries to reindex with an "equivalent" index. It happens only if the user does not yet know the current index or the index to conform to, in which case a consistent behavior (e.g. always return new object) is probably preferred.

# On current master
>>> pd.__version__
'1.1.0.dev0+1802.g942beba1e'

>>> df = pd.DataFrame(range(3))
>>> df
   0
0  0
1  1
2  2
>>> df.index
RangeIndex(start=0, stop=3, step=1)

# not equivalent
>>> df is df.reindex(range(3), copy=False)
False

# not equivalent
>>> df is df.reindex(list(range(3)), copy=False)
False

# equivalent
>>> df is df.reindex(pd.RangeIndex(start=0, stop=3, step=1), copy=False)
True

>>> df = pd.DataFrame(range(3), index=list(range(3)))
>>> df
   0
0  0
1  1
2  2
>>> df.index
Int64Index([0, 1, 2], dtype='int64')

# not equivalent
>>> df is df.reindex(range(3), copy=False)
False

# even this is considered not equivalent
>>> df is df.reindex(list(range(3)), copy=False)
False

>>> df is df.reindex(pd.Int64Index([0, 1, 2]), copy=False)
True

You can see it is actually pretty strict to be "equivalent". I feel it does really make sense to have this copy parameter because reindex will return a new object in most cases anyway even when copy=False.

So the question is, can we deprecate copy?

@fujiaxiang fujiaxiang added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jun 9, 2020
@fujiaxiang fujiaxiang added Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jun 9, 2020
@jorisvandenbossche
Copy link
Member

Even though the "equivalent" criterion is maybe too strict or not very clear (two things that might be worth solving for sure), there is still value in having this keyword for the case you really have an identical index.
For example, the copy=False argument is used quite a lot internally to avoid copies. In some operations, input needs be reindexed (although this is often done through align, but that is under the hood based on the same code as reindex), and in those cases you want to avoid actually needing to reindex when the other object has the same index (eg df + df)

@fujiaxiang fujiaxiang changed the title DEPR: behavior of copy argument in df.reindex is confusing and doesn't achieve anything DEPR: behavior of copy argument in df.reindex is confusing Jun 9, 2020
@gauravchandok
Copy link

The copy parameter directly contradicts that a new object is created when the copy=false argument meets a rare "equivalent" index. Although perhaps stating it as a special note might bring some attention to the how unique this is might help. It seems to be the only exception to the reindex summary which is why I tried to keep it in.

@lijier6
Copy link

lijier6 commented Jul 7, 2020

Or maybe just leave it as a parameter that user define to return an object or not?

copy=False means do reindex on the original dataframe.

@jorisvandenbossche jorisvandenbossche added API Design Needs Discussion Requires discussion from core team before further action and removed Deprecate Functionality to remove in pandas labels Jul 8, 2020
@jorisvandenbossche
Copy link
Member

The copy parameter directly contradicts that a new object is created when the copy=false argument meets a rare "equivalent" index.

Again this is something that could be clarified in the docstring, but this behaviour of copy is very similar to for example numpy's handling of the copy argument on array coercion: eg np.array(some_obj, copy=False) doesn't mean that no copy is made, but only means that "a copy is avoided when possible", while copy=True means to simply always copy the input.

@mroeschke mroeschke added Deprecate Functionality to remove in pandas and removed API Design labels Aug 7, 2021
@pandas-dev pandas-dev deleted a comment from 2bon Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

5 participants