REGR: __setitem__ with integer slices on Int/RangeIndex is broken (label instead of positional) #31469

Closed
amueller opened this issue Jan 30, 2020 · 15 comments · Fixed by #31515
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@amueller

amueller commented Jan 30, 2020

There's a backward-incompatible change in pandas 1.0 that I didn't find in the changelog. I might have just overlooked it, though.

import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((100, 1)))
X[-4:] = 1  # intended: set only the last four rows
X

In pandas 0.25.3 or lower, this sets the last four entries of X to 1 and leaves all the others at zero. In pandas 1.0, it sets every entry of X to 1.
I assume the change is in whether the slice indexes axis 0 or axis 1?

@amueller
Author

I wonder if it's related to #31449 but I'm not using a multi-index.

@MarcoGorelli MarcoGorelli added the Regression Functionality that used to work in a prior pandas version label Jan 30, 2020
@MarcoGorelli
Member

MarcoGorelli commented Jan 30, 2020

Thanks for the report.

Seems this doesn't affect .iloc:

In [26]: import numpy as np
    ...: import pandas as pd
    ...: X = pd.DataFrame(np.zeros((5, 1)))
    ...: X.iloc[-4:] = 1
    ...: X
Out[26]: 
     0
0  0.0
1  1.0
2  1.0
3  1.0
4  1.0

will look into it

@jreback
Contributor

jreback commented Jan 30, 2020

you are label indexing with a slice with loc
since none of the labels exist nothing is set

did this actually work previously?

this should never have worked with .loc

it might have with [] which has fallback integer indexing

@jorisvandenbossche jorisvandenbossche added this to the 1.0.1 milestone Jan 30, 2020
@jorisvandenbossche
Member

I do not remember any specific discussion about this, so I think it is definitely a regression.

Slicing rows in [] has always worked positionally if there is an integer index (surprising, yes, but longstanding behaviour; see e.g. my summary from 5 years ago in #9595).

@jorisvandenbossche jorisvandenbossche added the Indexing Related to indexing on series/frames, not to indexes themselves label Jan 30, 2020
@MarcoGorelli
Member

MarcoGorelli commented Jan 30, 2020

> since none of the labels exist nothing is set

@jreback if I've understood correctly, the issue is that everything is being set

>>> import numpy as np
>>> import pandas as pd

>>> X = pd.DataFrame(np.zeros((5, 1)))
>>> X                                                                     
     0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0

>>> X[-4:]  # only prints the last 4 rows of X...
     0
1  0.0
2  0.0
3  0.0
4  0.0

>>> X[-4:] = 1
>>> X  # ...but everything (including the first row) has now been set
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0
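The mismatch shown above can be turned into a small check. This is only a sketch: `check_setitem_slice_is_positional` is a hypothetical helper name, and it assumes that `X[start:] = value` is meant to touch the same rows as the explicitly positional `X.iloc[start:] = value`. On a pandas version that includes the fix it should return True; on 1.0.0, per the output above, the two frames would differ.

```python
import numpy as np
import pandas as pd


def check_setitem_slice_is_positional(n_rows=5, start=-4):
    """Compare `X[start:] = 1` with the explicitly positional `.iloc` form."""
    # Assignment through plain [] with an integer slice (the case reported here).
    via_getitem = pd.DataFrame(np.zeros((n_rows, 1)))
    via_getitem[start:] = 1

    # Reference result: .iloc is always positional.
    via_iloc = pd.DataFrame(np.zeros((n_rows, 1)))
    via_iloc.iloc[start:] = 1

    # True only if both assignments touched exactly the same rows.
    return via_getitem.equals(via_iloc)


is_positional = check_setitem_slice_is_positional()
```

Code that must behave the same across pandas versions can sidestep the question entirely by writing `X.iloc[-4:] = 1`.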

@jreback
Contributor

jreback commented Jan 30, 2020

> I do not remember any specific discussion about this, so I think it is definitely a regression.
>
> Slicing rows in [] has always worked positionally if there is an integer index (surprising, yes, but longstanding behaviour; see e.g. my summary from 5 years ago in #9595).

maybe, but indexing with an out-of-range label on both sides should return nothing

so the results are correct

@jreback
Contributor

jreback commented Jan 30, 2020

> since none of the labels exist nothing is set
>
> @jreback if I've understood correctly, the issue is that everything is being set

ahh ok, that is not correct; I would expect this indexer to return nothing

@jreback
Contributor

jreback commented Jan 30, 2020

might be #31393

@amueller
Author

> ahh ok, that is not correct; I would expect this indexer to return nothing

Asking for the shape, both in 0.25 and 1.0, you get

>>> X[-4:].shape
(4, 1)

but assignment in version 1.0 assigns to everything.

@jorisvandenbossche
Member

> maybe, but indexing with an out-of-range label on both sides should return nothing

This is about positional indexing, so there is no "out of range label". The -4 means: start at the fourth-to-last element and go to the end.

Again, I agree this is surprising behaviour. You would think it is label-based indexing, but it is not. I already described this 5 years ago in #9595.

Some examples to illustrate this:

In [21]: df = pd.DataFrame({'a': [0., 1., 2., 3.]}, index=[2, 3, 4, 5])

In [22]: df 
Out[22]: 
     a
2  0.0
3  1.0
4  2.0
5  3.0

In [23]: df[2:] 
Out[23]: 
     a
4  2.0
5  3.0

In [24]: df[:3]  
Out[24]: 
     a
2  0.0
3  1.0
4  2.0

These examples are for __getitem__, and they clearly work positionally if you look at the index of the results (on both 0.25 and 1.0, and for both Int64Index and RangeIndex).
So it is __setitem__ that is broken in 1.0.0.
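To make the positional reading concrete, here is a small sketch that uses only Python's built-in `slice.indices`, which resolves a slice against a sequence length the same way list slicing does. It does not call pandas at all; `positional_rows` is a hypothetical helper, and the `labels` list simply mirrors the index of the example frame above.

```python
def positional_rows(key, labels):
    """Return the labels selected when `key` is read as a positional slice."""
    # slice.indices clips start/stop/step to the given length, handling negatives.
    start, stop, step = key.indices(len(labels))
    return [labels[i] for i in range(start, stop, step)]


labels = [2, 3, 4, 5]  # index labels of the example frame above

rows_from_2 = positional_rows(slice(2, None), labels)  # mirrors df[2:]
rows_to_3 = positional_rows(slice(None, 3), labels)    # mirrors df[:3]

# The original report: a default 0..4 index sliced with -4:
rows_last_4 = positional_rows(slice(-4, None), list(range(5)))
```

Here `rows_from_2` is `[4, 5]` and `rows_to_3` is `[2, 3, 4]`, matching the index of the two outputs above, and `rows_last_4` is `[1, 2, 3, 4]`, the four rows that `X[-4:]` returns.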

@jorisvandenbossche jorisvandenbossche changed the title from "Indexing change with integer slices not in changelog?" to "REGR: __setitem__ with integer slices on Int/RangeIndex is broken (label instead of positional)" Jan 31, 2020
@jorisvandenbossche
Member

I think this is caused by #27383 (cc @jbrockmendel), specifically:

     def _setitem_slice(self, key, value):
         self._check_setitem_copy()
-        self.loc._setitem_with_indexer(key, value)
+        self.loc[key] = value
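A minimal sketch of why sending the slice through `.loc[key]` would change the outcome, using only the public `.loc` and `.iloc` accessors rather than the private method in the diff. The assumption is that `.loc` reads the integers in the slice as labels; on a sorted integer index, a label slice starting at -4 then covers every label greater than or equal to -4, which is all of them.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((5, 1)))  # default index with labels 0..4

# Positional reading: the last four rows.
positional = X.iloc[-4:]

# Label reading: every row whose label is >= -4, so the whole frame.
by_label = X.loc[-4:]

n_positional = len(positional)
n_by_label = len(by_label)
selected_labels = list(by_label.index)
```

With these two readings the counts are 4 and 5 respectively, which lines up with the report: reading with `X[-4:]` shows four rows, while the assignment in 1.0.0 wrote to all five.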

@amueller
Author

Thanks for investigating @jorisvandenbossche

@jorisvandenbossche
Member

BTW, I think this is a rather serious regression, since it doesn't give an error, but rather silently modifies/corrupts your data, and thus can silently lead to wrong results. We should probably try to do a 1.0.1 quickly.

@TomAugspurger
Contributor

Agreed. I won't be able to get to it this weekend, but perhaps Monday?

I'm hoping to fix up a bunch of the reported regressions today.

@jbrockmendel
Member

I'll start a branch reverting the lines @jorisvandenbossche identified and open a PR after confirming that fixes this.

After this is fixed for 1.0.1, we should discuss deprecating the surprising behavior.

6 participants