BUG: should constructing Index from a Series make a copy? #42934

Open
jorisvandenbossche opened this issue Aug 8, 2021 · 8 comments
Labels
Bug Copy / view semantics Index Related to the Index class or subclasses

Comments

@jorisvandenbossche
Member

From a comment of @jbrockmendel at #41878 (comment):

ser = pd.Series(range(5))
idx = pd.Index(ser)
ser[0] = 10
>>> idx[0]
10

In the above, we create an Index from a Series and then mutate the Series. The mutation also changes the Index, even though an Index is assumed to be immutable.
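
As a possible workaround in user code, one can ask the constructor for a copy explicitly. This is a minimal sketch, assuming that the `copy=True` argument of `pd.Index` copies the underlying data as its documentation describes; the variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series(range(5))

# Ask the Index constructor for its own copy of the data
# instead of (possibly) sharing the Series' buffer.
idx = pd.Index(ser, copy=True)

# Mutate the Series positionally; the Index holds separate data,
# so its first element should be unaffected.
ser.iloc[0] = 10
print(idx[0])
```

This only sidesteps the problem for code that builds the Index itself; it does not change what `pd.Index(ser)` does by default.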

Changing the example a bit, you can obtain wrong values with indexing this way:

ser = pd.Series(range(5))
idx = pd.Index(ser)

ser.index = idx

>>> ser[0]
0
>>> ser.iloc[0] = 10
>>> ser[0]
10
>>> ser
10    10
1      1
2      2
3      3
4      4
dtype: int64

So ser[0] still returns a result, even though the key 0 no longer exists in the Series' index at that point.

I know that we generally consider this a user error when it is done with a numpy array (idx = pd.Index(arr), followed by mutating the array), but here you get the same result using only high-level pandas objects. In that case, shouldn't we prevent this from happening?
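
To check whether an Index and a Series are backed by the same memory, something like the following could be used. This is a rough diagnostic sketch: `shares_buffer` is a hypothetical helper, not part of pandas, and it is only meaningful for NumPy-backed dtypes. What the default constructor reports may depend on the pandas version, so no output is claimed for that case.

```python
import numpy as np
import pandas as pd


def shares_buffer(index: pd.Index, series: pd.Series) -> bool:
    """Return True if the two objects are backed by overlapping memory.

    Hypothetical helper for debugging; assumes NumPy-backed dtypes.
    """
    return bool(np.shares_memory(index.to_numpy(), series.to_numpy()))


ser = pd.Series(range(5))
idx_default = pd.Index(ser)             # result may depend on the pandas version
idx_copied = pd.Index(ser, copy=True)   # explicit copy

print("default constructor shares memory:", shares_buffer(idx_default, ser))
print("copy=True shares memory:", shares_buffer(idx_copied, ser))
```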

@jorisvandenbossche jorisvandenbossche added Bug Index Related to the Index class or subclasses Copy / view semantics labels Aug 8, 2021
@jorisvandenbossche
Member Author

A similar example with a DataFrame and set_index (so without even calling Index(ser) explicitly) has the same problem. IMO it illustrates well that this is not a "user error" but something we should fix on pandas' side:

In [33]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

In [36]: df.set_index("a", drop=False, inplace=True)

In [37]: df
Out[37]: 
   a  b    c
a           
1  1  4  0.1
2  2  5  0.2
3  3  6  0.3

In [38]: df.loc[1, 'a']
Out[38]: 1

In [39]: df.iloc[0, 0] = 10

In [40]: df
Out[40]: 
     a  b    c
a             
10  10  4  0.1
2    2  5  0.2
3    3  6  0.3

In [41]: df.loc[1, 'a']
Out[41]: 10
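
For the DataFrame case, a user-side workaround might look like the sketch below. It builds the index from an explicit copy of the column instead of relying on set_index; this is an assumption about what avoids the sharing, not a statement of how pandas should fix it.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

# Build the index from an explicit copy of column "a" so that
# later writes to the column cannot reach the index data.
df.index = pd.Index(df["a"].to_numpy(copy=True), name="a")

df.iloc[0, 0] = 10  # overwrite the first value of column "a"

print(df.index.tolist())  # the index labels
print(df.loc[1, "a"])     # look up the row that is still labelled 1
```

With the copy in place, the row label 1 stays valid and the lookup returns the updated column value.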

@attack68
Contributor

attack68 commented Aug 8, 2021

This probably doesn't fit in this thread, but for consideration alongside this issue: while an Index is not mutable, index.names is. Recently this threw me in the following case:

idx = pd.Index(["a", "b"])
df = pd.DataFrame([[1,2],[3,4]], columns=idx, index=idx)
df.index.names = ["zzz"]

zzz  a  b
zzz
a    1  2
b    3  4

I wasn't expecting the columns' names to be changed; I intuitively expected the constructor to create a new index object.

I agree your examples should be corrected.
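
If independent names are wanted, passing a separate Index object for one of the axes is one way to get them. A minimal sketch, using `Index.copy()` to create the second object:

```python
import pandas as pd

idx = pd.Index(["a", "b"])

# Give the columns their own Index object so that renaming the
# row index cannot affect the name of the column labels.
df = pd.DataFrame([[1, 2], [3, 4]], columns=idx.copy(), index=idx)
df.index.names = ["zzz"]

print(df.index.name)
print(df.columns.name)
```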

@jorisvandenbossche
Member Author

Hmm, since you are explicitly passing the same Index object as columns and rows index, this can maybe be considered as expected behaviour. Not really sure .. (but indeed a different issue).

@rhshadrach
Member

Since properties of the index are assumed immutable, they are cached, and this can lead to invalid states:

ser = pd.Series(range(2))
idx = pd.Index(ser)
idx.is_monotonic_increasing
ser[0] = 10
print(idx)
print('Is monotonic increasing:', idx.is_monotonic_increasing)

gives

Int64Index([10, 1], dtype='int64')
Is monotonic increasing: True
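
A stale cache of this kind can be detected by comparing the property against a freshly constructed Index. The helper below is hypothetical and meant for debugging only; the example uses `copy=True`, so that the Index cannot be changed through the Series and the check is expected to pass.

```python
import pandas as pd


def cached_monotonic_is_consistent(index: pd.Index) -> bool:
    """Compare the (possibly cached) property with a freshly built Index.

    Hypothetical helper for debugging; not part of pandas.
    """
    fresh = pd.Index(index.to_numpy(copy=True))
    return bool(index.is_monotonic_increasing) == bool(fresh.is_monotonic_increasing)


ser = pd.Series(range(2))
idx = pd.Index(ser, copy=True)  # explicit copy, so writes to ser cannot reach idx
idx.is_monotonic_increasing     # populate the cache
ser.iloc[0] = 10

print(idx.tolist())
print(cached_monotonic_is_consistent(idx))
```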

@ehansis

ehansis commented Sep 15, 2021

It seems that this issue caused random SIGSEGV and SIGBUS crashes in my code (macOS, pandas=1.1.4=py38hcf432d8_0 from conda-forge). Unfortunately, I cannot reproduce them in a minimal example. My code looks something like this:

df = pd.DataFrame([
    [... data here ...]
], columns=["code", "value", "foo"])
df = df.set_index("code", drop=False)
df.loc["some_code", "code"] = "abc"
df.loc["some_other_code", "code"] = "def"

This modifies the index, as described above. However, if I repeat the line

df.loc["some_other_code", "code"] = "def"

once more, the assignment sometimes works (apparently re-using the previous index values) and sometimes crashes the interpreter.

@rhshadrach
Member

rhshadrach commented Sep 16, 2021

@ehansis - thanks for adding this here; however, this appears to be a separate issue. If you add print(id(df.index)), you should see different ids, e.g.

139902856885584
139902875398688

before and after the df.loc["some_code", "code"] = "abc" line. This is because pandas is not modifying an index, but rather creating and replacing one.
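
The same check can be written with an identity comparison instead of printed ids. This is a sketch with a hypothetical helper, `index_was_replaced`; the toy data and the `relabel` function are illustrative and are not the code from the report above.

```python
import pandas as pd


def index_was_replaced(df: pd.DataFrame, mutate) -> bool:
    """Run mutate(df) and report whether df.index is now a different object.

    Hypothetical helper. Holding a reference to the old index and comparing
    with `is` avoids being misled by id() values that can be reused after
    an object has been garbage-collected.
    """
    before = df.index
    mutate(df)
    return df.index is not before


def relabel(frame: pd.DataFrame) -> None:
    # Assigning a new Index object replaces the old one.
    frame.index = pd.Index(["x", "y"], name="code")


df = pd.DataFrame({"code": ["p", "q"], "value": [1, 2]}).set_index("code", drop=False)

print(index_was_replaced(df, lambda frame: None))  # nothing is changed
print(index_was_replaced(df, relabel))             # index is reassigned
```

Passing the suspect assignment as `mutate` shows whether it modifies the existing index in place or swaps in a new one.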

@ehansis

ehansis commented Sep 16, 2021

@rhshadrach OK, thanks, I'll try to check that if I ever manage to get the segmentation faults reproduced. Let me know if I can be of further help.

@jorisvandenbossche
Member Author

BTW, the other way around, when constructing a Series from an Index, we do take a copy to avoid such issues.
In Series.__init__:

elif isinstance(data, Index):
    if dtype is not None:
        # astype copies
        data = data.astype(dtype)
    else:
        # GH#24096 we need to ensure the index remains immutable
        data = data._values.copy()
    copy = False
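
The effect of that branch can be observed from the outside with a short sketch: mutating a Series built from an Index should leave the Index untouched.

```python
import pandas as pd

idx = pd.Index([0, 1, 2])
ser = pd.Series(idx)  # the Series gets its own data, per the excerpt above

ser.iloc[0] = 10      # mutate the Series only

print(idx[0])
print(ser.iloc[0])
```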


4 participants