BUG: should constructing Index from a Series make a copy? #42934

jorisvandenbossche · 2021-08-08T14:09:04Z

From a comment of @jbrockmendel at #41878 (comment):

ser = pd.Series(range(5))
idx = pd.Index(ser)
ser[0] = 10
>>> idx[0]
10

In the above, we create an Index from a Series and then mutate the Series. The mutation also updates the Index, even though an Index is assumed to be immutable.

Changing the example a bit, you can obtain wrong values with indexing this way:

ser = pd.Series(range(5))
idx = pd.Index(ser)

ser.index = idx

>>> ser[0]
0
>>> ser.iloc[0] = 10
>>> ser[0]
10
>>> ser
10    10
1      1
2      2
3      3
4      4
dtype: int64

So ser[0] still returns a result, even though that key no longer exists in the Series' index at that point.

I know that we generally consider this a user error when you do it with a numpy array (idx = pd.Index(arr), then mutate the array), but here you get this behaviour using only high-level pandas objects. In that case, shouldn't we prevent this from happening?
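As a minimal sketch of a user-side workaround (an assumption for illustration, not a fix proposed in this thread), you can pass copy=True to the Index constructor so the Index owns its own buffer. The np.shares_memory check is only a diagnostic; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

ser = pd.Series(range(5))

# Explicitly request a copy so the Index does not view the Series' data.
idx = pd.Index(ser, copy=True)

# Diagnostic: after an explicit copy the two buffers should be distinct.
shares = np.shares_memory(ser.to_numpy(), idx.to_numpy())

# Mutate the Series positionally; the copied Index should be unaffected.
ser.iloc[0] = 10
first_label = idx[0]
```

With the explicit copy, mutating the Series should leave the Index untouched regardless of how a given pandas version handles the no-copy case.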


jorisvandenbossche · 2021-08-08T14:10:45Z

A similar example, but with a DataFrame and set_index (so even without explicitly doing Index(ser)), has the same problem. IMO it is a good illustration that this is not a "user error" but something we should fix on pandas' side:

In [33]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

In [36]: df.set_index("a", drop=False, inplace=True)

In [37]: df
Out[37]: 
   a  b    c
a           
1  1  4  0.1
2  2  5  0.2
3  3  6  0.3

In [38]: df.loc[1, 'a']
Out[38]: 1

In [39]: df.iloc[0, 0] = 10

In [40]: df
Out[40]: 
     a  b    c
a             
10  10  4  0.1
2    2  5  0.2
3    3  6  0.3

In [41]: df.loc[1, 'a']
Out[41]: 10
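A hedged sketch of how the DataFrame case could be sidestepped from user code: build the index from an explicit copy of the column and pass that Index to set_index, so the column stays in the frame without sharing data with the index. This is an illustrative workaround under that assumption, not the change discussed for pandas itself; safe_index is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

# Copy column "a" into a standalone Index before using it as the row index.
safe_index = pd.Index(df["a"], copy=True, name="a")
df = df.set_index(safe_index)  # column "a" is kept because we passed an Index

# Mutate the column positionally; the index labels should stay as they were.
df.iloc[0, 0] = 10

index_labels = df.index.tolist()
value_at_label_1 = df.loc[1, "a"]
```

Here df.loc[1, "a"] returning 10 is consistent: the row is still labelled 1, and only the column value changed.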

attack68 · 2021-08-08T16:27:42Z

This probably doesn't fit in this thread, but one thing to consider alongside this issue: while an index is not mutable, index.names is. Recently this threw me in the following case:

idx = pd.Index(["a", "b"])
df = pd.DataFrame([[1,2],[3,4]], columns=idx, index=idx)
df.index.names = ["zzz"]

zzz  a  b
zzz
a    1  2
b    3  4

I wasn't expecting the columns' names to be changed; I intuitively expected the constructor to create a new index object.

I agree your examples should be corrected.

jorisvandenbossche · 2021-08-09T07:44:07Z

Hmm, since you are explicitly passing the same Index object as columns and rows index, this can maybe be considered as expected behaviour. Not really sure .. (but indeed a different issue).
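For the names case, one option that avoids mutating a shared Index in place is DataFrame.rename_axis, which returns a new DataFrame with a renamed axis. The sketch below assumes the same Index object is passed for both axes, as in the example above; the variable names are illustrative only.

```python
import pandas as pd

idx = pd.Index(["a", "b"])
df = pd.DataFrame([[1, 2], [3, 4]], columns=idx, index=idx)

# rename_axis returns a new DataFrame instead of setting names on the
# Index object that both axes were constructed from.
renamed = df.rename_axis(index="zzz")

index_names = list(renamed.index.names)
column_names = list(renamed.columns.names)
original_name = idx.name
```

Only the row axis of the returned frame carries the new name; the columns and the original idx keep theirs.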

rhshadrach · 2021-08-17T20:34:41Z

Since properties of the index are assumed immutable, they are cached, and this can lead to invalid states:

ser = pd.Series(range(2))
idx = pd.Index(ser)
idx.is_monotonic_increasing
ser[0] = 10
print(idx)
print('Is monotonic increasing:', idx.is_monotonic_increasing)

gives

Int64Index([10, 1], dtype='int64')
Is monotonic increasing: True
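As a rough illustration of why the cached property only goes stale when data is shared, here is the same sequence with an explicitly copied Index. This is a sketch under the assumption that copy=True gives the Index its own data; it does not show how pandas should fix the shared case.

```python
import pandas as pd

ser = pd.Series(range(2))

# The Index gets its own copy of the data, so its cached properties
# describe values that later Series mutations cannot change.
idx = pd.Index(ser, copy=True)

before = idx.is_monotonic_increasing  # computed once, then cached
ser.iloc[0] = 10                      # mutates only the Series
after = idx.is_monotonic_increasing   # cached value, still valid

index_values = idx.tolist()
```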

ehansis · 2021-09-15T08:17:30Z

It seems that this issue caused random SIGSEGV and SIGBUS in my code (MacOS, pandas=1.1.4=py38hcf432d8_0 from conda_forge). Unfortunately, I cannot reproduce them in a minimal example. My code looks something like this:

df = pd.DataFrame([
    [... data here ...]
], columns=["code", "value", "foo"])
df = df.set_index("code", drop=False)
df.loc["some_code", "code"] = "abc"
df.loc["some_other_code", "code"] = "def"

This modifies the index, as described above. However, if I repeat the line

df.loc["some_other_code", "code"] = "def"

once more, the assignment sometimes works (apparently re-using the previous index values) and sometimes crashes the interpreter.

rhshadrach · 2021-09-16T03:09:20Z

@ehansis - thanks for adding in here; however, this appears to be a separate issue. If you add print(id(df.index)), you should see different ids, e.g.

139902856885584
139902875398688

before and after the df.loc["some_code", "code"] = "abc" line. This is because pandas is not modifying an index, but rather creating and replacing one.

ehansis · 2021-09-16T08:11:55Z

@rhshadrach OK, thanks, I'll try to check that if I ever manage to get the segmentation faults reproduced. Let me know if I can be of further help.

jorisvandenbossche · 2022-11-04T14:19:37Z

BTW, the other way around, when constructing a Series from an Index, we do take a copy to avoid such issues.
In Series.__init__:

pandas/pandas/core/series.py

Lines 401 to 409 in 57d8d3a

    
    elif isinstance(data, Index):
        if dtype is not None:
            # astype copies
            data = data.astype(dtype)
        else:
            # GH#24096 we need to ensure the index remains immutable
            data = data._values.copy()
        copy = False
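The effect of that branch can be checked from user code with a small sketch like the following (variable names are illustrative): mutating a Series built from an Index should leave the Index unchanged.

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# Constructing a Series from an Index: the Series must not be able to
# write through to the Index's data.
ser = pd.Series(idx)
ser.iloc[0] = 10

index_values = idx.tolist()
series_values = ser.tolist()
```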

jorisvandenbossche added the Bug, Index (Related to the Index class or subclasses), and Copy / view semantics labels Aug 8, 2021

jorisvandenbossche mentioned this issue Aug 8, 2021

Proof of concept for Copy-on-Write implementation #41878

Closed

1 task

rhshadrach mentioned this issue Jan 11, 2023

API/BUG: pd.concat doesn't copy indexes if with axis=1 and copy=True when they are the same #50673

Open

This was referenced Mar 13, 2023

BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified #34364

Closed

API / BUG: copy non-Index arrays in Index construction to avoid data corruption #51930

Closed

jorisvandenbossche mentioned this issue Nov 10, 2023

CoW warning mode: add warning for single block setitem #55838

Merged
