BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified #34364

Closed
rangeonnicolas opened this issue May 25, 2020 · 19 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves IO CSV read_csv, to_csv Segfault Non-Recoverable Error

Comments

@rangeonnicolas

Hello!
I discovered an unexpected behaviour while using pandas.

Code Sample

import pandas as pd

def process(df):
	df.index = df["A"]
	df.loc['first', 'A'] = 'first'

	print("Before")
	print(df)
	print()

	df.loc['first', 'B'] = 99

	print("After")
	print(df)


print("First processing : without csv file :\n")

df = pd.DataFrame({'A': ["first", "second"], 'B': [3,4]})
process(df)

print("\n\nSecond processing : via a csv file :\n")

df = pd.DataFrame({'A': ["first", "second"], 'B': [3,4]})
df.to_csv("df.csv",index=False)
df = pd.read_csv("df.csv")
process(df)

Problem description

The two outputs that process(df) produces differ depending on whether df has previously been round-tripped through a CSV file.

Expected output

First processing : without csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A   B
A                 
first    first  99
second  second   4


Second processing : via a csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A   B
A                 
first    first  99
second  second   4

Actual output

First processing : without csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A   B
A                 
first    first  99
second  second   4


Second processing : via a csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A     B
A                   
first      NaN  99.0
second  second   4.0
first      NaN  99.0

Mea culpa

I guess that writing df.index = df["A"] is a bad practice because column A is still linked to the index, so when we modify column A, the index is changed as well. It could be good to raise a warning to tell silly people like me that I shouldn't do that.

Thank you, and have a cute day!

show_versions.txt

@MarcoGorelli
Member

MarcoGorelli commented May 25, 2020

If you do

df = df.set_index('A', drop=False)

then it should work the same in both cases. Not sure why it's different in your second case, but I will look into it. When I tried running it in a Jupyter notebook, the first example gave the same output as yours, while the line df.index = df["A"] from the second one caused the notebook to restart.

Thank you, and have a cute day!

You too!

@rangeonnicolas
Author

Very good, thanks for the reply and good luck ;)

@jbrockmendel
Member

welp i just got a segfault trying to reproduce this

@jbrockmendel
Member

jbrockmendel commented Sep 2, 2020

Minimal reproducer for segfault:

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "first"

print(df)

df.loc["first", "B"] = 99  # <-- also segfaults with just `df.loc["first", "B"]`

@jbrockmendel jbrockmendel added the Segfault Non-Recoverable Error label Sep 2, 2020
@jbrockmendel
Member

@realead something is going on in the PyObjectHashTable, could it have to do with PYTHONHASHSEED?

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "first"

df.index.get_loc(df.index[0])  # <-- non-deterministic

On this last line I'm getting zero in some runs, segfault in others, and the incorrect KeyError below in yet others.

>>> df.index.get_loc(df.index[0])
Traceback (most recent call last):
  File "/Users/brock/Desktop/pd/pandas/pandas/core/indexes/base.py", line 3333, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
    cpdef get_loc(self, object val):
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
    return self.mapping.get_item(val)
  File "pandas/_libs/hashtable_class_helper.pxi", line 5152, in pandas._libs.hashtable.PyObjectHashTable.get_item
    cpdef get_item(self, object val):
  File "pandas/_libs/hashtable_class_helper.pxi", line 5160, in pandas._libs.hashtable.PyObjectHashTable.get_item
    raise KeyError(val)
KeyError: 'first'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brock/Desktop/pd/pandas/pandas/core/indexes/base.py", line 3335, in get_loc
    raise KeyError(key) from err
KeyError: 'first'

@realead
Contributor

realead commented Apr 16, 2021

@jbrockmendel I'm not able to reproduce this issue.

Maybe you can run something like

for i in `seq 2 200`
do
    sum=`expr 33304055 + $i`
    # export PYTHONHASHSEED="random"
    export PYTHONHASHSEED="$sum"
    echo "python seed: $PYTHONHASHSEED"
    python testme.py
    echo "result $?"
done

with testme.py being your example. By scanning PYTHONHASHSEED values you could find one for which the example fails, and then check whether it always fails with that PYTHONHASHSEED value.
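
For context, a small side sketch (not from the thread; it only illustrates why seed scanning helps): PYTHONHASHSEED controls CPython's string hash randomization, and the PyObjectHashTable buckets objects by hash(), so where "first" lands in the table can differ from run to run unless the seed is pinned.

import os
import subprocess
import sys

# hash('first') changes between interpreter runs unless PYTHONHASHSEED is fixed,
# which is why the reproducer is run once per seed value above.
for seed in ("1", "2", "random"):
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('first'))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True,
        text=True,
    )
    print(f"PYTHONHASHSEED={seed}: hash('first') = {out.stdout.strip()}")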

@mroeschke mroeschke added Bug Indexing Related to indexing on series/frames, not to indexes themselves IO CSV read_csv, to_csv labels Aug 7, 2021
@roberthdevries
Contributor

This script fails 17 out of 20 times with a segfault with the current version from today:

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "1a"

df.loc["first", "A"] = "1a"     # produces core dump 17 out of 20 times

@roberthdevries
Contributor

What looks really weird to me is that after the first time setting ["first", "A"] to "1a", both the index and the data field change, whereas I would expect that only the data field changes.
Without the read_csv() round trip you can execute df.loc["first", "A"] = ... repeatedly without an issue.
From this I conclude that doing df.index = df["A"] doesn't do a deep copy of the values in column "A" into the index but uses a reference. Whenever df.loc["first", "A"] receives a new value, the value used by the index is updated as well, but the original value somehow stays around as far as loc is concerned. Displaying shows the value "1a" for both the index and column A, and you can still use df.loc["first", "A"] although a dump of the dataframe shows this:

             A  B
A                
1a          1a  3
second  second  4

Looking at the dump of the dataframe you would expect df.loc["first", "A"] to fail with a KeyError, but it doesn't. OTOH, df.loc["1a", "A"] does fail with a KeyError.
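
A quick check of that conclusion (a sketch, not from the thread; on pandas versions affected by this report it prints True, while after the proposed fix or under Copy-on-Write it should print False):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.index = df["A"]

# The Index and the column view the same buffer, so mutating the column
# silently mutates the index labels as well.
print(np.shares_memory(df.index.values, df["A"].values))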

@jbrockmendel
Member

I think I'm on to something, but will need @realead's help in actually turning it into a fix. TL;DR interning looks involved.

Three things I can do that seem to get rid of the KeyError/segfault:

  1. df.index._cache.pop("_engine") before the df.loc["first", "B"]
  2. change df.index = df["A"] to df.index = df["A"].copy(deep=True)
  3. change df.loc["first", "A"] = "first" to df.loc["first", "A"] = df.iloc[0, 0]

Before doing the roundtrip through read_csv, inspecting df.iloc[0,0] is "first" returns True (with a warning). After that roundtrip the same check returns False. So [mumble mumble] the hashtable [mumble mumble] is incorrectly missing.
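
A sketch of that identity check (using sys.intern to avoid the is-with-literal SyntaxWarning; df.csv is just the scratch file from the reproducer):

import sys
import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
print(df.iloc[0, 0] is sys.intern("first"))  # True: the source literal is interned

df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")
print(df.iloc[0, 0] is sys.intern("first"))  # False after the round trip: a freshly parsed str object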

@realead
Contributor

realead commented Feb 28, 2023

@jbrockmendel This is what happens IMO (I didn't debug it, just looked at the code, but it explains the observed behavior, so the chances are high):

The starting point is IndexEngine:

cdef class IndexEngine:

    cdef readonly:
        ndarray values
        HashTable mapping

    def __init__(self, ndarray values):
        self.values = values
        self.clear_mapping()

We must be aware of two things:
* the HashTable doesn't increment/decrement the reference counts of the PyObject pointers it stores, so if an object is deleted from values its pointer in the HashTable becomes dangling
* if we change a value in IndexEngine.values, we need to update IndexEngine.mapping as well.

At first sight, values and mapping cannot be changed from outside, so we don't have to worry. But because we don't copy the values passed in, IndexEngine.values can be changed from outside (e.g. df.loc["first", "A"] = "1a") without IndexEngine.mapping being adjusted. And this is what happens.

That explains the following observations:

  • Without dumping/reading to/from csv, the string "first" is interned, thus we don't have a dangling pointer in IndexEngine.mapping, but the values in IndexEngine.values ("1a", "second") do not match the values in IndexEngine.mapping (still "first", "second"). There is no segfault, but the results are not what one would expect.
  • With dumping/reading to/from csv, the string "first" in df is no longer interned and thus is deleted when df.loc["first", "A"] = "1a" is executed. The pointer in IndexEngine.mapping now becomes dangling; depending on whether the memory is reused, we either get a segfault or not.
  • After the index engine is deleted from the cache (df.index._cache.pop("_engine")), it is rebuilt from the correct values and thus we get correct results.

It probably should be:

    def __init__(self, ndarray values):
        self.values = values.copy()
        self.clear_mapping()

Maybe even a deep copy is needed, but currently I cannot provide an example of why.
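
For illustration, a sketch of remedy (1) from the earlier comment applied to the reproducer (it pokes at the private Index._cache, so it is illustrative only, not a recommended workaround):

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "first"

df.index._cache.pop("_engine", None)  # drop the stale engine so it is rebuilt
df.loc["first", "B"] = 99             # lookup now goes through a fresh hashtable
print(df)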

@jbrockmendel
Member

@jorisvandenbossche @phofl CoW should make it somewhat harder for a user to modify an Index's values by accident. Not sure what else we can do here since an extra copy is pretty heavy

@phofl
Member

phofl commented Mar 5, 2023

This shouldn’t be possible anymore under copy on write.

Doing something like df.index = df.A sets up a reference to the column, so updating the column triggers a copy.

We should add a test for this though.

Edit: My comment explains what should happen but this is buggy... Will look into it

@jorisvandenbossche
Member

doing something like df.index = df.A

I would say we should consider it a bug that this doesn't create a copy of the data of "A" when creating the index. This is essentially the same issue as #42934 about the Index(series) constructor not creating a copy, leading to wrong behaviour (and potentially crashes).

With CoW we could indeed avoid the copy if Index is included in the mechanism (#51803), but on the short term for the default behaviour, I think we should consider copying the data.
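
In the meantime, a user-side workaround sketch (not the proposed library fix): copy the column's values explicitly before assigning them as the index, so the index engine never aliases the DataFrame's own block.

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = pd.Index(df["A"].to_numpy(copy=True), name="A")
df.loc["first", "A"] = "first"
df.loc["first", "B"] = 99  # no stale hashtable entries, no segfault
print(df)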

@jorisvandenbossche
Member

I opened a draft PR trying out the idea of always copying (except if it's already an index) in #51930, which would fix this for the default path (the CoW work of Patrick could then optimize this by again avoiding some of those copies when CoW is enabled).

@jorisvandenbossche jorisvandenbossche changed the title Unexpected behaviour BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified Mar 13, 2023
@jbrockmendel
Member

@seberg occasionally I'll somehow find myself with a read-only np.ndarray that is so very read-only that I can't modify the flags or create a writeable view. Is there any way to do this intentionally?

@seberg
Contributor

seberg commented May 3, 2023

@seberg occasionally I'll somehow find myself with a read-only np.ndarray that is so very read-only that I can't modify the flags or create a writeable view. Is there any way to do this intentionally?

Haha, yes:

import numpy as np

def make_really_really_readonly(arr):
    if arr.base is None:
        # create a new view; it is weird that this actually is enough!
        arr = arr[...]

    # walk the whole chain of bases and mark every ancestor read-only
    base = arr
    while isinstance(base, np.ndarray):
        base.flags.writeable = False
        base = base.base

    return arr

I.e. you need a NumPy array with no writeable array ancestor, otherwise NumPy will be sloppy and say: Well, clearly that can be written to.

Another way would be if arr.base is not an array at all (which might be more sane).

What is a bit odd here is that if you have an array without a base and set it to read-only, it can be set back, but if you insert one view, you would have to start with the base. TBH, I doubt that is intended...

In any case, I doubt this is what you want. Rather, you probably have to use a hack that goes via __array_interface__ (or similar) so that the base object is not an array.
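
A minimal sketch of such a hack (the helper class name here is made up): expose the buffer through a non-ndarray base whose __array_interface__ marks the data as read-only, so the WRITEABLE flag cannot be flipped back on.

import numpy as np

class _ReadOnlyBase:
    """Non-ndarray base object exposing the original buffer as read-only."""
    def __init__(self, arr):
        self._arr = arr  # keep the original array alive
        iface = dict(arr.__array_interface__)
        iface["data"] = (iface["data"][0], True)  # (pointer, read-only flag)
        self.__array_interface__ = iface

def make_truly_readonly(arr):
    # The view's base is a _ReadOnlyBase instance, not an ndarray, so NumPy
    # refuses to set the WRITEABLE flag back to True.
    return np.asarray(_ReadOnlyBase(arr))

a = make_truly_readonly(np.arange(5))
try:
    a.flags.writeable = True
except ValueError as exc:
    print(exc)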

@seberg
Contributor

seberg commented May 3, 2023

(Sorry, while that explains maybe the why/when, it probably doesn't give a reasonable/good path to get such an array. If you need that maybe we should add something in NumPy?)

@tylerjereddy
Contributor

This hit me today for column mutation (the actual workflow from the work project was not as contrived):

import numpy as np
import pandas as pd

data = np.ones((2, 9), dtype=np.float64)
df = pd.DataFrame(data=data, columns=["A"] + ["0"] * 8)
df.to_excel("test.xlsx")
df = pd.read_excel("test.xlsx")
df = df.merge(df, on=["A"])
df.columns.values[3:] = 0
print(df)  # even the print is needed to cause the crash...

On main branch it is "ok" though:

Traceback (most recent call last):
  File "/tmp/repro.py", line 9, in <module>
    df.columns.values[3:] = 0
    ~~~~~~~~~~~~~~~~~^^^^
ValueError: assignment destination is read-only

@phofl
Member

phofl commented Feb 23, 2024

Yes, copy on write protects against this, except if you very intentionally want to modify the index values and actively circumvent the rules, which is something we can’t protect against completely.

Closing here
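
For reference, a sketch of the original reproducer with Copy-on-Write enabled (assuming pandas 2.x, where the mode.copy_on_write option exists; the expected behaviour follows the earlier comment that updating the column triggers a copy rather than mutating the Index):

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]              # sets up a reference to the column
df.loc["first", "A"] = "first"  # updating the column triggers a copy instead of mutating the Index
df.loc["first", "B"] = 99       # no stale hashtable entries, no segfault
print(df)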

@phofl phofl closed this as completed Feb 23, 2024