BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified #34364

Closed
rangeonnicolas opened this issue May 25, 2020 · 19 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves IO CSV read_csv, to_csv Segfault Non-Recoverable Error

Comments

@rangeonnicolas

Hello!
I discovered an unexpected behaviour while using pandas.

Code Sample

import pandas as pd

def process(df):
	df.index = df["A"]
	df.loc['first', 'A'] = 'first'

	print("Before")
	print(df)
	print()

	df.loc['first', 'B'] = 99

	print("After")
	print(df)


print("First processing : without csv file :\n")

df = pd.DataFrame({'A': ["first", "second"], 'B': [3,4]})
process(df)

print("\n\nSecond processing : via a csv file :\n")

df = pd.DataFrame({'A': ["first", "second"], 'B': [3,4]})
df.to_csv("df.csv",index=False)
df = pd.read_csv("df.csv")
process(df)

Problem description

The two outputs that process(df) produces differ depending on whether df has previously been round-tripped through a CSV file.

Expected output

First processing : without csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A   B
A                 
first    first  99
second  second   4


Second processing : via a csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A   B
A                 
first    first  99
second  second   4

Actual output

First processing : without csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A   B
A                 
first    first  99
second  second   4


Second processing : via a csv file :

Before
             A  B
A                
first    first  3
second  second  4

After
             A     B
A                   
first      NaN  99.0
second  second   4.0
first      NaN  99.0

Mea culpa

I guess that writing df.index = df["A"] is a bad practice because column A is still linked to the index, so when we modify column A, the index is changed as well. It could be good to raise a warning to tell silly people like me that I shouldn't do that.

Thank you, and have a cute day!

show_versions.txt

@MarcoGorelli
Member

MarcoGorelli commented May 25, 2020

If you do

df = df.set_index('A', drop=False)

then it should work the same in both cases. Not sure why it's different in your second case, but I will look into it. When I tried running it in a Jupyter notebook, the first example gave the same output as yours, while the line df.index = df["A"] from the second one caused the notebook to restart.

Thank you, and have a cute day!

You too!

@rangeonnicolas
Author

Very good, thanks for the reply and good luck ;)

@jbrockmendel
Member

welp i just got a segfault trying to reproduce this

@jbrockmendel
Member

jbrockmendel commented Sep 2, 2020

Minimal reproducer for segfault:

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "first"

print(df)

df.loc["first", "B"] = 99  # <-- also segfaults with just `df.loc["first", "B"]`

@jbrockmendel jbrockmendel added the Segfault Non-Recoverable Error label Sep 2, 2020
@jbrockmendel
Member

@realead something is going on in the PyObjectHashTable, could it have to do with PYTHONHASHSEED?

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "first"

df.index.get_loc(df.index[0])  # <-- non-deterministic

On this last line I'm getting zero in some runs, segfault in others, and the incorrect KeyError below in yet others.

>>> df.index.get_loc(df.index[0])
Traceback (most recent call last):
  File "/Users/brock/Desktop/pd/pandas/pandas/core/indexes/base.py", line 3333, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
    cpdef get_loc(self, object val):
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
    return self.mapping.get_item(val)
  File "pandas/_libs/hashtable_class_helper.pxi", line 5152, in pandas._libs.hashtable.PyObjectHashTable.get_item
    cpdef get_item(self, object val):
  File "pandas/_libs/hashtable_class_helper.pxi", line 5160, in pandas._libs.hashtable.PyObjectHashTable.get_item
    raise KeyError(val)
KeyError: 'first'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brock/Desktop/pd/pandas/pandas/core/indexes/base.py", line 3335, in get_loc
    raise KeyError(key) from err
KeyError: 'first'

@realead
Contributor

realead commented Apr 16, 2021

@jbrockmendel I'm not able to reproduce this issue.

Maybe you can run something like

for i in `seq 2 200`
do
    sum=`expr 33304055 + $i`
    # export PYTHONHASHSEED="random"
    export PYTHONHASHSEED="$sum"
    echo "python seed: $PYTHONHASHSEED"
    python testme.py
    echo "result $?"
done

with testme.py being your example. By scanning PYTHONHASHSEED values you could find one for which the example fails, and then check whether it always fails with that PYTHONHASHSEED value.
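
For context, a small side sketch (not from the thread; it only illustrates why seed scanning helps): PYTHONHASHSEED controls CPython's string hash randomization, and the PyObjectHashTable buckets objects by hash(), so where "first" lands in the table can differ from run to run unless the seed is pinned.

import os
import subprocess
import sys

# hash('first') changes between interpreter runs unless PYTHONHASHSEED is fixed,
# which is why the reproducer is run once per seed value above.
for seed in ("1", "2", "random"):
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('first'))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True,
        text=True,
    )
    print(f"PYTHONHASHSEED={seed}: hash('first') = {out.stdout.strip()}")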

@mroeschke mroeschke added Bug Indexing Related to indexing on series/frames, not to indexes themselves IO CSV read_csv, to_csv labels Aug 7, 2021
@roberthdevries
Contributor

This script fails 17 out of 20 times with a segfault with the current version from today:

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "1a"

df.loc["first", "A"] = "1a"     # produces core dump 17 out of 20 times

@roberthdevries
Contributor

What looks really weird to me is that after the first time setting ["first", "A"] to "1a", both the index and the data field change, whereas I would expect that only the data field changes.
Without the read_csv() round trip you can execute df.loc["first", "A"] = ... repeatedly without an issue.
From this I conclude that doing df.index = df["A"] doesn't do a deep copy of the values in column "A" into the index but uses a reference. Whenever df.loc["first", "A"] receives a new value, the value used by the index is updated as well, but the original value somehow stays around as far as loc is concerned. Displaying shows the value "1a" for both the index and column A, and you can still use df.loc["first", "A"] although a dump of the dataframe shows this:

             A  B
A                
1a          1a  3
second  second  4

Looking at the dump of the dataframe you would expect df.loc["first", "A"] to fail with a KeyError, but it doesn't. OTOH, df.loc["1a", "A"] does fail with a KeyError.
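
A quick check of that conclusion (a sketch, not from the thread; on pandas versions affected by this report it prints True, while after the proposed fix or under Copy-on-Write it should print False):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.index = df["A"]

# The Index and the column view the same buffer, so mutating the column
# silently mutates the index labels as well.
print(np.shares_memory(df.index.values, df["A"].values))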

@jbrockmendel
Member

I think I'm on to something, but will need @realead's help in actually turning it into a fix. TL;DR interning looks involved.

Three things I can do that seem to get rid of the KeyError/segfault:

  1. df.index._cache.pop("_engine") before the df.loc["first", "B"]
  2. change df.index = df["A"] to df.index = df["A"].copy(deep=True)
  3. change df.loc["first", "A"] = "first" to df.loc["first", "A"] = df.iloc[0, 0]

Before doing the roundtrip through read_csv, inspecting df.iloc[0,0] is "first" returns True (with a warning). After that roundtrip the same check returns False. So [mumble mumble] the hashtable [mumble mumble] is incorrectly missing.
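
A sketch of that identity check (using sys.intern to avoid the is-with-literal SyntaxWarning; df.csv is just the scratch file from the reproducer):

import sys
import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
print(df.iloc[0, 0] is sys.intern("first"))  # True: the source literal is interned

df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")
print(df.iloc[0, 0] is sys.intern("first"))  # False after the round trip: a freshly parsed str object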

@realead
Contributor

realead commented Feb 28, 2023

@jbrockmendel This is what happens IMO (I didn't debug it, just looked at the code, but it explains the observed behavior, so the chances are high):

The starting point is IndexEngine:

cdef class IndexEngine:

    cdef readonly:
        ndarray values
        HashTable mapping

    def __init__(self, ndarray values):
        self.values = values
        self.clear_mapping()

We must be aware of two things:
* the HashTable doesn't increment/decrement the reference counts of the PyObject pointers it stores, so if an object is deleted from values its pointer in the HashTable becomes dangling
* if we change a value in IndexEngine.values, we need to update IndexEngine.mapping as well.

At first sight, values and mapping cannot be changed from outside, so we don't have to worry. But because we don't copy the values passed in, IndexEngine.values can be changed from outside (e.g. df.loc["first", "A"] = "1a") without IndexEngine.mapping being adjusted. And this is what happens.

That explains the following observations:

  • Without dumping/reading to/from csv, the string "first" is interned, thus we don't have a dangling pointer in IndexEngine.mapping, but the values in IndexEngine.values ("1a", "second") do not match the values in IndexEngine.mapping (still "first", "second"). There is no segfault, but the results are not what one would expect.
  • With dumping/reading to/from csv, the string "first" in df is no longer interned and thus is deleted when df.loc["first", "A"] = "1a" is executed. The pointer in IndexEngine.mapping now becomes dangling; depending on whether the memory is reused, we either get a segfault or not.
  • After the index engine is deleted from the cache (df.index._cache.pop("_engine")), it is rebuilt from the correct values and thus we get correct results.

It probably should be:

    def __init__(self, ndarray values):
        self.values = values.copy()
        self.clear_mapping()

Maybe even a deep copy is needed, but currently I cannot provide an example of why.
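
For illustration, a sketch of remedy (1) from the earlier comment applied to the reproducer (it pokes at the private Index._cache, so it is illustrative only, not a recommended workaround):

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]
df.loc["first", "A"] = "first"

df.index._cache.pop("_engine", None)  # drop the stale engine so it is rebuilt
df.loc["first", "B"] = 99             # lookup now goes through a fresh hashtable
print(df)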

@jbrockmendel
Member

@jorisvandenbossche @phofl CoW should make it somewhat harder for a user to modify an Index's values by accident. Not sure what else we can do here since an extra copy is pretty heavy

@phofl
Member

phofl commented Mar 5, 2023

This shouldn’t be possible anymore under copy on write.

Doing something like df.index = df.A sets up a reference to the column, so updating the column triggers a copy.

We should add a test for this though.

Edit: My comment explains what should happen but this is buggy... Will look into it

@jorisvandenbossche
Member

doing something like df.index = df.A

I would say we should consider it a bug that this doesn't create a copy of the data of "A" when creating the index. This is essentially the same issue as #42934 about the Index(series) constructor not creating a copy, leading to wrong behaviour (and potentially crashes).

With CoW we could indeed avoid the copy if Index is included in the mechanism (#51803), but on the short term for the default behaviour, I think we should consider copying the data.
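
In the meantime, a user-side workaround sketch (not the proposed library fix): copy the column's values explicitly before assigning them as the index, so the index engine never aliases the DataFrame's own block.

import pandas as pd

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = pd.Index(df["A"].to_numpy(copy=True), name="A")
df.loc["first", "A"] = "first"
df.loc["first", "B"] = 99  # no stale hashtable entries, no segfault
print(df)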

@jorisvandenbossche
Member

I opened a draft PR trying out the idea of always copying (except if it's already an index) in #51930, which would fix this for the default path (the CoW work of Patrick could then optimize this by again avoiding some of those copies when CoW is enabled).

@jorisvandenbossche jorisvandenbossche changed the title Unexpected behaviour BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified Mar 13, 2023
@jbrockmendel
Member

@seberg occasionally I'll somehow find myself with a read-only np.ndarray that is so very read-only that I can't modify the flags or create a writeable view. Is there any way to do this intentionally?

@seberg
Contributor

seberg commented May 3, 2023

@seberg occasionally I'll somehow find myself with a read-only np.ndarray that is so very read-only that I can't modify the flags or create a writeable view. Is there any way to do this intentionally?

Haha, yes:

import numpy as np

def make_really_really_readonly(arr):
    if arr.base is None:
        # create a new view; it is weird that this actually is enough!
        arr = arr[...]

    # walk the whole chain of bases and mark every ancestor read-only
    base = arr
    while isinstance(base, np.ndarray):
        base.flags.writeable = False
        base = base.base

    return arr

I.e. you need a NumPy array with no writeable array ancestor, otherwise NumPy will be sloppy and say: Well, clearly that can be written to.

Another way would be if arr.base is not an array at all (which might be more sane).

What is a bit odd here is that if you have an array without a base and set it to read-only, it can be set back, but if you insert one view, you would have to start with the base. TBH, I doubt that is intended...

In any case, I doubt this is what you want. Rather, you probably have to use a hack that goes via __array_interface__ (or similar) so that the base object is not an array.
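
A minimal sketch of such a hack (the helper class name here is made up): expose the buffer through a non-ndarray base whose __array_interface__ marks the data as read-only, so the WRITEABLE flag cannot be flipped back on.

import numpy as np

class _ReadOnlyBase:
    """Non-ndarray base object exposing the original buffer as read-only."""
    def __init__(self, arr):
        self._arr = arr  # keep the original array alive
        iface = dict(arr.__array_interface__)
        iface["data"] = (iface["data"][0], True)  # (pointer, read-only flag)
        self.__array_interface__ = iface

def make_truly_readonly(arr):
    # The view's base is a _ReadOnlyBase instance, not an ndarray, so NumPy
    # refuses to set the WRITEABLE flag back to True.
    return np.asarray(_ReadOnlyBase(arr))

a = make_truly_readonly(np.arange(5))
try:
    a.flags.writeable = True
except ValueError as exc:
    print(exc)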

@seberg
Contributor

seberg commented May 3, 2023

(Sorry, while that explains maybe the why/when, it probably doesn't give a reasonable/good path to get such an array. If you need that maybe we should add something in NumPy?)

@tylerjereddy
Contributor

This hit me today for column mutation (the actual workflow from the work project was not as contrived):

import numpy as np
import pandas as pd

data = np.ones((2, 9), dtype=np.float64)
df = pd.DataFrame(data=data, columns=["A"] + ["0"] * 8)
df.to_excel("test.xlsx")
df = pd.read_excel("test.xlsx")
df = df.merge(df, on=["A"])
df.columns.values[3:] = 0
print(df)  # even the print is needed to cause the crash...

On main branch it is "ok" though:

Traceback (most recent call last):
  File "/tmp/repro.py", line 9, in <module>
    df.columns.values[3:] = 0
    ~~~~~~~~~~~~~~~~~^^^^
ValueError: assignment destination is read-only

@phofl
Member

phofl commented Feb 23, 2024

Yes, copy on write protects against this, except if you very intentionally want to modify the index values and actively circumvent the rules, which is something we can’t protect against completely.

Closing here
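
For reference, a sketch of the original reproducer with Copy-on-Write enabled (assuming pandas 2.x, where the mode.copy_on_write option exists; the expected behaviour follows the earlier comment that updating the column triggers a copy rather than mutating the Index):

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"A": ["first", "second"], "B": [3, 4]})
df.to_csv("df.csv", index=False)
df = pd.read_csv("df.csv")

df.index = df["A"]              # sets up a reference to the column
df.loc["first", "A"] = "first"  # updating the column triggers a copy instead of mutating the Index
df.loc["first", "B"] = 99       # no stale hashtable entries, no segfault
print(df)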

@phofl phofl closed this as completed Feb 23, 2024