BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified #34364
If you do `df = df.set_index('A', drop=False)`, then it should work the same in both cases. Not sure why it's different in your second case, but I will look into it. When I tried running it in a Jupyter notebook, the first example gave your same output, while the line …
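As a hedged sketch (illustrative, not from the original thread), the difference between the two spellings discussed above looks like this:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# Risky spelling from the report: the new Index can end up wrapping
# the same buffer as column "A" (in pre-copy-on-write pandas).
df_alias = df.copy()
df_alias.index = df_alias["A"]

# Spelling suggested in the comment: keep "A" both as a column and as
# the index, going through pandas' own set_index machinery.
df_safe = df.copy().set_index("A", drop=False)
print(df_safe.index.tolist())  # [1, 2, 3]
```

Both frames end up indexed by the values of "A"; the question the thread explores is whether the index additionally aliases the column's memory.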
You too!
Very good, thanks for the reply and good luck ;)
Welp, I just got a segfault trying to reproduce this.
Minimal reproducer for segfault:
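The reproducer's code block did not survive the page capture. As an illustration of the underlying failure mode only (not the original snippet), the sketch below uses a plain dict to mimic what happens when an Index's cached hash table goes stale after its backing buffer is mutated:

```python
import numpy as np

keys = np.array([1.0, 2.0, 3.0])
index_view = keys  # an "Index" aliasing the column's buffer

# pandas builds a hash table over the index values on first lookup;
# a plain dict stands in for that cache here.
lookup = {k: i for i, k in enumerate(index_view)}

keys[0] = 99.0  # in-place edit of the "column" also rewrites the "index"

assert 1.0 in lookup           # stale: the cache still knows the old key
assert index_view[0] == 99.0   # but the actual data has changed
assert 99.0 not in lookup      # the new value is unfindable -> KeyError
```

In real pandas the cache lives in C (the hashtable code discussed below), where a mismatch between cached and actual values can surface as a wrong result, a KeyError, or a segfault.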
@realead something is going on in the ObjectHashTable; could it have to do with PYTHONHASHSEED?
On this last line I'm getting zero in some runs, segfault in others, and the incorrect KeyError below in yet others.
@jbrockmendel I'm not able to reproduce this issue. Maybe you can run something like … with …
This script fails with a segfault in 17 out of 20 runs with the current version from today: …
What looks really weird to me is that after the first time setting … If you look at the dump of the dataframe, you would say that …
I think I'm on to something, but will need @realead's help in actually turning it into a fix. TL;DR interning looks involved. Three things I can do that seem to get rid of the KeyError/segfault:
Before doing the roundtrip through read_csv, inspecting …
@jbrockmendel That is what happens IMO (I didn't debug it, just looked at the code, but it explains the observed behavior, so the chances are high): the starting point is …
We must be aware of two things: … At first sight, … That explains the following observations:
It probably should be:
maybe even deepcopy is needed, but currently I cannot provide an example of why.
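The suggested fix itself was lost in the capture, but the idea under discussion is to copy the values when building the index instead of wrapping the column's buffer. A minimal sketch of that idea (my wording, not realead's code):

```python
import numpy as np

col = np.array([1.0, 2.0, 3.0])

# Copy the data instead of aliasing the column's buffer directly.
# For flat numeric arrays a plain copy severs the aliasing; for
# object-dtype arrays a shallow copy still shares the boxed elements,
# which is presumably why deepcopy comes up in the comment above.
index_data = col.copy()

col[0] = 99.0  # later in-place edits to the column...

assert index_data[0] == 1.0                    # ...leave the index data intact
assert not np.shares_memory(col, index_data)
```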
@jorisvandenbossche @phofl CoW should make it somewhat harder for a user to modify an Index's values by accident. Not sure what else we can do here, since an extra copy is pretty heavy.
This shouldn’t be possible anymore under copy-on-write when doing something like `df.index = df.A`. We should add a test about this, though. Edit: my comment explains what should happen, but this is buggy... will look into it.
I would say we should consider it a bug that this doesn't create a copy of the data of "A" when creating the index. This is essentially the same issue as #42934 about the … With CoW we could indeed avoid the copy if Index is included in the mechanism (#51803), but in the short term, for the default behaviour, I think we should consider copying the data.
I opened a draft PR trying out the idea of always copying (except if it's already an Index) in #51930, which would fix this for the default path (the CoW work of Patrick could then optimize this by again avoiding some of those copies when CoW is enabled).
@seberg occasionally I'll somehow find myself with a read-only np.ndarray that is so very read-only that I can't modify the flags or create a writeable view. Is there any way to do this intentionally?
Haha, yes:
I.e. you need a NumPy array with no writeable array ancestor, otherwise NumPy will be sloppy and say: well, clearly that can be written to. Another way would be if … What is a bit odd here is that if you have an array without a base and set it to non-writeable, it can be set back; but if you insert one view, you would have to start with the base. TBH, I doubt that is intended... In any case, I doubt this is what you want. Rather, you probably have to use a hack that goes via …
(Sorry, while that explains maybe the why/when, it probably doesn't give a reasonable/good path to get such an array. If you need that maybe we should add something in NumPy?) |
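One concrete way to get such a "very read-only" array, assuming current NumPy semantics: build it on top of an immutable buffer, so there is no writeable ancestor to fall back on. A sketch:

```python
import numpy as np

# Backed by an immutable bytes object: no writeable ancestor anywhere,
# so NumPy refuses to ever flip the WRITEABLE flag back on.
a = np.frombuffer(b"\x00" * 24, dtype=np.float64)
assert a.flags.writeable is False
try:
    a.flags.writeable = True
    flipped = True
except ValueError:
    flipped = False
assert flipped is False  # cannot be made writeable, not even via a view

# By contrast, a view of a writeable array can be locked and unlocked,
# because its base still exposes a writeable buffer.
b = np.zeros(3)
v = b[:]
v.flags.writeable = False
v.flags.writeable = True  # allowed again: the ultimate owner is writeable
```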
This hit me today for column mutation (the actual workflow from the work project was not as contrived):

```python
import numpy as np
import pandas as pd

data = np.ones((2, 9), dtype=np.float64)
df = pd.DataFrame(data=data, columns=["A"] + ["0"] * 8)
df.to_excel("test.xlsx")
df = pd.read_excel("test.xlsx")
df = df.merge(df, on=["A"])
df.columns.values[3:] = 0
print(df)  # even the print is needed to cause the crash
```
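Related to the snippet above: mutating `df.columns.values` writes into the Index's buffer behind pandas' back. A hedged sketch (illustrative, not from the thread) of the rebinding spelling that avoids this:

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["A", "B", "C"])

# Rebind a fresh label list instead of poking df.columns.values in place;
# pandas then builds a brand-new Index (and hash table) from it.
df.columns = ["A", "B2", "C2"]
print(list(df.columns))  # ['A', 'B2', 'C2']
```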
Yes, copy-on-write protects against this, except if you very intentionally want to modify the index values and actively circumvent the rules, which is something we can’t protect against completely. Closing here.
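For readers on pandas versions without copy-on-write, an explicit copy sidesteps the aliasing on any version; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})
df.index = df["A"].copy()  # defensive copy: the index owns its own data

df.iloc[0, 0] = 99.0       # in-place edit of column "A"

assert list(df.index) == [1.0, 2.0, 3.0]  # index is untouched
assert df.loc[1.0, "A"] == 99.0           # lookups stay consistent
```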
Hello!
I discovered an unexpected behaviour while using pandas.
Code Sample
…
Problem description
The two outputs that `process(df)` produces are different depending on whether `df` has previously been exported to a csv file or not.
Expected output
…
Actual output
…
Mea culpa
I guess that writing `df.index = df["A"]` is a bad practice, because column `A` is still linked to the index. So when we modify column `A`, the index is changed as well. It could be good to raise a Warning to tell silly people like me that I shouldn't do that.
Thank you, and have a cute day!
show_versions.txt