Skip to content

API / BUG: copy non-Index arrays in Index construction to avoid data corruption #51930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

Would close #34364 and #42934

The idea here is that since the Index is assumed to be immutable and caches things like the hashtable, we should avoid that a user can actually mutate those values through normal pandas functionality by making a copy when creating an Index from array-like data.
I made the exception for an object that is already an Index (since this is immutable, you can't mutate this through normal pandas functionality, and typically this will already made a copy when first being created). This should avoid too many copies on repeated ensure_index calls.

Still need to add more tests (covering the segfault case), docs, and do the same in the Index subclasses' __new__, in case we want something like this.

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jorisvandenbossche
Copy link
Member Author

@phofl I should have first looked at the tests you already added in #51803 .., but so this might be duplicating part of the ones you also wrote (and if we want to do this, I can probably take some of the tests you added there)

@jorisvandenbossche jorisvandenbossche added Index Related to the Index class or subclasses Copy / view semantics labels Mar 13, 2023
@jbrockmendel
Copy link
Member

Do we want to distinguish priority-wise between the segfaults generated in #34364 vs more general incorrect-but-not-segfaulting cases that can occur when a user modifies index._values?

The general case I think is hard-bordering-on-impossible unless we make a copy in the index.pyx code that the user cannot access. Unless we're willing to go that far, my inclination is to avoid making extra copies to avoid the perf hit since this is a reasonably rare problem (that becomes even more rare with CoW IIUC).

For the segfault I wonder if we can use a hashing mechanism that doesn't depend on interning?

@jorisvandenbossche
Copy link
Member Author

Do you have an idea how to fix just the segfault? How easy / feasible is the "use a hashing mechanism that doesn't depend on interning" idea? (I am not familiar with our hashing code, so genuinely don't know)

And I also think the incorrect behaviour is quite problematic. Actually, with a segfault you know something went wrong and you need to fix "something" (although finding out the "something" can be hard in this case, since you don't have a clear error). While with the incorrect behaviour, you just can get silently wrong results in your analysis.

The general case I think is hard-bordering-on-impossible unless we make a copy in the index.pyx code that the user cannot access.

I am doing the copy here on Index construction, to avoid doing a copy when creating the engine (which I think decreases the overall number of extra copies)

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 15, 2023
@simonjayhawkins
Copy link
Member

@jorisvandenbossche what's the status here?

@phofl
Copy link
Member

phofl commented Feb 6, 2024

We can close, that’s not an issue anymore with copy on write In 3.0

@phofl phofl closed this Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Index Related to the Index class or subclasses Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: wrong behaviour / segfaults when data from Index have been unintentionally modified
4 participants