
CoW: Add reference tracking to index when created from series #51803


Merged
merged 14 commits into pandas-dev:main from cow_index_ref_tracking
Mar 15, 2023


phofl
Member

@phofl phofl commented Mar 5, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

cc @jbrockmendel this takes a shot at #34364 under CoW. It is still a rough PoC that is missing tests for the other index classes, but I would appreciate it if you could take a quick look. This avoids modifying an index by modifying the object it was created from.
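
To illustrate the idea in the description, here is a toy model. It is not pandas code: `ToyRefs`, `ToySeries`, and `ToyIndex` are hypothetical stand-ins, and the weakref-based bookkeeping is only an assumption about how such tracking could look. The sketch shows a series-like object copying its buffer before a write because an index built from it still shares that buffer.

```python
import weakref


class ToyRefs:
    """Hypothetical tracker: remembers which objects share one buffer."""

    def __init__(self):
        self._holders = []  # weak references to objects using the buffer

    def add(self, holder):
        self._holders.append(weakref.ref(holder))

    def discard(self, holder):
        # Forget dead holders and the given holder.
        alive = [r for r in self._holders if r() is not None]
        self._holders = [r for r in alive if r() is not holder]

    def has_other_reference(self):
        # Drop dead weakrefs, then check whether more than one holder is alive.
        self._holders = [r for r in self._holders if r() is not None]
        return len(self._holders) > 1


class ToySeries:
    def __init__(self, values):
        self._values = list(values)
        self._references = ToyRefs()
        self._references.add(self)

    def set(self, pos, value):
        if self._references.has_other_reference():
            # Copy on write: detach from the shared buffer before mutating.
            self._references.discard(self)
            self._values = list(self._values)
            self._references = ToyRefs()
            self._references.add(self)
        self._values[pos] = value


class ToyIndex:
    def __init__(self, series):
        # No copy: share the buffer and register with the same tracker.
        self._values = series._values
        self._references = series._references
        self._references.add(self)


ser = ToySeries([1, 2, 3])
idx = ToyIndex(ser)
shared_before = idx._values is ser._values  # buffer is shared, no copy yet
ser.set(0, 100)                             # triggers the copy in ToySeries
```

After the write, `ser` holds its own modified copy while `idx` still sees the original values, which is the behaviour the PR aims for when an Index is created from a Series.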

@phofl phofl marked this pull request as draft March 5, 2023 23:26
@phofl phofl added the "Index" (Related to the Index class or subclasses) and "Copy / view semantics" labels Mar 5, 2023
Comment on lines +486 to +487
if not copy and isinstance(data, (ABCSeries, Index)):
    refs = data._references
Member

Something I was also wondering about in my related PR: this tackles it for Series and Index, but in theory we also have the problem with arrays:

arr = np.array([1, 2, 3])
ser = pd.Series(arr, index=arr)

By then mutating ser, you can also trigger faulty behaviour or crashes.

So while for Index/Series this avoids a copy, we might still want to copy anyway for other array-likes (like we are going to do in the DataFrame/Series constructors)
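
The aliasing hazard behind this comment can be shown without pandas. The sketch below uses the stdlib `array` and `memoryview` types as stand-ins for a NumPy array and for two containers that keep the buffer without copying it; the variable names are illustrative only. A raw buffer carries no reference tracker, so the only safe option in this model is a defensive copy at construction time.

```python
from array import array

# Stand-in for a NumPy array: a mutable, typed buffer.
arr = array("q", [1, 2, 3])

# Two "containers" that keep the buffer without copying it.
values_view = memoryview(arr)
index_view = memoryview(arr)

# Writing through one view is visible through the other.
values_view[0] = 100
aliased = index_view.tolist()

# A defensive copy taken at construction time stays isolated.
arr2 = array("q", [1, 2, 3])
index_copy = array("q", arr2)  # copies the elements into a new buffer
arr2[0] = 100
isolated = index_copy.tolist()
```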

Member Author

Yep, exactly. But I wanted to wait for the PR that tackles this for the DataFrame case to get merged before adding this for Index.

@jorisvandenbossche
Member

Currently, this doesn't yet create new BlockValuesRefs objects, right? It only stores them on the Index if the input data already has refs.

For example, consider idx1 = pd.Index([1, 2, 3]); idx2 = pd.Index(idx1). Then idx1 doesn't have a ref object (because it was created from a list), so idx2 doesn't have one either?
That's fine as long as we only use this for tracking Series -> Index construction. But for the Index -> Series direction (creating a Series from an Index currently explicitly copies the data to avoid all those issues), if we wanted to optimize that as well, we would need to start creating the ref object as well?
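
The propagation rule discussed here can be sketched with a toy model. Nothing below is pandas code: `ToyRefs`, `ToySeries`, and `ToyIndex` are hypothetical, and the constructor only mirrors the shape of the two reviewed lines (reuse the input's `_references` when not copying, never create a new tracker).

```python
class ToyRefs:
    """Hypothetical stand-in for a reference-tracking object."""


class ToySeries:
    def __init__(self, values):
        self._values = list(values)
        self._references = ToyRefs()  # in this model a series always has one


class ToyIndex:
    def __init__(self, data, copy=False):
        refs = None
        if not copy and isinstance(data, (ToySeries, ToyIndex)):
            # Reuse whatever tracker the input has; never create a new one.
            refs = data._references
            values = data._values
        else:
            # Lists and explicit copies get a fresh buffer and no tracker.
            values = list(getattr(data, "_values", data))
        self._values = values
        self._references = refs


idx1 = ToyIndex([1, 2, 3])                     # built from a list: no tracker
idx2 = ToyIndex(idx1)                          # shares the buffer, inherits None
idx_from_ser = ToyIndex(ToySeries([1, 2, 3]))  # inherits the series' tracker
```

In this model `idx1` and `idx2` share a buffer yet neither carries a tracker, which is the gap the comment points out for a possible Index -> Series optimization.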

@jorisvandenbossche
Member

And the approach generally looks good. It's a bit unfortunate that we will need to start keeping track of this in index operations as well (considering which ones give views, etc.); it would have been nice not to have to do this. But the only alternative is to always copy data when creating an Index, so if we want to optimize this, it is unavoidable.

@phofl
Member Author

phofl commented Mar 13, 2023

Yes, exactly. We are doing a copy in this case right now, if I am not mistaken, so I kept it out of here on purpose. But this is certainly a sensible follow-up.

@phofl phofl marked this pull request as ready for review March 13, 2023 18:13
@phofl
Member Author

phofl commented Mar 13, 2023

So you'd say this is generally ok to merge?

@jorisvandenbossche
Member

For me, yes

@phofl
Member Author

phofl commented Mar 14, 2023

Rebasing once more

@phofl phofl added this to the 2.0 milestone Mar 15, 2023
@phofl
Member Author

phofl commented Mar 15, 2023

merging to get started with the follow ups

@phofl phofl merged commit b02ffe2 into pandas-dev:main Mar 15, 2023
@phofl phofl deleted the cow_index_ref_tracking branch March 15, 2023 20:43
@lumberbot-app

lumberbot-app bot commented Mar 15, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
git checkout 2.0.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 b02ffe2673879d3d2922f89fa3d02fb59156f9ef
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #51803: CoW: Add reference tracking to index when created from series'
  4. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-51803-on-2.0.x
  5. Create a PR against branch 2.0.x. I would have named this PR:

"Backport PR #51803 on branch 2.0.x (CoW: Add reference tracking to index when created from series)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.


Parameters
----------
index: object
Member

This is an Index object, right? (Even if we can't put it in the annotation.)

Member Author

Yes

Labels
Copy / view semantics Index Related to the Index class or subclasses

3 participants