Skip to content

ENH: Avoid copying unnecessary columns in setitem by splitting blocks for CoW #51031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 10, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Jan 27, 2023

xref #48998

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@phofl
Copy link
Member Author

phofl commented Feb 9, 2023

I think this is ready, will look into the single block case separately. The split allows us to only copy columns that are modified

@jorisvandenbossche jorisvandenbossche changed the title ENH: Avoid copying unnecessary columns in indexing for CoW ENH: Avoid copying unnecessary columns in setitem by splitting blocks for CoW Feb 10, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked (previously) at the _iset_split_block logic in detail, but a few minor comments/questions

)
def test_set_value_loc_copy_only_necessary_column(using_copy_on_write, method, val):
# Only copy column that is modified, multi-block only for now
df = DataFrame({"a": [1, 2, 3], "b": 1, "c": 1.5})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df = DataFrame({"a": [1, 2, 3], "b": 1, "c": 1.5})
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

(small nitpick, but almost every test in this file is already using that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

values = values.copy()
else:
values = values[[blk_loc]]
self._iset_split_block(blkno, [blk_loc], values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to call this for the values.ndim == 1 case that already copied the values, or does this just also does the "wrapping in block and updating self.blocks" for that case?

And _iset_split_block assumes that values is always already copied, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and Yes

Not calling it would duplicate the logic

Comment on lines 900 to 901
def test_set_value_loc_copy_only_necessary_column(using_copy_on_write, method, val):
# Only copy column that is modified, multi-block only for now
Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_set_value_loc_copy_only_necessary_column(using_copy_on_write, method, val):
# Only copy column that is modified, multi-block only for now
def test_set_value_copy_only_necessary_column(using_copy_on_write, method, val):
# When setting inplace, only copy column that is modified instead of the whole
# block (by splitting the block)
# TODO multi-block only for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

df_orig = df.copy()
view = df[:]

df.loc[0, "a"] = val
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use method ? Hmm, but that doesn't work for setitem. So you will have to parametrize with tuples of (method, indexer), so you can do method(df)[indexer] = val?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's probably why I missed it. That's a pita...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the 2 inline suggestions, looks good to me!

@phofl phofl merged commit c74d057 into pandas-dev:main Feb 10, 2023
@phofl phofl deleted the loc_avoid_copy branch February 10, 2023 17:55
phofl added a commit to phofl/pandas that referenced this pull request Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants