Skip to content

ENH: Use lazy copy for dropna #50429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jan 13, 2023
Merged

ENH: Use lazy copy for dropna #50429

merged 9 commits into from
Jan 13, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 24, 2022

@@ -1947,9 +1947,17 @@ def _blklocs(self):
"""compat with BlockManager"""
return None

def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager:
def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hint is incorrect, we can also get here with an int indexer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this type hint is incorrect, then also the one in Series._get_values (since that is the only place where this is being used)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see getitem_mgr used in Series._get_values, and this in itself is only used in Series._slice and Series.__getitem__ after a com.is_bool_indexer(key) check.
So then Series._slice should actually be used for more than a slice and its type annotation is also wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems that _convert_slice_indexer in

if isinstance(key, slice):
# _convert_slice_indexer to determine if this slice is positional
# or label based, and if the latter, convert to positional
slobj = self.index._convert_slice_indexer(key, kind="getitem")
return self._slice(slobj)

is not guaranteed to return a slice object (only in case of DatetimeIndex, I think, through slice_indexer)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look more closely over the next days

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you are correct. We can get there via both paths, so np.ndarray is probably correct.

Update _slice as well

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Whatsnew note?

@phofl
Copy link
Member Author

phofl commented Dec 27, 2022

Same as align, did not add notes for any of those already merged

Comment on lines +301 to +304
if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) is using_copy_on_write. To me seems a bit clearer, but feel free to leave it as is if you find it more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current one seems a bit easier to read to me and also makes it easier to identify what was hit. but not a strong opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)

@@ -1947,9 +1947,17 @@ def _blklocs(self):
"""compat with BlockManager"""
return None

def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager:
def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this type hint is incorrect, then also the one in Series._get_values (since that is the only place where this is being used)

and com.is_bool_indexer(indexer)
and indexer.all()
):
return type(self)(blk, self.index, [weakref.ref(blk)], parent=self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this special case needed here?
For example for dropna itself, this check is already done on a level higher up. And so doing that here again might mean the check is done multiple times?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see we only do the check higher up in DataFrame.dropna, while this one is for Series.dropna.

(I am still not fully sure if I find it worth the added complexity though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add It for now, can remove if we don't think it's worth the complexity in the long run?

Comment on lines +301 to +304
if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)

@phofl phofl merged commit b37589a into pandas-dev:main Jan 13, 2023
@phofl phofl deleted the cow_dropna branch January 13, 2023 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants