-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use lazy copy for dropna #50429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1947,9 +1947,17 @@ def _blklocs(self): | |||
"""compat with BlockManager""" | |||
return None | |||
|
|||
def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager: | |||
def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type hint is incorrect, we can also get here with an int indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this type hint is incorrect, then also the one in Series._get_values
(since that is the only place where this is being used)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only see getitem_mgr
used in Series._get_values
, and this in itself is only used in Series._slice
and Series.__getitem__
after a com.is_bool_indexer(key)
check.
So then Series._slice
should actually be used for more than a slice and its type annotation is also wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it seems that _convert_slice_indexer
in
Lines 960 to 964 in 99859e4
if isinstance(key, slice): | |
# _convert_slice_indexer to determine if this slice is positional | |
# or label based, and if the latter, convert to positional | |
slobj = self.index._convert_slice_indexer(key, kind="getitem") | |
return self._slice(slobj) |
is not guaranteed to return a slice object (only in case of DatetimeIndex, I think, through slice_indexer
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look more closely over the next days
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep you are correct. We can get there via both paths, so np.ndarray is probably correct.
Update _slice as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Whatsnew note?
Same as align, did not add notes for any of those already merged |
if using_copy_on_write: | ||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
else: | ||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also use assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) is using_copy_on_write
. To me seems a bit clearer, but feel free to leave it as is if you find it more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current one seems a bit easier to read to me and also makes it easier to identify what was hit. but not a strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)
@@ -1947,9 +1947,17 @@ def _blklocs(self): | |||
"""compat with BlockManager""" | |||
return None | |||
|
|||
def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager: | |||
def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this type hint is incorrect, then also the one in Series._get_values
(since that is the only place where this is being used)
and com.is_bool_indexer(indexer) | ||
and indexer.all() | ||
): | ||
return type(self)(blk, self.index, [weakref.ref(blk)], parent=self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this special case needed here?
For example for dropna
itself, this check is already done on a level higher up. And so doing that here again might mean the check is done multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see we only do the check higher up in DataFrame.dropna, while this one is for Series.dropna.
(I am still not fully sure if I find it worth the added complexity though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add It for now, can remove if we don't think it's worth the complexity in the long run?
if using_copy_on_write: | ||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
else: | ||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.