-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use lazy copy for dropna #50429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
020de93
6816c53
0c2e9a3
942f3f3
6e80af7
35cb354
560a72c
fd112a1
d48f830
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1947,9 +1947,17 @@ def _blklocs(self): | |
"""compat with BlockManager""" | ||
return None | ||
|
||
def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager: | ||
def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager: | ||
# similar to get_slice, but not restricted to slice indexer | ||
blk = self._block | ||
if ( | ||
_using_copy_on_write() | ||
and isinstance(indexer, np.ndarray) | ||
and len(indexer) > 0 | ||
and com.is_bool_indexer(indexer) | ||
and indexer.all() | ||
): | ||
return type(self)(blk, self.index, [weakref.ref(blk)], parent=self) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this special case needed here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I see we only do the check higher up in DataFrame.dropna, while this one is for Series.dropna. (I am still not fully sure if I find it worth the added complexity though) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd add It for now, can remove if we don't think it's worth the complexity in the long run? |
||
array = blk._slice(indexer) | ||
if array.ndim > 1: | ||
# This will be caught by Series._get_values | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -339,6 +339,40 @@ def test_add_suffix(using_copy_on_write): | |
tm.assert_frame_equal(df, df_orig) | ||
|
||
|
||
@pytest.mark.parametrize("axis, val", [(0, 5.5), (1, np.nan)]) | ||
def test_dropna(using_copy_on_write, axis, val): | ||
df = DataFrame({"a": [1, 2, 3], "b": [4, val, 6], "c": "d"}) | ||
df_orig = df.copy() | ||
df2 = df.dropna(axis=axis) | ||
|
||
if using_copy_on_write: | ||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
else: | ||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
Comment on lines
+512
to
+515
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can also use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current one seems a bit easier to read to me and also makes it easier to identify what was hit. but not a strong opinion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is) |
||
|
||
df2.iloc[0, 0] = 0 | ||
if using_copy_on_write: | ||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
tm.assert_frame_equal(df, df_orig) | ||
|
||
|
||
@pytest.mark.parametrize("val", [5, 5.5]) | ||
def test_dropna_series(using_copy_on_write, val): | ||
ser = Series([1, val, 4]) | ||
ser_orig = ser.copy() | ||
ser2 = ser.dropna() | ||
|
||
if using_copy_on_write: | ||
assert np.shares_memory(ser2.values, ser.values) | ||
else: | ||
assert not np.shares_memory(ser2.values, ser.values) | ||
|
||
ser2.iloc[0] = 0 | ||
if using_copy_on_write: | ||
assert not np.shares_memory(ser2.values, ser.values) | ||
tm.assert_series_equal(ser, ser_orig) | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"method", | ||
[ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type hint is incorrect, we can also get here with an int indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this type hint is incorrect, then also the one in
Series._get_values
(since that is the only place where this is being used)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only see
getitem_mgr
used inSeries._get_values
, and this in itself is only used inSeries._slice
andSeries.__getitem__
after acom.is_bool_indexer(key)
check.So then
Series._slice
should actually be used for more than a slice and its type annotation is also wrong?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it seems that
_convert_slice_indexer
inpandas/pandas/core/series.py
Lines 960 to 964 in 99859e4
is not guaranteed to return a slice object (only in case of DatetimeIndex, I think, through
slice_indexer
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look more closely over the next days
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep you are correct. We can get there via both paths, so np.ndarray is probably correct.
Update _slice as well