Skip to content

ENH: Use lazy copy for dropna #50429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jan 13, 2023
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -6329,7 +6329,7 @@ def dropna(
raise ValueError(f"invalid how option: {how}")

if np.all(mask):
result = self.copy()
result = self.copy(deep=None)
else:
result = self.loc(axis=axis)[mask]

Expand Down
10 changes: 9 additions & 1 deletion pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1947,9 +1947,17 @@ def _blklocs(self):
"""compat with BlockManager"""
return None

def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager:
def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hint is incorrect, we can also get here with an int indexer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this type hint is incorrect, then also the one in Series._get_values (since that is the only place where this is being used)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see getitem_mgr used in Series._get_values, and this in itself is only used in Series._slice and Series.__getitem__ after a com.is_bool_indexer(key) check.
So then Series._slice should actually be used for more than a slice and its type annotation is also wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems that _convert_slice_indexer in

if isinstance(key, slice):
# _convert_slice_indexer to determine if this slice is positional
# or label based, and if the latter, convert to positional
slobj = self.index._convert_slice_indexer(key, kind="getitem")
return self._slice(slobj)

is not guaranteed to return a slice object (only in case of DatetimeIndex, I think, through slice_indexer)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look more closely over the next days

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you are correct. We can get there via both paths, so np.ndarray is probably correct.

Update _slice as well

# similar to get_slice, but not restricted to slice indexer
blk = self._block
if (
_using_copy_on_write()
and isinstance(indexer, np.ndarray)
and len(indexer) > 0
and com.is_bool_indexer(indexer)
and indexer.all()
):
return type(self)(blk, self.index, [weakref.ref(blk)], parent=self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this special case needed here?
For example for dropna itself, this check is already done on a level higher up. And so doing that here again might mean the check is done multiple times?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see we only do the check higher up in DataFrame.dropna, while this one is for Series.dropna.

(I am still not fully sure if I find it worth the added complexity though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add It for now, can remove if we don't think it's worth the complexity in the long run?

array = blk._slice(indexer)
if array.ndim > 1:
# This will be caught by Series._get_values
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -5564,7 +5564,7 @@ def dropna(
return result
else:
if not inplace:
return self.copy()
return self.copy(deep=None)
return None

# ----------------------------------------------------------------------
Expand Down
34 changes: 34 additions & 0 deletions pandas/tests/copy_view/test_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -339,6 +339,40 @@ def test_add_suffix(using_copy_on_write):
tm.assert_frame_equal(df, df_orig)


@pytest.mark.parametrize("axis, val", [(0, 5.5), (1, np.nan)])
def test_dropna(using_copy_on_write, axis, val):
df = DataFrame({"a": [1, 2, 3], "b": [4, val, 6], "c": "d"})
df_orig = df.copy()
df2 = df.dropna(axis=axis)

if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
Comment on lines +512 to +515
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) is using_copy_on_write. To me seems a bit clearer, but feel free to leave it as is if you find it more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current one seems a bit easier to read to me and also makes it easier to identify what was hit. but not a strong opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)


df2.iloc[0, 0] = 0
if using_copy_on_write:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
tm.assert_frame_equal(df, df_orig)


@pytest.mark.parametrize("val", [5, 5.5])
def test_dropna_series(using_copy_on_write, val):
ser = Series([1, val, 4])
ser_orig = ser.copy()
ser2 = ser.dropna()

if using_copy_on_write:
assert np.shares_memory(ser2.values, ser.values)
else:
assert not np.shares_memory(ser2.values, ser.values)

ser2.iloc[0] = 0
if using_copy_on_write:
assert not np.shares_memory(ser2.values, ser.values)
tm.assert_series_equal(ser, ser_orig)


@pytest.mark.parametrize(
"method",
[
Expand Down