ENH: Use lazy copy for dropna #50429

phofl · 2022-12-24T14:31:26Z

xref ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl · 2022-12-24T17:48:43Z

pandas/core/internals/managers.py

@@ -1947,9 +1947,17 @@ def _blklocs(self):
        """compat with BlockManager"""
        return None

-    def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager:
+    def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager:


The type hint is incorrect; we can also get here with an int indexer.

If this type hint is incorrect, then so is the one in Series._get_values (since that is the only place where this is used).

I only see getitem_mgr used in Series._get_values, and this in itself is only used in Series._slice and Series.__getitem__ after a com.is_bool_indexer(key) check.
So Series._slice is actually used for more than a slice, and its type annotation is also wrong?

Yes, it seems that _convert_slice_indexer in

pandas/pandas/core/series.py, lines 960 to 964 in 99859e4

        if isinstance(key, slice):
            # _convert_slice_indexer to determine if this slice is positional
            # or label based, and if the latter, convert to positional
            slobj = self.index._convert_slice_indexer(key, kind="getitem")
            return self._slice(slobj)

is not guaranteed to return a slice object (only in case of DatetimeIndex, I think, through slice_indexer)

Will look more closely over the next days

Yep you are correct. We can get there via both paths, so np.ndarray is probably correct.

Update _slice as well
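To illustrate why the broader annotation fits, here is a minimal NumPy-only sketch. The helper `getitem_values` is hypothetical and is not the pandas method; it only shows that a slice, a boolean mask, and an integer position array are all valid indexers, so an annotation limited to boolean arrays would be too narrow.

```python
from __future__ import annotations  # lets `slice | np.ndarray` parse on older Pythons

import numpy as np


def getitem_values(values: np.ndarray, indexer: slice | np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an indexing helper; not the pandas method."""
    return values[indexer]


values = np.array([10, 20, 30, 40])

# A slice uses basic indexing and returns a view on the same memory.
by_slice = getitem_values(values, slice(1, 3))

# A boolean mask uses advanced indexing and returns a new array.
by_mask = getitem_values(values, np.array([True, False, True, False]))

# An integer position array is also advanced indexing and also copies.
by_position = getitem_values(values, np.array([0, 3]))
```

All three calls pass through the same signature, which is the situation the thread describes for the two code paths.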

mroeschke

Looks good. Whatsnew note?

phofl · 2022-12-27T19:36:29Z

Same as for align: I did not add whatsnew notes for any of those already merged.

datapythonista · 2022-12-29T09:13:48Z

pandas/tests/copy_view/test_methods.py

+    if using_copy_on_write:
+        assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
+    else:
+        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))


You can also use assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) is using_copy_on_write. To me it seems a bit clearer, but feel free to leave it as is if you find it more readable.

The current one seems a bit easier to read to me and also makes it easier to identify which branch was hit, but I don't have a strong opinion.

It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)
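For comparison, the two assertion styles discussed above can be sketched with plain NumPy arrays. The helper names `check_branching` and `check_identity` are made up for this illustration, and the arrays stand in for what `get_array` would return in the real tests.

```python
import numpy as np


def check_branching(result, original, using_copy_on_write):
    # Style used in the test file: one explicit branch per mode.
    if using_copy_on_write:
        assert np.shares_memory(result, original)
    else:
        assert not np.shares_memory(result, original)


def check_identity(result, original, using_copy_on_write):
    # Suggested one-liner: compare the boolean result with the flag.
    assert np.shares_memory(result, original) is using_copy_on_write


original = np.arange(5)
lazy_result = original[:]       # a view, as expected with Copy-on-Write
eager_result = original.copy()  # a real copy, as expected without it

for check in (check_branching, check_identity):
    check(lazy_result, original, True)
    check(eager_result, original, False)
```

Both helpers accept and reject the same inputs. In the branching form, a failing assertion points at the line for the mode that was active, which is the readability argument made in the thread.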

jorisvandenbossche · 2023-01-03T07:14:20Z

pandas/core/internals/managers.py

@@ -1947,9 +1947,17 @@ def _blklocs(self):
        """compat with BlockManager"""
        return None

-    def getitem_mgr(self, indexer: slice | npt.NDArray[np.bool_]) -> SingleBlockManager:
+    def getitem_mgr(self, indexer: slice | np.ndarray) -> SingleBlockManager:


If this type hint is incorrect, then so is the one in Series._get_values (since that is the only place where this is used).

jorisvandenbossche · 2023-01-03T07:17:46Z

pandas/core/internals/managers.py

+            and com.is_bool_indexer(indexer)
+            and indexer.all()
+        ):
+            return type(self)(blk, self.index, [weakref.ref(blk)], parent=self)


Is this special case needed here?
For example for dropna itself, this check is already done on a level higher up. And so doing that here again might mean the check is done multiple times?

Ah, I see we only do the check higher up in DataFrame.dropna, while this one is for Series.dropna.

(I am still not fully sure if I find it worth the added complexity though)

I'd add it for now; we can remove it if we don't think it's worth the complexity in the long run.
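The idea behind the special case can be sketched with plain NumPy, under the assumption that an all-True mask drops nothing and so needs no copy. The function `filter_rows` is hypothetical. It leaves out the reference tracking that the diff sets up with `weakref.ref(blk)` and `parent=self`, which is presumably what lets a later write trigger the deferred copy.

```python
import numpy as np


def filter_rows(values: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: keep the entries where `mask` is True."""
    if mask.dtype == bool and mask.all():
        # Nothing is dropped, so return a view instead of allocating a copy.
        return values[:]
    # Boolean-mask indexing in NumPy always allocates a new array.
    return values[mask]


values = np.array([1.0, 2.0, 3.0])
keep_all = filter_rows(values, np.array([True, True, True]))
keep_some = filter_rows(values, np.array([True, False, True]))
```

Without that tracking, a write to `keep_all` here would also change `values`, so this sketch only shows the memory-sharing shortcut and not the full Copy-on-Write behavior.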

jorisvandenbossche · 2023-01-03T07:23:34Z

pandas/tests/copy_view/test_methods.py

+    if using_copy_on_write:
+        assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
+    else:
+        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))


It's also consistent with how all other tests have already been written, so if we want to change it, let's do it in a separate PR for the full file (but personally, I would also keep as is)

phofl added 4 commits December 24, 2022 15:30

Use lazy copy for dropna

020de93

Add support for series

6816c53

Fix tests

0c2e9a3

Fix tests

942f3f3

phofl commented Dec 24, 2022

phofl mentioned this pull request Dec 24, 2022

ENH: Add lazy copy for drop duplicates #50431

Merged

5 tasks

mroeschke added the Copy / view semantics label Dec 27, 2022

mroeschke reviewed Dec 27, 2022

phofl mentioned this pull request Dec 28, 2022

DOC: Add note about copy on write improvements #50471

Merged

5 tasks

datapythonista reviewed Dec 29, 2022

Merge branch 'main' into cow_dropna

6e80af7

jorisvandenbossche reviewed Jan 3, 2023

jorisvandenbossche mentioned this pull request Jan 3, 2023

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed

73 tasks

phofl added this to the 2.0 milestone Jan 6, 2023

phofl and others added 4 commits January 12, 2023 17:00

Merge branch 'main' into cow_dropna

35cb354

Fix ref

560a72c

Merge branch 'main' into cow_dropna

fd112a1

Fix type hint

d48f830

jorisvandenbossche approved these changes Jan 13, 2023

phofl merged commit b37589a into pandas-dev:main Jan 13, 2023

phofl deleted the cow_dropna branch January 13, 2023 16:48
