Skip to content

ENH: Add lazy copy for take and between_time #50476

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 13, 2023
2 changes: 2 additions & 0 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -3779,6 +3779,8 @@ def _take(

See the docstring of `take` for full explanation of the parameters.
"""
if axis == 0 and np.array_equal(indices, np.arange(0, len(self))):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a quick check of how costly this is compared to the actual take? (to ensure we don't introduce a performance regression for the other cases)

Maybe could also first check if the length of indices is equal to len(self)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe could also first check if the length of indices is equal to len(self)

np.array_equal actually already does that, so doing that ourselves is not needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I profiled this, the cost is reduced if the DataFrame gets larger.

The actual problem is a bit different though: array_equal has no early exit, even if the first value is different, it checks the whole array (wanted to bring this up tomorrow as well). We should probably write something ourselves, because I guess we will need that a couple of times.

return self.copy(deep=None)

new_data = self._mgr.take(
indices,
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,7 @@
SingleArrayManager,
SingleBlockManager,
)
from pandas.core.internals.managers import _using_copy_on_write
from pandas.core.shared_docs import _shared_docs
from pandas.core.sorting import (
ensure_key_mapped,
Expand Down Expand Up @@ -877,6 +878,10 @@ def take(self, indices, axis: Axis = 0, **kwargs) -> Series:
nv.validate_take((), kwargs)

indices = ensure_platform_int(indices)

if _using_copy_on_write() and np.array_equal(indices, np.arange(0, len(self))):
return self.copy(deep=None)

new_index = self.index.take(indices)
new_values = self._values.take(indices)

Expand Down
34 changes: 34 additions & 0 deletions pandas/tests/copy_view/test_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
DataFrame,
MultiIndex,
Series,
date_range,
)
import pandas._testing as tm
from pandas.tests.copy_view.util import get_array
Expand Down Expand Up @@ -387,6 +388,39 @@ def test_assign_drop_duplicates(using_copy_on_write, method):
tm.assert_frame_equal(df, df_orig)


@pytest.mark.parametrize("obj", [Series([1, 2]), DataFrame({"a": [1, 2]})])
def test_take(using_copy_on_write, obj):
obj_orig = obj.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick: can you add a comment here that is testing the corner case of taking all rows? (because in general take always by definition returns a copy)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

obj2 = obj.take([0, 1])

if using_copy_on_write:
assert np.shares_memory(obj2.values, obj.values)
else:
assert not np.shares_memory(obj2.values, obj.values)

obj2.iloc[0] = 0
if using_copy_on_write:
assert not np.shares_memory(obj2.values, obj.values)
tm.assert_equal(obj, obj_orig)


@pytest.mark.parametrize("obj", [Series([1, 2]), DataFrame({"a": [1, 2]})])
def test_between_time(using_copy_on_write, obj):
obj.index = date_range("2018-04-09", periods=2, freq="1D20min")
obj_orig = obj.copy()
obj2 = obj.between_time("0:00", "1:00")

if using_copy_on_write:
assert np.shares_memory(obj2.values, obj.values)
else:
assert not np.shares_memory(obj2.values, obj.values)

obj2.iloc[0] = 0
if using_copy_on_write:
assert not np.shares_memory(obj2.values, obj.values)
tm.assert_equal(obj, obj_orig)


def test_reindex_like(using_copy_on_write):
df = DataFrame({"a": [1, 2], "b": "a"})
other = DataFrame({"b": "a", "a": [1, 2]})
Expand Down