ENH: Add lazy copy to concat and round #50501

phofl · 2022-12-30T12:06:06Z

xref ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473 (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

# Conflicts: # pandas/tests/copy_view/test_methods.py

jorisvandenbossche

Thanks for looking into this one! Added a few comments

pandas/tests/copy_view/test_functions.py

jorisvandenbossche · 2023-01-02T15:09:20Z

pandas/core/reshape/concat.py

+                )
+                if _using_copy_on_write() and not self.copy:
+                    mgr = mgr.copy(deep=False)
+                df = cons(mgr, copy=self.copy)


This might be doing a second copy now? (in case of copy=True)

I am also wondering if we could do mgr = cons(data ...)._mgr instead. That gives some overhead, but I worry a bit that not using cons for parsing the dict of Series might impact behaviour of subclasses.
(have to think through it for the geopandas case)

Good point, set copy to false here

jorisvandenbossche · 2023-01-02T15:12:39Z

pandas/core/reshape/concat.py

                new_data._consolidate_inplace()
+            elif _using_copy_on_write() and not self.copy:
+                new_data = new_data.copy(deep=False)


I would expect that concatenate_managers already takes care of setting up references. Of course, concatenate_managers is only used here, so it if it easier to implement this way, that's fine as well (but maybe add a comment about it that concatenate_managers basically returned a view and didn't yet set up new_data.refs)

It might also not be sufficient this way for keeping track of the reference from parent to child. Because new_data here will reference this intermediate BlockManager, but not the original BlockManager from the concat input. So the manager of the parent dataframes are not referenced by any result.

Checking out this branch, I can confirm that is indeed an issue with the current implementation:

In [1]: pd.options.mode.copy_on_write = True In [2]: df = DataFrame({"b": ["a"] * 3}) ...: df2 = DataFrame({"a": ["a"] * 3}) In [3]: result = pd.concat([df, df2], axis=1) In [4]: result Out[4]: b a 0 a a 1 a a 2 a a In [8]: df._mgr._has_no_reference(0) Out[8]: True In [9]: df.iloc[0, 0] = 'c' In [10]: result Out[10]: b a 0 c a 1 a a 2 a a

Pushed the logic down and added your case as test. Works now

jorisvandenbossche · 2023-01-02T15:24:35Z

pandas/tests/io/pytables/test_store.py

@@ -1009,6 +1010,7 @@ def test_to_hdf_with_object_column_names(tmp_path, setup_path):
            assert len(result)


+@pytest.mark.skipif(_using_copy_on_write(), reason="strides buggy with cow")
 def test_hdfstore_strides(setup_path):


Is this caused by the concat changes?

Yeah strides changed from 8 to 16, this is really weird

Looking into this a little bit, I think the issue is that in

pandas/pandas/io/pytables.py

Lines 3207 to 3212 in a0071f9

if len(dfs) > 0:

out = concat(dfs, axis=1)

out = out.reindex(columns=items, copy=False)

return out

return DataFrame(columns=axes[0], index=axes[1])

where we are creating the DataFrame, with the changes in this PR that concat won't copy if CoW is enabled. But it seems that the data from the HDF store always come back as F contiguous, while in pandas we want it as C contiguous for optimal performance.
Maybe we should just add a copy=True in the concat call in the mentioned snippet. That's already the default for non-CoW so won't change anything, and for CoW enabled also ensures we get the desired memory layout (at the expense of an extra copy while reading in, but to fix that, that can be a follow-up optimization investigating why the HDF store always returns F contiguous arrays)

Also commented about that on the related issue (that triggered adding this test): #22073

Yeah this seems to fix it

# Conflicts: # pandas/tests/copy_view/test_methods.py

phofl · 2023-01-07T12:12:52Z

pandas/core/internals/concat.py

@@ -283,8 +289,16 @@ def _concat_managers_axis0(
            nb._mgr_locs = nb._mgr_locs.add(offset)
            blocks.append(nb)

+        if not made_copy and not copy and using_copy_on_write():
+            refs.extend([weakref.ref(blk) for blk in mgr.blocks])
+            parent = mgr


We are obviously overwriting parent here when we have more than one object, is this a problem?

Ah, yes, I suppose that is a problem. parent is just used to keep a reference but is not otherwise used, so it doesn't really matter what object it is. So I think we can also make it a list of parent objects in this case.

(will try to make a test that fails because of this; generally it is for chained operations, which you might have less often with concat)

A "chained" concat (not very normal usage .., but still should work correctly ;)):

In [9]: pd.options.mode.copy_on_write = True In [36]: res = pd.concat([pd.concat([df1, df2], axis=1), df3], axis=1) In [37]: df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]}) In [38]: df2 = pd.DataFrame({'c': [4, 5, 6]}) In [39]: df3 = pd.DataFrame({'d': [4, 5, 6]}) In [40]: res = pd.concat([pd.concat([df1, df2], axis=1), df3], axis=1) In [41]: df1._mgr._has_no_reference(0) Out[41]: True In [42]: df1.iloc[0, 0] = 100 In [43]: res Out[43]: a b c d 0 100 0.1 4 4 1 2 0.2 5 5 2 3 0.3 6 6

Yep, definitely agree. I'll keep track of a list of parent objects then. Series does not work correctly either, but this should be straightforward to fix I think

phofl · 2023-01-12T22:15:45Z

Would merge tomorrow or so if no objections

pandas/tests/copy_view/test_functions.py

pandas/core/internals/concat.py

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche · 2023-01-13T15:33:55Z

pandas/core/reshape/concat.py

+                )
+                if using_copy_on_write() and not self.copy:
+                    mgr.parent = [obj._mgr for obj in self.objs]
+                    mgr.refs = [weakref.ref(obj._mgr.blocks[0]) for obj in self.objs]


I would very much prefer that we can keep this ref handling only to the internals ..
I was thinking, in theory it should be possible to let dict_to_mgr do this? (that might actually also help for the DataFrame(dict of Series) constructor case)

(that might be quite a bit more complicated though (since that function needs to handle more than just this case), so we could also leave that for a follow-up)

I agree in general, but we have to be very careful.

Would you be ok with pushing this to the internals after finishing the constructors or when doing the constructors? We don't have many tests in the constructor area yet and I don't want to break anything without noticing.

Exactly because this is tricky, I think it would be good to have a general rule that only the internals need to care about this, and that outside of the internals all the APIs we use take care of that for you.

But yes, let's do this when tackling those functions for the constructors.

pushing this to the internals after finishing the constructors or when doing the constructors?

I am doing this in #50777. Will update that PR to also refactor the above once this is merged

Sounds good! I'll fix the typing stuff and then merge so that we can continue with the methods blocked by concat

jorisvandenbossche

And a bunch of tiny nitpicks

jorisvandenbossche · 2023-01-13T15:38:59Z

pandas/core/reshape/concat.py


 import numpy as np

+from pandas._config import get_option


Suggested change

from pandas._config import get_option

from pandas._config import get_option, using_copy_on_write

Ah this is annoying, I thought that ci should fail, but makes sense that the import continues to work through there...

jorisvandenbossche · 2023-01-13T15:39:08Z

pandas/core/reshape/concat.py

@@ -47,6 +50,8 @@
    get_unanimous_names,
 )
 from pandas.core.internals import concatenate_managers
+from pandas.core.internals.construction import dict_to_mgr
+from pandas.core.internals.managers import using_copy_on_write


Suggested change

from pandas.core.internals.managers import using_copy_on_write

jorisvandenbossche · 2023-01-13T15:41:02Z

pandas/tests/copy_view/test_functions.py

+    df1 = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
+
+    df2 = DataFrame({"c": [4, 5, 6]})
+
+    df3 = DataFrame({"d": [4, 5, 6]})


Suggested change

df1 = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

df2 = DataFrame({"c": [4, 5, 6]})

df3 = DataFrame({"d": [4, 5, 6]})

df1 = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

df2 = DataFrame({"c": [4, 5, 6]})

df3 = DataFrame({"d": [4, 5, 6]})

jorisvandenbossche · 2023-01-13T15:41:47Z

pandas/tests/copy_view/test_functions.py

+    tm.assert_series_equal(ser2, ser2_orig)
+
+
+def test_chained_concat(using_copy_on_write):


Suggested change

def test_chained_concat(using_copy_on_write):

def test_concat_frames_chained(using_copy_on_write):

We might need a similar test for series with chained concat?

jorisvandenbossche · 2023-01-13T15:42:46Z

pandas/tests/copy_view/test_functions.py

+    tm.assert_frame_equal(result, expected)
+
+
+def test_concat_series_parent_obj(using_copy_on_write):


Suggested change

def test_concat_series_parent_obj(using_copy_on_write):

def test_concat_series_updating_input(using_copy_on_write):

(that's the name pattern you used above for frames)

jorisvandenbossche · 2023-01-13T15:43:16Z

pandas/tests/io/pytables/test_store.py

@@ -24,6 +24,7 @@
    timedelta_range,
 )
 import pandas._testing as tm
+from pandas.core.internals.managers import using_copy_on_write


Suggested change

from pandas.core.internals.managers import using_copy_on_write

from pandas._config import using_copy_on_write

phofl · 2023-01-13T16:43:45Z

pandas/core/frame.py

@@ -730,6 +730,7 @@ def __init__(
                    columns,
                    dtype=dtype,
                    typ=manager,
+                    copy=copy,


This is necessary to avoid a copy when transforming from series to df in mixed case. Should we hide this behind the CoW flag?

I suppose this will change behaviour (for the default non-CoW mode) if you pass copy=False ? (and so actually for the default copy=None, as that is translated to copy=False in such a case). Currently that still does make a copy always:

In [9]: s = pd.Series([1, 2, 3], name="a") In [10]: df = pd.DataFrame(s, copy=False) In [11]: s.iloc[0] = 100 In [12]: df Out[12]: a 0 1 1 2 2 3

One could consider that a bug (but then we also need to add testes for that case here), but given that we probably want to change that in the future to actually behave as a copy (i.e. the current behaviour), I would maybe not fix the "bug".
So, yes, then I would hide this behind a using_copy_on_write() check.

Added the option check

# Conflicts: # pandas/tests/copy_view/test_methods.py

jorisvandenbossche · 2023-01-16T08:33:46Z

pandas/core/reshape/concat.py

                    # mypy needs to know sample is not an NDFrame
                    sample = cast("DataFrame | Series", sample)
-                    obj = sample._constructor({name: obj})
+                    obj = sample._constructor(obj, columns=[name], copy=False)
+                    obj.columns = [new_name]


Do we have a test that covers this mixed case for checking that the CoW works correctly?

Yep the chained series case covers this, but will add an explicit one as well after hiding the change behind the cow flag

jorisvandenbossche · 2023-01-17T08:23:31Z

Thanks @phofl !

jorisvandenbossche · 2023-01-17T08:40:48Z

For future reference: the usage of internal APIs outside the internals (in reshape/concat.py) is being removed again in #50777

phofl · 2023-01-17T09:58:03Z

Thx

phofl added 7 commits December 24, 2022 11:04

Start with implmenting concat for cow

94c982e

Merge remote-tracking branch 'upstream/main' into cow_concat_new

7fbba4a

# Conflicts: # pandas/tests/copy_view/test_methods.py

Fix concat logic

0db1c39

Add series test

fc564f8

Clean up

a7165c2

Skip test

20a62bb

Fix test

bca2123

jorisvandenbossche mentioned this pull request Jan 2, 2023

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed

73 tasks

jorisvandenbossche reviewed Jan 2, 2023

View reviewed changes

jorisvandenbossche added the Copy / view semantics label Jan 2, 2023

jorisvandenbossche added this to the 2.0 milestone Jan 2, 2023

phofl added 4 commits January 7, 2023 12:46

Add check

3c4abd2

Merge remote-tracking branch 'upstream/main' into cow_concat_new

7734979

# Conflicts: # pandas/tests/copy_view/test_methods.py

Push logic down

8bc3272

Fix test

79cedd4

phofl commented Jan 7, 2023

View reviewed changes

phofl and others added 2 commits January 7, 2023 22:47

Fix typing

6c8a704

Merge branch 'main' into cow_concat_new

de678ab

jorisvandenbossche reviewed Jan 13, 2023

View reviewed changes

pandas/tests/copy_view/test_functions.py Show resolved Hide resolved

jorisvandenbossche reviewed Jan 13, 2023

View reviewed changes

pandas/core/internals/concat.py Outdated Show resolved Hide resolved

jorisvandenbossche reviewed Jan 13, 2023

View reviewed changes

pandas/core/internals/concat.py Outdated Show resolved Hide resolved

phofl added 4 commits January 13, 2023 10:46

Fix annotation

dddc9f0

Fix chained concat

d18416d

Pass to bm

5ccf23e

Fix series case

7648b23

jorisvandenbossche reviewed Jan 13, 2023

View reviewed changes

pandas/core/internals/concat.py Outdated Show resolved Hide resolved

Update pandas/core/internals/concat.py

99da573

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche reviewed Jan 13, 2023

View reviewed changes

phofl added 3 commits January 13, 2023 17:05

Fix concat

a6afa8b

Merge remote-tracking branch 'upstream/main' into cow_concat_new

3d0772c

Fix chained series case

8b49210

phofl commented Jan 13, 2023

View reviewed changes

phofl added 4 commits January 13, 2023 17:44

Merge remote-tracking branch 'upstream/main' into cow_concat_new

48a4f7e

# Conflicts: # pandas/tests/copy_view/test_methods.py

Merge remote-tracking branch 'upstream/main' into cow_concat_new

690f072

Fix mypy

8ba35fe

Fix bug

2047d5c

jorisvandenbossche reviewed Jan 16, 2023

View reviewed changes

phofl added 3 commits January 16, 2023 11:43

Add test and flag

e959c53

Fix hdf test

891e07c

Fix mypy

04adc17

jorisvandenbossche approved these changes Jan 16, 2023

View reviewed changes

jorisvandenbossche merged commit c426dc0 into pandas-dev:main Jan 17, 2023

phofl deleted the cow_concat_new branch January 17, 2023 09:58

	if len(dfs) > 0:
	out = concat(dfs, axis=1)
	out = out.reindex(columns=items, copy=False)
	return out

	return DataFrame(columns=axes[0], index=axes[1])

	from pandas._config import get_option
	from pandas._config import get_option, using_copy_on_write

		tm.assert_series_equal(ser2, ser2_orig)


		def test_chained_concat(using_copy_on_write):

	def test_chained_concat(using_copy_on_write):
	def test_concat_frames_chained(using_copy_on_write):

		tm.assert_frame_equal(result, expected)


		def test_concat_series_parent_obj(using_copy_on_write):

	def test_concat_series_parent_obj(using_copy_on_write):
	def test_concat_series_updating_input(using_copy_on_write):

	from pandas.core.internals.managers import using_copy_on_write
	from pandas._config import using_copy_on_write

Uh oh!

ENH: Add lazy copy to concat and round #50501

ENH: Add lazy copy to concat and round #50501

Conversation

phofl commented Dec 30, 2022

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Jan 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phofl commented Jan 12, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jorisvandenbossche Jan 16, 2023 •

edited

Loading