-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add lazy copy to concat and round #50501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # pandas/tests/copy_view/test_methods.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this one! Added a few comments
pandas/core/reshape/concat.py
Outdated
) | ||
if _using_copy_on_write() and not self.copy: | ||
mgr = mgr.copy(deep=False) | ||
df = cons(mgr, copy=self.copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be doing a second copy now? (in case of copy=True
)
I am also wondering if we could do mgr = cons(data ...)._mgr
instead. That gives some overhead, but I worry a bit that not using cons
for parsing the dict of Series might impact behaviour of subclasses.
(have to think through it for the geopandas case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, set copy to false here
pandas/core/reshape/concat.py
Outdated
new_data._consolidate_inplace() | ||
elif _using_copy_on_write() and not self.copy: | ||
new_data = new_data.copy(deep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that concatenate_managers
already takes care of setting up references. Of course, concatenate_managers
is only used here, so it if it easier to implement this way, that's fine as well (but maybe add a comment about it that concatenate_managers
basically returned a view and didn't yet set up new_data.refs
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also not be sufficient this way for keeping track of the reference from parent to child. Because new_data
here will reference this intermediate BlockManager, but not the original BlockManager from the concat input. So the manager of the parent dataframes are not referenced by any result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking out this branch, I can confirm that is indeed an issue with the current implementation:
In [1]: pd.options.mode.copy_on_write = True
In [2]: df = DataFrame({"b": ["a"] * 3})
...: df2 = DataFrame({"a": ["a"] * 3})
In [3]: result = pd.concat([df, df2], axis=1)
In [4]: result
Out[4]:
b a
0 a a
1 a a
2 a a
In [8]: df._mgr._has_no_reference(0)
Out[8]: True
In [9]: df.iloc[0, 0] = 'c'
In [10]: result
Out[10]:
b a
0 c a
1 a a
2 a a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed the logic down and added your case as test. Works now
@@ -1009,6 +1010,7 @@ def test_to_hdf_with_object_column_names(tmp_path, setup_path): | |||
assert len(result) | |||
|
|||
|
|||
@pytest.mark.skipif(_using_copy_on_write(), reason="strides buggy with cow") | |||
def test_hdfstore_strides(setup_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this caused by the concat changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah strides changed from 8 to 16, this is really weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into this a little bit, I think the issue is that in
Lines 3207 to 3212 in a0071f9
if len(dfs) > 0: | |
out = concat(dfs, axis=1) | |
out = out.reindex(columns=items, copy=False) | |
return out | |
return DataFrame(columns=axes[0], index=axes[1]) |
where we are creating the DataFrame, with the changes in this PR that concat
won't copy if CoW is enabled. But it seems that the data from the HDF store always come back as F contiguous, while in pandas we want it as C contiguous for optimal performance.
Maybe we should just add a copy=True
in the concat call in the mentioned snippet. That's already the default for non-CoW so won't change anything, and for CoW enabled also ensures we get the desired memory layout (at the expense of an extra copy while reading in, but to fix that, that can be a follow-up optimization investigating why the HDF store always returns F contiguous arrays)
Also commented about that on the related issue (that triggered adding this test): #22073
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this seems to fix it
# Conflicts: # pandas/tests/copy_view/test_methods.py
pandas/core/internals/concat.py
Outdated
@@ -283,8 +289,16 @@ def _concat_managers_axis0( | |||
nb._mgr_locs = nb._mgr_locs.add(offset) | |||
blocks.append(nb) | |||
|
|||
if not made_copy and not copy and using_copy_on_write(): | |||
refs.extend([weakref.ref(blk) for blk in mgr.blocks]) | |||
parent = mgr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are obviously overwriting parent here when we have more than one object, is this a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, I suppose that is a problem. parent
is just used to keep a reference but is not otherwise used, so it doesn't really matter what object it is. So I think we can also make it a list of parent objects in this case.
(will try to make a test that fails because of this; generally it is for chained operations, which you might have less often with concat)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A "chained" concat (not very normal usage .., but still should work correctly ;)):
In [9]: pd.options.mode.copy_on_write = True
In [36]: res = pd.concat([pd.concat([df1, df2], axis=1), df3], axis=1)
In [37]: df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
In [38]: df2 = pd.DataFrame({'c': [4, 5, 6]})
In [39]: df3 = pd.DataFrame({'d': [4, 5, 6]})
In [40]: res = pd.concat([pd.concat([df1, df2], axis=1), df3], axis=1)
In [41]: df1._mgr._has_no_reference(0)
Out[41]: True
In [42]: df1.iloc[0, 0] = 100
In [43]: res
Out[43]:
a b c d
0 100 0.1 4 4
1 2 0.2 5 5
2 3 0.3 6 6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, definitely agree. I'll keep track of a list of parent objects then. Series does not work correctly either, but this should be straightforward to fix I think
Would merge tomorrow or so if no objections |
Co-authored-by: Joris Van den Bossche <[email protected]>
pandas/core/reshape/concat.py
Outdated
) | ||
if using_copy_on_write() and not self.copy: | ||
mgr.parent = [obj._mgr for obj in self.objs] | ||
mgr.refs = [weakref.ref(obj._mgr.blocks[0]) for obj in self.objs] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would very much prefer that we can keep this ref handling only to the internals ..
I was thinking, in theory it should be possible to let dict_to_mgr
do this? (that might actually also help for the DataFrame(dict of Series)
constructor case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(that might be quite a bit more complicated though (since that function needs to handle more than just this case), so we could also leave that for a follow-up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in general, but we have to be very careful.
Would you be ok with pushing this to the internals after finishing the constructors or when doing the constructors? We don't have many tests in the constructor area yet and I don't want to break anything without noticing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly because this is tricky, I think it would be good to have a general rule that only the internals need to care about this, and that outside of the internals all the APIs we use take care of that for you.
But yes, let's do this when tackling those functions for the constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pushing this to the internals after finishing the constructors or when doing the constructors?
I am doing this in #50777. Will update that PR to also refactor the above once this is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I'll fix the typing stuff and then merge so that we can continue with the methods blocked by concat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And a bunch of tiny nitpicks
pandas/core/reshape/concat.py
Outdated
|
||
import numpy as np | ||
|
||
from pandas._config import get_option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas._config import get_option | |
from pandas._config import get_option, using_copy_on_write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is annoying, I thought that ci should fail, but makes sense that the import continues to work through there...
pandas/core/reshape/concat.py
Outdated
@@ -47,6 +50,8 @@ | |||
get_unanimous_names, | |||
) | |||
from pandas.core.internals import concatenate_managers | |||
from pandas.core.internals.construction import dict_to_mgr | |||
from pandas.core.internals.managers import using_copy_on_write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas.core.internals.managers import using_copy_on_write |
df1 = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}) | ||
|
||
df2 = DataFrame({"c": [4, 5, 6]}) | ||
|
||
df3 = DataFrame({"d": [4, 5, 6]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df1 = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}) | |
df2 = DataFrame({"c": [4, 5, 6]}) | |
df3 = DataFrame({"d": [4, 5, 6]}) | |
df1 = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}) | |
df2 = DataFrame({"c": [4, 5, 6]}) | |
df3 = DataFrame({"d": [4, 5, 6]}) |
tm.assert_series_equal(ser2, ser2_orig) | ||
|
||
|
||
def test_chained_concat(using_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_chained_concat(using_copy_on_write): | |
def test_concat_frames_chained(using_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need a similar test for series with chained concat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_concat_series_parent_obj(using_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_concat_series_parent_obj(using_copy_on_write): | |
def test_concat_series_updating_input(using_copy_on_write): |
(that's the name pattern you used above for frames)
@@ -24,6 +24,7 @@ | |||
timedelta_range, | |||
) | |||
import pandas._testing as tm | |||
from pandas.core.internals.managers import using_copy_on_write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas.core.internals.managers import using_copy_on_write | |
from pandas._config import using_copy_on_write |
pandas/core/frame.py
Outdated
@@ -730,6 +730,7 @@ def __init__( | |||
columns, | |||
dtype=dtype, | |||
typ=manager, | |||
copy=copy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary to avoid a copy when transforming from series to df in mixed case. Should we hide this behind the CoW flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this will change behaviour (for the default non-CoW mode) if you pass copy=False
? (and so actually for the default copy=None, as that is translated to copy=False in such a case). Currently that still does make a copy always:
In [9]: s = pd.Series([1, 2, 3], name="a")
In [10]: df = pd.DataFrame(s, copy=False)
In [11]: s.iloc[0] = 100
In [12]: df
Out[12]:
a
0 1
1 2
2 3
One could consider that a bug (but then we also need to add testes for that case here), but given that we probably want to change that in the future to actually behave as a copy (i.e. the current behaviour), I would maybe not fix the "bug".
So, yes, then I would hide this behind a using_copy_on_write()
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the option check
# Conflicts: # pandas/tests/copy_view/test_methods.py
# mypy needs to know sample is not an NDFrame | ||
sample = cast("DataFrame | Series", sample) | ||
obj = sample._constructor({name: obj}) | ||
obj = sample._constructor(obj, columns=[name], copy=False) | ||
obj.columns = [new_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test that covers this mixed case for checking that the CoW works correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep the chained series case covers this, but will add an explicit one as well after hiding the change behind the cow flag
Thanks @phofl ! |
For future reference: the usage of internal APIs outside the internals (in reshape/concat.py) is being removed again in #50777 |
Thx |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.