Skip to content

PERF/BUG: ensure we store contiguous arrays in DataFrame(ndarray) for ArrayManager #44562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

#42689 removed an "unwanted" copy, but I actually added this on purpose. This is to ensure we store contiguous 1D arrays by default, which is important for performance reasons.

@@ -226,7 +226,10 @@ def test_values_lcd(self, mixed_float_frame, mixed_int_frame):


class TestPrivateValues:
def test_private_values_dt64tz(self, request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the off-chance this PR isn't merged, should remove the request arg here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using_array_manager isnt used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and request also not anymore. Updated

)
for i in range(values.shape[1])
]
else:
if is_datetime_or_timedelta_dtype(values.dtype):
values = ensure_wrapped_if_datetimelike(values)
arrays = [values[:, i] for i in range(values.shape[1])]
# copy the array to ensure contiguous memory
arrays = [values[:, i].copy() for i in range(values.shape[1])]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do the copy+comment once on the next line instead of both here and L352

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would require an additional list comprehension over the arrays?

@jbrockmendel
Copy link
Member

this means we're not respecting the copy=False passed to DataFrame constructor?

@jorisvandenbossche
Copy link
Member Author

this means we're not respecting the copy=False passed to DataFrame constructor?

Yes, and that was on purpose. Note that the intent was to have this as the default for now, and we can see later if we can provide an (optional) optimization for people that care about preserving a view on the 2D array.

Having this as the default for now also makes it easier to compare benchmarks, otherwise we would need to update all ASVs to add a copy() if a dataframe is created from an ndarray.

Also in any case, df.values will likely never be a view on the original ndarray arr in the standard case, even if DataFrame(arr) preserves a view on arr (because otherwise we would need to detect that each array is a subsequent slice of a single parent array, which I don't think we want to start doing). So in that sense, I think the # TODO(ArrayManager) keep view on 2D array? could be removed anyway now I updated the test to use multiple columns (and not be about the 1-column special case)

@jorisvandenbossche
Copy link
Member Author

We talked before (on the mailing list discussion about BlockManager/ArrayManager) about a potential way to keep DataFrame(array).values a fast roundtrip for the ndarray (eg by delaying creating the manager in this case), which would be useful for eg scikit-learn. But I see that as a future potential optimization to look into, if this turns out to be a bottleneck / import use case.

BTW, I could already add some logic to honor an explicit copy=False (with the default copy=None preserve the "ensure contiguous 1D array" behaviour)

@jbrockmendel
Copy link
Member

this means we're not respecting the copy=False passed to DataFrame constructor?

Yes, and that was on purpose. Note that the intent was to have this as the default for now, and we can see later if we can provide an (optional) optimization for people that care about preserving a view on the 2D array.

If that's the case, then it can/should be treated the same way as we currently treat dicts passed to the DataFrame constructor (where the default is copy=None). i.e. it should be made explicit and documented.

@jorisvandenbossche
Copy link
Member Author

If that's the case, then it can/should be treated the same way as we currently treat dicts passed to the DataFrame constructor (where the default is copy=None). i.e. it should be made explicit and documented.

@jbrockmendel updated the PR to handle this through copy=None which is translated into copy=True if using array manager and input is ndarray, so using the same way how this keyword is updated for dicts.

I didn't add this to the public docstring of DataFrame because currently ArrayManager is not yet documented in general, but it's now documented for devs in the comments of the the copy=None handling.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you verify that the existing BM code paths are unchanged by this in perf (e.g. run some asv's for construction)

)
for i in range(values.shape[1])
]
if copy:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would move the copy after the main if/else here for clarify

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea, updated

@jorisvandenbossche
Copy link
Member Author

can you verify that the existing BM code paths are unchanged by this in perf

So in the constructor, the only thing that changes in the BM path is one additional elif manager == "array" check, which should not have significant performance impact.

@jbrockmendel
Copy link
Member

couple small comments, otherwise looks good

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Nov 29, 2021

The one failure is an unrelated unexpected ResourceWarning

@jorisvandenbossche jorisvandenbossche merged commit 6861fc5 into pandas-dev:master Nov 30, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-constructor-frame-ndarray branch November 30, 2021 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants