-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF/BUG: ensure we store contiguous arrays in DataFrame(ndarray) for ArrayManager #44562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF/BUG: ensure we store contiguous arrays in DataFrame(ndarray) for ArrayManager #44562
Conversation
@@ -226,7 +226,10 @@ def test_values_lcd(self, mixed_float_frame, mixed_int_frame): | |||
|
|||
|
|||
class TestPrivateValues: | |||
def test_private_values_dt64tz(self, request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the off-chance this PR isn't merged, should remove the request arg here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using_array_manager isnt used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, and request also not anymore. Updated
) | ||
for i in range(values.shape[1]) | ||
] | ||
else: | ||
if is_datetime_or_timedelta_dtype(values.dtype): | ||
values = ensure_wrapped_if_datetimelike(values) | ||
arrays = [values[:, i] for i in range(values.shape[1])] | ||
# copy the array to ensure contiguous memory | ||
arrays = [values[:, i].copy() for i in range(values.shape[1])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do the copy+comment once on the next line instead of both here and L352
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would require an additional list comprehension over the arrays?
this means we're not respecting the copy=False passed to DataFrame constructor? |
Yes, and that was on purpose. Note that the intent was to have this as the default for now, and we can see later if we can provide an (optional) optimization for people that care about preserving a view on the 2D array. Having this as the default for now also makes it easier to compare benchmarks, otherwise we would need to update all ASVs to add a Also in any case, |
We talked before (on the mailing list discussion about BlockManager/ArrayManager) about a potential way to keep BTW, I could already add some logic to honor an explicit |
If that's the case, then it can/should be treated the same way as we currently treat dicts passed to the DataFrame constructor (where the default is copy=None). i.e. it should be made explicit and documented. |
@jbrockmendel updated the PR to handle this through I didn't add this to the public docstring of DataFrame because currently ArrayManager is not yet documented in general, but it's now documented for devs in the comments of the the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you verify that the existing BM code paths are unchanged by this in perf (e.g. run some asv's for construction)
) | ||
for i in range(values.shape[1]) | ||
] | ||
if copy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would move the copy after the main if/else here for clarify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea, updated
So in the constructor, the only thing that changes in the BM path is one additional |
couple small comments, otherwise looks good |
The one failure is an unrelated unexpected ResourceWarning |
#42689 removed an "unwanted" copy, but I actually added this on purpose. This is to ensure we store contiguous 1D arrays by default, which is important for performance reasons.