PERF/BUG: ensure we store contiguous arrays in DataFrame(ndarray) for ArrayManager #44562

jorisvandenbossche · 2021-11-21T16:22:51Z

#42689 removed an "unwanted" copy, but I actually added this on purpose. This is to ensure we store contiguous 1D arrays by default, which is important for performance reasons.

This reverts commit cb3b4e4.
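
For illustration, the contiguity issue described above can be reproduced with NumPy alone. This is a minimal sketch rather than code from the PR; the array shape and variable names are arbitrary assumptions.

```python
import numpy as np

# A C-ordered (row-major) 2D array, as typically passed to DataFrame(ndarray).
values = np.arange(12, dtype="float64").reshape(4, 3)

# Slicing out a column gives a strided view: consecutive elements are
# one full row (3 * 8 bytes) apart, so the 1D array is not contiguous.
col_view = values[:, 0]
print(col_view.flags["C_CONTIGUOUS"], col_view.strides)  # False (24,)

# Copying materializes the column in its own contiguous buffer.
col_copy = values[:, 0].copy()
print(col_copy.flags["C_CONTIGUOUS"], col_copy.strides)  # True (8,)

# The copy no longer shares memory with the original 2D array.
print(np.shares_memory(values, col_copy))  # False
```

Column-wise operations on the strided view have to skip over the other columns' data on every step, which is the performance concern the copy is meant to avoid.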

jbrockmendel · 2021-11-21T17:03:33Z

pandas/tests/frame/methods/test_values.py

@@ -226,7 +226,10 @@ def test_values_lcd(self, mixed_float_frame, mixed_int_frame):


 class TestPrivateValues:
-    def test_private_values_dt64tz(self, request):


on the off-chance this PR isn't merged, should remove the request arg here

using_array_manager isn't used?

Yep, and request isn't used anymore either. Updated.

pandas/tests/frame/test_constructors.py

jbrockmendel · 2021-11-21T17:05:59Z

pandas/core/internals/construction.py

                )
                for i in range(values.shape[1])
            ]
        else:
            if is_datetime_or_timedelta_dtype(values.dtype):
                values = ensure_wrapped_if_datetimelike(values)
-            arrays = [values[:, i] for i in range(values.shape[1])]
+            # copy the array to ensure contiguous memory
+            arrays = [values[:, i].copy() for i in range(values.shape[1])]


could do the copy+comment once on the next line instead of both here and L352

I think that would require an additional list comprehension over the arrays?

jbrockmendel · 2021-11-21T17:07:55Z

this means we're not respecting the copy=False passed to DataFrame constructor?

jorisvandenbossche · 2021-11-21T17:22:10Z

> this means we're not respecting the copy=False passed to DataFrame constructor?

Yes, and that was on purpose. Note that the intent was to have this as the default for now, and we can see later if we can provide an (optional) optimization for people that care about preserving a view on the 2D array.

Having this as the default for now also makes it easier to compare benchmarks, otherwise we would need to update all ASVs to add a copy() if a dataframe is created from an ndarray.

Also, in any case, df.values will likely never be a view on the original ndarray arr in the standard case, even if DataFrame(arr) preserves a view on arr (because otherwise we would need to detect that each array is a consecutive slice of a single parent array, which I don't think we want to start doing). So in that sense, I think the # TODO(ArrayManager) keep view on 2D array? could be removed anyway, now that I've updated the test to use multiple columns (and no longer be about the 1-column special case).
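
The point about df.values can be sketched with plain NumPy. The assumption here is that a column store rebuilds the 2D result by stacking its 1D arrays; np.column_stack stands in for whatever pandas does internally and is not taken from the PR.

```python
import numpy as np

arr = np.arange(6, dtype="int64").reshape(3, 2)

# Keep per-column views on the original 2D array (the copy=False case).
columns = [arr[:, i] for i in range(arr.shape[1])]
assert all(np.shares_memory(arr, col) for col in columns)

# Reassembling a 2D array from separate 1D arrays allocates a new buffer,
# so the result equals `arr` but is not a view on it.
roundtrip = np.column_stack(columns)
print(np.array_equal(roundtrip, arr))    # True
print(np.shares_memory(roundtrip, arr))  # False
```

Getting a view back would require recognizing that the separate columns are adjacent slices of one parent array, which is the detection step the comment argues against.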

jorisvandenbossche · 2021-11-21T17:35:13Z

We talked before (on the mailing list discussion about BlockManager/ArrayManager) about a potential way to keep DataFrame(array).values a fast roundtrip for the ndarray (e.g. by delaying creation of the manager in this case), which would be useful for e.g. scikit-learn. But I see that as a potential future optimization to look into if this turns out to be a bottleneck or an important use case.

BTW, I could already add some logic to honor an explicit copy=False (with the default copy=None preserving the "ensure contiguous 1D array" behaviour).

jbrockmendel · 2021-11-21T19:17:44Z

> > this means we're not respecting the copy=False passed to DataFrame constructor?

> Yes, and that was on purpose. Note that the intent was to have this as the default for now, and we can see later if we can provide an (optional) optimization for people that care about preserving a view on the 2D array.

If that's the case, then it can/should be treated the same way as we currently treat dicts passed to the DataFrame constructor (where the default is copy=None). i.e. it should be made explicit and documented.

jorisvandenbossche · 2021-11-22T08:36:20Z

> If that's the case, then it can/should be treated the same way as we currently treat dicts passed to the DataFrame constructor (where the default is copy=None). i.e. it should be made explicit and documented.

@jbrockmendel updated the PR to handle this through copy=None, which is translated into copy=True when using ArrayManager and the input is an ndarray, the same way this keyword is already resolved for dicts.

I didn't add this to the public docstring of DataFrame because ArrayManager is currently not documented in general, but it's now documented for devs in the comments of the copy=None handling.
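
The copy=None resolution described above might look roughly like the following. This is a hedged sketch: resolve_copy is a hypothetical helper name, not a pandas function, and the 2D check is an assumption based on the "limit to 2D array input" commit in this PR.

```python
import numpy as np


def resolve_copy(data, copy, manager):
    """Hypothetical helper: turn copy=None into a concrete True/False."""
    if copy is not None:
        # An explicit copy=True or copy=False from the caller is honored.
        return copy
    if isinstance(data, dict):
        # Dict input: None behaves like copy=True.
        return True
    if manager == "array" and isinstance(data, np.ndarray) and data.ndim == 2:
        # ArrayManager with 2D ndarray input: copy by default so that the
        # stored 1D column arrays are contiguous.
        return True
    # Everything else keeps the no-copy default.
    return False


arr = np.zeros((4, 3))
print(resolve_copy(arr, None, "array"))   # True
print(resolve_copy(arr, None, "block"))   # False
print(resolve_copy(arr, False, "array"))  # False
```

With this shape, the BlockManager path only pays for one extra comparison, and callers who want to keep a view can still pass copy=False explicitly.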

jreback

can you verify that the existing BM code paths are unchanged by this in perf (e.g. run some asv's for construction)

jreback · 2021-11-23T15:56:47Z

pandas/core/internals/construction.py

                )
                for i in range(values.shape[1])
            ]
+            if copy:


I would move the copy after the main if/else here for clarity

Yes, good idea, updated
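
A rough sketch of the suggested structure, with the copy applied once after the branches instead of inside each one. split_columns is a hypothetical stand-in, and the object-dtype branch is only a placeholder for the per-column handling in the real construction code.

```python
import numpy as np


def split_columns(values, copy):
    """Hypothetical helper: split a 2D ndarray into a list of 1D columns."""
    if values.dtype == object:
        # Placeholder for the branch that post-processes each column.
        arrays = [np.asarray(values[:, i]) for i in range(values.shape[1])]
    else:
        arrays = [values[:, i] for i in range(values.shape[1])]

    # Single copy step shared by both branches: ensures contiguous memory.
    if copy:
        arrays = [arr.copy() for arr in arrays]
    return arrays


vals = np.arange(6.0).reshape(3, 2)
views = split_columns(vals, copy=False)
copies = split_columns(vals, copy=True)
print([a.flags["C_CONTIGUOUS"] for a in views])   # [False, False]
print([a.flags["C_CONTIGUOUS"] for a in copies])  # [True, True]
```

The extra list comprehension is the cost mentioned earlier in the review; in exchange, the comment and the copy logic live in one place.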

jorisvandenbossche · 2021-11-23T16:05:46Z

> can you verify that the existing BM code paths are unchanged by this in perf

So in the constructor, the only thing that changes in the BM path is one additional elif manager == "array" check, which should not have significant performance impact.

pandas/core/frame.py

pandas/core/internals/construction.py

jbrockmendel · 2021-11-24T22:03:57Z

couple small comments, otherwise looks good

jorisvandenbossche · 2021-11-29T21:22:52Z

The one failure is an unrelated unexpected ResourceWarning

jorisvandenbossche added 2 commits November 21, 2021 16:51

Revert "BUG: ArrayManager construction unwanted copy (pandas-dev#42689)"

2ff8667

This reverts commit cb3b4e4.

add comment and test

03e568e

jbrockmendel reviewed Nov 21, 2021

pandas/tests/frame/test_constructors.py

jbrockmendel reviewed Nov 21, 2021

jorisvandenbossche added 3 commits November 22, 2021 08:23

Merge remote-tracking branch 'upstream/master' into am-constructor-fr…

2f030b1

…ame-ndarray

control through copy=None keyword

a515fd1

add explicit test

2ac744e

limit to 2D array input

53a744a

jorisvandenbossche added the ArrayManager label Nov 23, 2021

jorisvandenbossche added this to the 1.4 milestone Nov 23, 2021

jorisvandenbossche added 2 commits November 23, 2021 15:46

Merge remote-tracking branch 'upstream/master' into am-constructor-fr…

1aff092

…ame-ndarray

update test_private_values_dt64tz

1bb6af5

jreback requested changes Nov 23, 2021

jorisvandenbossche added 2 commits November 23, 2021 17:01

copy after if/else

7e96eae

remove additional copy

655b8ad

jbrockmendel reviewed Nov 24, 2021

pandas/core/frame.py

jbrockmendel reviewed Nov 24, 2021

pandas/core/internals/construction.py

jorisvandenbossche added 2 commits November 25, 2021 20:29

feedback

1a69f98

Merge remote-tracking branch 'upstream/master' into am-constructor-fr…

809f72f

…ame-ndarray

remove unused test arguments

a4f76bf

jorisvandenbossche merged commit 6861fc5 into pandas-dev:master Nov 30, 2021

jorisvandenbossche deleted the am-constructor-frame-ndarray branch November 30, 2021 07:32
