PERF: DataFrame backed by non-contiguous masked EAs #52338

Open
lukemanley opened this issue Mar 31, 2023 · 15 comments
Labels
Performance Memory or execution speed performance

Comments

lukemanley (Member) commented Mar 31, 2023

Note the significant performance difference below between two "equal" dataframes:

import pandas as pd
import numpy as np

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
data = np.random.randn(len(mi), 5)

df1 = pd.DataFrame(data, index=mi, dtype="Float64")
df2 = pd.DataFrame(data, index=mi).astype("Float64")

assert df1.equals(df2)

%timeit df1.unstack()
%timeit df2.unstack()

timings:

# 15.9 s ± 438 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       <- SLOW
# 264 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference being:

print(df1[0].array._data.flags.contiguous)   # False
print(df2[0].array._data.flags.contiguous)   # True

df1 keeps the memory layout of the original 2D array, which results in non-contiguous 1D arrays backing the EAs.
df2 ends up contiguous because .astype defaults to copy=True, and the copy produces a contiguous layout for each 1D EA array.
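The layout effect is easy to reproduce with plain NumPy, independent of pandas: a column slice of a C-contiguous (row-major) 2D array is a strided view, while the same slice of a Fortran-ordered array is contiguous. A minimal sketch:

```python
import numpy as np

data = np.random.randn(6, 3)       # NumPy defaults to C (row-major) order
col = data[:, 0]                   # strided view: steps over whole rows
print(col.flags["C_CONTIGUOUS"])   # False

fdata = np.asfortranarray(data)    # column-major copy
fcol = fdata[:, 0]                 # now a contiguous slice
print(fcol.flags["C_CONTIGUOUS"])  # True
```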

Not sure what the correct fix is... Add a PerformanceWarning in BaseMaskedArray.__init__ when copy=False and the input is a non-contiguous ndarray? That might help users, but it still seems easy to run into this. In fact, scanning the ASVs, I see a handful of places where EA-backed dataframes are constructed similar to df1 above.
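For reference, the warning idea could look roughly like the sketch below. The `check_values` helper and the standalone `PerformanceWarning` class are hypothetical stand-ins (pandas has its own `pandas.errors.PerformanceWarning`), not the actual `BaseMaskedArray.__init__` code:

```python
import warnings
import numpy as np

class PerformanceWarning(UserWarning):
    """Stand-in for pandas.errors.PerformanceWarning."""

def check_values(values, copy=False):
    # Hypothetical sketch: warn when asked not to copy a
    # non-contiguous ndarray, since downstream ops will be slow.
    if isinstance(values, np.ndarray) and not copy:
        if not (values.flags["C_CONTIGUOUS"] or values.flags["F_CONTIGUOUS"]):
            warnings.warn(
                "non-contiguous ndarray passed with copy=False; "
                "many operations will be much slower",
                PerformanceWarning,
                stacklevel=2,
            )
    return values
```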

@lukemanley lukemanley added the Performance Memory or execution speed performance label Mar 31, 2023
@lukemanley lukemanley changed the title PERF: DataFrame backed by non-contiguous masked EA PERF: DataFrame backed by non-contiguous masked EAs Mar 31, 2023
topper-123 (Contributor)

assert_frame_equal is also exceptionally slow:

>>> %timeit pd.testing.assert_frame_equal(df1, df2)
10.5 s ± 110 ms per loop

topper-123 (Contributor)

data.flags.c_contiguous is True, i.e. the array is row-major.

We can probably do something like this inside the constructor:

>>> if data.flags.c_contiguous:
...     data = np.asfortranarray(data)
>>> data.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

and the column-sliced arrays will then be contiguous, e.g.

>>> data = np.asfortranarray(data)
>>> data[:, 1].flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

lukemanley (Member, Author)

My understanding is that np.asfortranarray(data) would copy in that case which would be at odds with copy=False. If we're ok with copying regardless, I think data.copy() also produces a contiguous array.
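A quick check of both points (plain NumPy semantics, not pandas-specific): `np.asfortranarray` does copy when handed a C-contiguous 2D array, and copying a strided column yields a contiguous 1D array:

```python
import numpy as np

data = np.ones((4, 3))             # C-contiguous by default
f = np.asfortranarray(data)
print(np.shares_memory(data, f))   # False: a copy was made

col = data[:, 0].copy()            # copying a strided column...
print(col.flags["C_CONTIGUOUS"])   # ...yields a contiguous 1D array
```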

topper-123 (Contributor) commented Apr 1, 2023

Yeah just tried this:

import pandas as pd
import numpy as np

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
data = np.random.randn(len(mi), 5)
data = np.asfortranarray(data)    # new line!

df1 = pd.DataFrame(data, index=mi, dtype="Float64")
df2 = pd.DataFrame(data, index=mi).astype("Float64")

assert df1.equals(df2)

%timeit df1.unstack()
%timeit df2.unstack()

with results:

170 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
168 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

lukemanley (Member, Author)

> We can probably do something like this inside the constructor

Are you referring to the DataFrame constructor or the BaseMaskedArray constructor?

topper-123 (Contributor)

> My understanding is that np.asfortranarray(data) would copy in that case which would be at odds with copy=False. If we're ok with copying regardless, I think data.copy() also produces a contiguous array.

Ok, I haven't checked the code in detail. These slowdowns are very large, so IMO, if copy=None in the DataFrame constructor, we should accept a copy/asfortranarray/something to avoid them, unless we find a way to do it without a copy. Seems worth it to me. The docs should then be updated as well.

topper-123 (Contributor)

> Are you referring to the DataFrame constructor or the BaseMaskedArray constructor?

It must be in the DataFrame constructor to transform the 2D array, right? BaseMaskedArray only deals in 1D arrays.

lukemanley (Member, Author)

If you copy the 1D values passed into BaseMaskedArray, it creates a contiguous array:


import numpy as np

arr_2d = np.random.randn(10, 10)
arr_1d = arr_2d[:, 0]

print(arr_1d.flags.contiguous)  # False

arr_1d = arr_1d.copy()

print(arr_1d.flags.contiguous)  # True

topper-123 (Contributor)

> if you copy the 1D values passed into the BaseMaskedArray it would create a contiguous array

True, but copy=False is the default in BaseMaskedArray.__init__, so we'd have to change the default to copy=None IMO, if we want that. I'm not decided on that, but at least it seems reasonable to interpret copy=None in the DataFrame constructor as an OK to do a copy there (the docs would need to reflect it too).

phofl (Member) commented Apr 1, 2023

There is another issue about this. It isn't without trade-offs, since it would require a copy in the constructor, which is something we'd like to avoid, but it could be introduced in the context of copy-on-write.

Could you post this there?

lukemanley (Member, Author)

> There is another issue about this. This isn’t without trade offs since it would require a copy in the constructor which is something we’d like to avoid but could be introduced in the context of copy on write.
>
> Could you post this there?

Just searched, but not sure which issue you're referring to.

phofl (Member) commented Apr 1, 2023

#50756, even though the scope is a bit different.

topper-123 (Contributor) commented Apr 1, 2023

I agree the scope is a bit different here: in cases with extension dtypes like here, the data will always be sliced by columns, while in the numpy dtype case in #50756, the data might be stored in 2D arrays, which may benefit from C-contiguous arrays in some cases.

Not saying the above changes anything, but it could mean that the treatment for extension dtypes should differ from that for numpy dtypes. In any case, I think the performance costs for non-F-contiguous arrays are striking, and they deserve a serious discussion of the benefits vs. costs of doing a potential asfortranarray or similar on ndarrays in the constructors.

In any case, I think we should discuss this issue in parallel with #50756.

phofl (Member) commented Apr 1, 2023

The advantage is gone, though, as soon as you trigger an operation that forces a copy anyway, which happens with most methods. That's why we should view this in the context of copy-on-write, where we have to create a copy anyway.
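For example, any ufunc that allocates its output already hands back a fresh contiguous buffer, so a carefully chosen input layout only matters until the first such operation (a NumPy-level sketch):

```python
import numpy as np

col = np.random.randn(5, 4)[:, 1]  # non-contiguous column view
print(col.flags["C_CONTIGUOUS"])   # False

out = col + 1.0                    # ufunc allocates a new result array
print(out.flags["C_CONTIGUOUS"])   # True: output is freshly allocated
```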

jbrockmendel (Member)

Some half-baked thoughts:

1. There are a few other issues that boil down to sub-optimal state (PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756 about numpy arrays, PERF: concat of pyarrow string array does not rechunk #42357 about pyarrow arrays, an old issue about consolidating blocks). We could lump all of these together into something like an "optimize" method.

1b. Exactly what optimizing means depends on what the user intends to do. Usually "axis 0 reductions" is a good guess. One advantage of laziness is not having to guess.

2. I am really uncomfortable with copy-at-construction defaults getting more complicated.

3. This problem would be avoided (or at least made no worse than the numpy case) with 2D EAs.
