PERF: DataFrame backed by non-contiguous masked EAs #52338
The […]

```
>>> %timeit pd.testing.assert_frame_equal(df1, df2)
10.5 s ± 110 ms per loop
```
We can probably do something like this inside the constructor:

```
>>> if data.flags.c_contiguous:
...     data = np.asfortranarray(data)
>>> data.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
```

and the column-sliced arrays will then be contiguous, e.g.

```
>>> data = np.asfortranarray(data)
>>> data[:, 1].flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
```
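To make the layout point above concrete, a small standalone NumPy check (array sizes are arbitrary, chosen only for illustration): a column slice of a C-ordered 2D array is a strided view, while the same slice of a Fortran-ordered array is contiguous.

```python
import numpy as np

# A (n, 5) float64 array is C-contiguous by default: rows are adjacent
# in memory, so a single column is a view with a stride of 5 elements.
data_c = np.random.randn(1000, 5)
data_f = np.asfortranarray(data_c)  # same values, column-major layout

col_c = data_c[:, 1]
col_f = data_f[:, 1]

print(col_c.flags.c_contiguous)  # False: strided view, step = 5 floats
print(col_f.flags.c_contiguous)  # True: each column is one contiguous run
```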
My understanding is that […]
Yeah, just tried this:

```python
import pandas as pd
import numpy as np

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)])
data = np.random.randn(len(mi), 5)
data = np.asfortranarray(data)  # new line!

df1 = pd.DataFrame(data, index=mi, dtype="Float64")
df2 = pd.DataFrame(data, index=mi).astype("Float64")
assert df1.equals(df2)

%timeit df1.unstack()
%timeit df2.unstack()
```

with results: […]
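For reference, the comparison can also be written with the stdlib `timeit` module instead of IPython magics; the sketch below compares the two memory layouts directly. The reduced sizes and the `df_c`/`df_f` names are mine, and absolute timings are machine-dependent.

```python
import timeit

import numpy as np
import pandas as pd

# Reduced sizes so this runs quickly; the effect scales with array size.
mi = pd.MultiIndex.from_product([np.arange(100), np.arange(100)])
data = np.random.randn(len(mi), 5)

# C-order input -> non-contiguous 1D masked arrays per column.
df_c = pd.DataFrame(data, index=mi, dtype="Float64")
# F-order input -> contiguous 1D masked arrays per column.
df_f = pd.DataFrame(np.asfortranarray(data), index=mi, dtype="Float64")
assert df_c.equals(df_f)

t_c = timeit.timeit(df_c.unstack, number=3)
t_f = timeit.timeit(df_f.unstack, number=3)
print(f"C-order input: {t_c:.3f}s, F-order input: {t_f:.3f}s")
```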
Are you referring to the […]
OK, I haven't checked the code in detail. These slowdowns are very large, so IMO, if […]
Must be in the […] if you copy the 1D values passed into the […]
True, but […]
There is another issue about this. It isn't without trade-offs, since it would require a copy in the constructor, which is something we'd like to avoid, but it could be introduced in the context of copy-on-write. Could you post this there?
Just searched, but I'm not sure which issue you're referring to.
#50756, even though the scope is a bit different.
I agree the scope is a bit different here: in cases with extension dtypes like here, […]

Not saying the above changes anything, but it could mean that the treatment for extension dtypes should be different than for numpy dtypes. In any case, I think the performance costs for non-F-contiguous arrays are striking, and they deserve a serious discussion of the benefits vs. costs of doing a potential […]. In any case, I think we should discuss this issue in parallel with #50756.
The advantage is gone, though, as soon as you trigger an operation that forces a copy anyway, which happens with most methods. That's why we should view this in the context of copy-on-write, where we have to create a copy anyway.
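The point above can be illustrated with `np.shares_memory`: zero-copy construction only pays off until the first operation that allocates a new result anyway. This is a sketch; whether the constructor keeps a view of the input is version-dependent.

```python
import numpy as np
import pandas as pd

arr = np.random.randn(4, 2)
df = pd.DataFrame(arr)  # typically stores arr as-is (no copy)

# Often True for a homogeneous float frame, but version-dependent:
share_before = np.shares_memory(arr, df.to_numpy())
# Always False: df + 1 computes into freshly allocated memory.
share_after = np.shares_memory(arr, (df + 1).to_numpy())
print(share_before, share_after)
```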
Some half-baked thoughts:

1b) Exactly what "optimizing" means depends on what the user intends to do. Usually "axis 0 reductions" is a good guess. One advantage of laziness is not having to guess.
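A rough illustration of 1b): which layout is "optimal" depends on the access pattern. Column-wise (axis 0) work tends to favour Fortran order, row-wise work C order; the sizes below are arbitrary and the magnitude of the difference varies by machine and NumPy version.

```python
import timeit

import numpy as np

c = np.random.randn(10_000, 200)  # C-contiguous: rows adjacent in memory
f = np.asfortranarray(c)          # same values, columns adjacent in memory

# An axis-0 reduction walks down each column; with F order that is a
# sequential memory scan, with C order it is strided access.
t_c = timeit.timeit(lambda: c.sum(axis=0), number=20)
t_f = timeit.timeit(lambda: f.sum(axis=0), number=20)
print(f"axis-0 sum, C order: {t_c:.4f}s; F order: {t_f:.4f}s")
```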
Note the significant difference in performance below for "equal" dataframes: […]

timings: […]

The difference being:

- df1 maintains the memory layout of the original 2D array, which results in non-contiguous 1D EA arrays.
- df2 ends up being contiguous, because `.astype` defaults to `copy=True` and the copy results in a contiguous layout for the 1D EA arrays.

Not sure what the correct fix is... Add a `PerformanceWarning` in `BaseMaskedArray.__init__` when `copy=False` and the input is a non-contiguous ndarray? That might help users, but it still seems easy to run into this. In fact, scanning the ASVs, I see a handful of places where EA-backed dataframes are constructed similarly to df1 above.
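For illustration, a hypothetical sketch of what such a warning could look like. `warn_if_noncontiguous` and the local `PerformanceWarning` class are stand-ins of mine, not pandas internals:

```python
import warnings

import numpy as np


class PerformanceWarning(UserWarning):
    """Stand-in for pandas.errors.PerformanceWarning."""


def warn_if_noncontiguous(values: np.ndarray, copy: bool) -> np.ndarray:
    """Hypothetical constructor-side check, not actual pandas code."""
    if not copy and values.ndim == 1 and not values.flags.c_contiguous:
        warnings.warn(
            "masked array got a non-contiguous buffer without copying; "
            "operations may be slow (consider copy=True).",
            PerformanceWarning,
            stacklevel=2,
        )
    return values


# A column slice of a C-ordered 2D array is exactly the problematic case:
data = np.random.randn(100, 5)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_noncontiguous(data[:, 1], copy=False)

print(len(caught))  # 1 warning recorded
```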