BUG: df.__setitem__ can be 10x slower than pd.concat(..., axis=1) #37954
Hi, thanks for your report. We might have a different problem here: the setitem example raises on master for me. cc @jbrockmendel

Edit: as reported, it works on 1.1.4.
I'll take a closer look when I get a chance.

I took some time to figure out what was going on, but made a wrong turn. Right now, it looks like the majority of the time comes from `new = pd.DataFrame(index=df.index)`:

```
%timeit new = pd.DataFrame(index=df.index); new[x_col] = x
3.26 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

new = pd.DataFrame(index=df.index)
%timeit new = pd.DataFrame(index=df.index); new[df.columns] = df
636 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

However, it looks like in both cases the code falls back to looping over the keys and adding the columns individually. Surprisingly (to me), for the array case […] Additionally, it looks like […]
profile of `concat` (collapsed)

profile of `setitem` (collapsed)
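As a rough illustration of the per-column fallback described above (a simplification for exposition, not pandas' actual internal code), the listlike assignment ends up behaving like a loop over individual column inserts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
new = pd.DataFrame(index=pd.RangeIndex(1_000))
x = rng.standard_normal((1_000, 4))
x_col = ["a", "b", "c", "d"]

# Roughly what the fallback does: insert each column one at a time,
# paying block-manager bookkeeping costs per column.
for i, col in enumerate(x_col):
    new[col] = x[:, i]
```

Each individual insert is cheap, but the per-column overhead adds up, which is consistent with the timings above.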
I found this surprising too a few weeks ago and changed it to avoid a double allocation, but it looks like that change is what caused the bug @phofl is fixing for us. Even with that fix in, I think it will still iterate over columns, so performance will leave something to be desired.
For the […]
Expanded:

```
%%prun -s cumtime
for _ in range(1000):
    setitem(x, x_col, df)
```

```
%%prun -s cumtime
for _ in range(1000):
    concat(x, x_col, df)
```
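The `setitem` and `concat` helpers being profiled are not shown in this capture; presumably they were small wrappers along these lines (definitions inferred from the call sites, so treat the exact signatures as an assumption):

```python
import numpy as np
import pandas as pd

def setitem(x, x_col, df):
    # Build an empty frame on df's index, then assign the whole block of columns.
    new = pd.DataFrame(index=df.index)
    new[x_col] = x
    return new

def concat(x, x_col, df):
    # Same result via pd.concat along the column axis.
    return pd.concat([pd.DataFrame(x, index=df.index, columns=x_col)], axis=1)

# Small illustrative inputs for the wrappers.
df = pd.DataFrame(index=pd.RangeIndex(100))
x = np.random.default_rng(1).standard_normal((100, 3))
x_col = ["a", "b", "c"]
```

Both produce an identical frame; the point of the issue is that the `setitem` path can be an order of magnitude slower.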
@phofl I'm thinking we should try to address this by avoiding the for-loop in `_ensure_listlike_indexer`. What do you think?
@jbrockmendel Three ideas come to mind: […]
You need to reindex to the new key set.
@jreback Wouldn't reindex create a copy instead of modifying in place?
Oh, that's a good question. We might be able to avoid it by passing `only_slice=True` to `_slice_take_blocks_ax0`, but we would need to add that kwarg to `reindex_indexer` and `reindex_axis`.
That does not seem to avoid a copy, but I think performance there is good enough now that this does not matter much. Thanks for explaining.
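On the copy question above: that `reindex` copies rather than modifying in place can be checked directly. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5.0)})
reindexed = df.reindex(columns=["a", "b"])  # adds an all-NaN column "b"

# Mutating the reindexed frame leaves the original untouched,
# i.e. the reindex produced a copy of the data.
reindexed.iloc[0, 0] = 99.0
```

After the mutation, `df` still holds its original values, confirming the copy.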
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
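The original copy-pastable snippet is not preserved in this capture; a minimal reproduction consistent with the title and the comments (the names `df`, `x`, and `x_col` come from the discussion; the shapes are illustrative assumptions) might look like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((10_000, 10)),
                  columns=[f"c{i}" for i in range(10)])
x = rng.standard_normal((10_000, 5))
x_col = [f"x{i}" for i in range(5)]

# Idiom 1: build an empty frame, then assign a block of new columns
# (goes through DataFrame.__setitem__).
new = pd.DataFrame(index=df.index)
new[x_col] = x

# Idiom 2: the same result via pd.concat along the column axis.
alt = pd.concat([pd.DataFrame(x, index=df.index, columns=x_col)], axis=1)
```

Timing the two idioms (e.g. with `%timeit`) is what shows the reported ~10x gap.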
Problem description
This seems unintuitive from a performance perspective. I would assume these would be close to equivalent; the setitem implementation might even be expected to perform better due to less allocation. While this is a stripped-down example, the use case is building a DataFrame to return from a function, and making a DataFrame and then adding columns seemed like the natural idiom here.
Output of `pd.show_versions()` (collapsed)