duplicated() performance and bug on long rows regression from 0.15.2->0.16.0 #10161
Comments
Looks like this may be related to #9398. @behzadnouri any ideas? |
Yes, #9398 will not scale well with a very wide frame, in exactly the same way that joining two frames on 1,000,000 columns will not scale well. for |
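To make the scaling concern concrete, here is a rough illustrative timing sketch (the column counts are arbitrary assumptions, not from this thread); the cost of duplicated() grows with the number of columns, which is exactly what transposing a long frame produces:

```python
# Illustrative only: the post-#9398 duplicated() does per-column work, so a
# very wide frame (e.g. the transpose of a long one) pays a cost per column.
# Column counts below are arbitrary assumptions.
import time

import numpy as np
import pandas as pd

for n_cols in (10, 1000, 100000):
    df = pd.DataFrame(np.zeros((4, n_cols)))  # 4 identical rows, n_cols wide
    start = time.time()
    df.duplicated()
    print("%7d columns: %.3fs" % (n_cols, time.time() - start))
```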
Thanks. The column names are unique. Moreover, it works when using only partial data (fewer rows), so I guess it's related to the wide-frame issue on the transposed data |
@eyaler are you transposing the frame? If you are transposing the frame before calling into |
After transposing, the column names become the previous row numbers, which are unique. |
@behzadnouri any thoughts on this? |
@jreback I guess the easiest solution would be to switch to the old code for wide frames if no subset is selected. |
@behzadnouri that sounds ok |
@behzadnouri would you be able to put up a fix for this in the coming days? As this is a regression, I think we should try to include a fix in 0.16.2, to be released this Friday. |
This issue still seems to exist, running Python 2.7.6, on a relatively small dataframe: 5000 rows of mostly numeric data with a few date and short object columns (no duplicate columns).
In an older version of pandas (0.12.0), on the same dataframe:
For now I have reimplemented the old duplicated function and am calling it separately. I also tested this in Python 3.4 and had the same results. |
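For reference, a minimal sketch of what such a tuple-based reimplementation might look like (an assumption about the approach, not the poster's actual code; duplicated_by_tuples is a hypothetical name):

```python
# Hypothetical standalone helper in the spirit of the pre-0.16 approach:
# a row is marked as a duplicate if its tuple of values has been seen before.
# Note: unlike pandas, this may treat NaN values as unequal to each other.
import pandas as pd

def duplicated_by_tuples(df):
    seen = set()
    flags = []
    for row in df.itertuples(index=False):
        flags.append(row in seen)
        seen.add(row)
    return pd.Series(flags, index=df.index)
```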
The issue is still marked as open, though your timings are a bit odd. You might want to show the dataframe's info.
I see that if I add a different dtype then this does slow down. |
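A small sketch of a mixed-dtype frame along the lines described (row count and dtypes are assumptions) that can be used to time duplicated() across versions:

```python
# Assumed reconstruction of a "mostly numeric, a few date and object columns"
# frame with ~5000 rows, for timing df.duplicated() across pandas versions.
import numpy as np
import pandas as pd

n = 5000
df = pd.DataFrame({
    "num1": np.random.randn(n),
    "num2": np.random.randint(0, 100, size=n),
    "date": pd.date_range("2015-01-01", periods=n),
    "tag": ["short-string"] * n,
})

df.duplicated()  # time this call to compare versions
```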
Great, here is the info from the dataframe I used:
|
I will patch this later today |
Closed by #11180 |
The issue seems to persist on an extremely wide pandas DataFrame with 79 columns. Columns with the same name have been removed!
|
This is a long-closed issue; if you have a case |
The following works quickly in 0.15.2 but has a performance issue on the last operation, df.T.duplicated(), in 0.16.0 and 0.16.1.
Also, on a private data set that works in 0.15.2, I get an error in 0.16.0 and 0.16.1 on the same operation.
Code:
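A minimal sketch matching the description above (the frame's shape and contents are assumptions, not the original snippet):

```python
# Sketch of the reported scenario: transposing a long frame produces an
# extremely wide frame, and duplicated() on the result is the slow (or, on
# some data, erroring) call in 0.16.0/0.16.1. The shape below is an
# arbitrary assumption.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 5))

df.duplicated()      # fast in both 0.15.2 and 0.16.x
df.T.duplicated()    # the reported regression: slow in 0.16.0 and 0.16.1
```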