Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531
Comments
Hey, thanks for your report. Seems to persist on master. Took me 27 seconds to run this.
Rest is omitted

There look to be some unwanted conversions going on here. Folks are welcome to have a look.

You can work around this for now by:

Also related to #23735

@jreback EDIT:

@qiuwei it would be helpful to look into what is actually happening and propose a patch.

This issue doesn't look to persist on main anymore. Could use an asv benchmark.

I'd like to take a stab at making an asv benchmark sometime this week.

First, does this issue still need work? It looks pretty old. I've never contributed to an open source project before (besides fixing typos in documentation), but I do have 1 or 2 years of Python experience and I've read the "Contributing to pandas" page on the website. Is there anything I still need to do before I assign this to myself and start working?

Hi, can I contribute to this issue if it's still open?
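
For anyone picking up the asv benchmark mentioned in the comments, here is a rough sketch of what it might look like. It only relies on the asv convention of a class with a `setup` method and `time_*` methods. The class name, method name, and sizes are placeholders I made up; where it would live in the pandas benchmark suite is not decided in this thread.

```python
import numpy as np
import pandas as pd


class SetitemMultiIndexDatetime:
    """Hypothetical asv benchmark; class and method names are placeholders."""

    # Assumed sizes, smaller than the 10000 x 200 frame in the report.
    n_ids = 1000
    n_dates = 200

    def setup(self):
        dates = pd.date_range("2020-01-01", periods=self.n_dates)
        idx = pd.MultiIndex.from_product(
            [range(self.n_ids), dates], names=["id", "date"]
        )
        self.df = pd.DataFrame(
            np.random.randn(self.n_ids * self.n_dates),
            index=idx,
            columns=["value"],
        )
        # Drop the first date of each id so the two indexes are not identical,
        # which is the condition described in the report.
        mask = idx.get_level_values("date") != dates[0]
        self.new_col = self.df.loc[mask, "value"]

    def time_setitem_unaligned_series(self):
        # The operation being timed: column assignment that needs alignment.
        self.df["new_col"] = self.new_col
```

The frame is built in `setup` so that only the assignment itself is timed.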
When adding a column to a DataFrame whose MultiIndex has one level with a datetime-like dtype, the values to be added are explicitly cast to object dtype in multi.py if the index of the values being set and the frame's index are not identical (pandas 1.0.3). Those object-typed values are converted back to Timestamps later on. This consumes a lot of time for big DataFrames.
Comparing pandas versions 0.22.0 and 1.0.3 yields 0.124 seconds vs. 35.274 seconds on my machine with the following reproducible setup:
# Build reproducible setup
import numpy as np
import pandas as pd

iterables = [range(10000), pd.date_range('2020-01-01', periods=200)]
idx = pd.MultiIndex.from_product(iterables, names=['id', 'date'])
df = pd.DataFrame(data=np.random.randn(10000 * 200), index=idx, columns=["value"])
new_col = df[df.index.get_level_values(1) != pd.to_datetime('2020-01-01')]  # drop first record of each id
print(df.shape, new_col.shape)

# Profile performance of set_item
import cProfile
pr = cProfile.Profile()
pr.enable()
df['new_col'] = new_col['value']
pr.disable()
pr.print_stats(sort='cumulative')  # same ordering as the legacy sort=2
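
To make clear what the assignment is expected to produce, here is a scaled-down sketch of the same setup (3 ids and 4 dates, sizes chosen only for illustration). It shows the index alignment that `df['new_col'] = ...` performs: rows missing from the assigned Series end up as NaN, and all other rows keep their values. It does not reproduce the slowdown, only the semantics.

```python
import numpy as np
import pandas as pd

# Tiny version of the frame from the report (illustrative sizes).
iterables = [range(3), pd.date_range("2020-01-01", periods=4)]
idx = pd.MultiIndex.from_product(iterables, names=["id", "date"])
df = pd.DataFrame({"value": np.arange(12, dtype=float)}, index=idx)

# Drop the first record of each id, as in the original setup.
mask = df.index.get_level_values("date") != pd.Timestamp("2020-01-01")
new_col = df[mask]

# The assignment aligns new_col["value"] to df.index.
df["new_col"] = new_col["value"]

# Rows that were dropped from new_col have no match and become NaN.
missing = df["new_col"].isna()
print(df)
print("rows without a match:", int(missing.sum()))
```

Since the result depends only on matching index labels, the conversion to object dtype and back described above should not be needed to get this output.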