Skip to content

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
derHeinzer opened this issue Jun 2, 2020 · 10 comments
Labels
Benchmark Performance (ASV) benchmarks Datetime Datetime data dtype good first issue Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@derHeinzer
Copy link

derHeinzer commented Jun 2, 2020

When adding a column to a DataFrame with one level having a DateTime-like dtype, the dtype of the values to be added is explicitly casted to object type in multi.py if the indexes of the values to be setted and the frames index are not identical in pandas version 1.0.3. Those object typed values are beiing transformed to Timestamps later on. This consumes a lot of time for big dataframes.

Comparing Pandas version 0.22.0 and 1.0.3 yields 0.124 seconds vs. 35.274 seconds on my machine on following reproducable setup:

build reproducable setup

iterables = [range(10000), pd.date_range('2020-01-01', periods=200)]
idx = pd.MultiIndex.from_product(iterables, names=['id', 'date'])
df = pd.DataFrame(data=np.random.randn(10000 * 200), index=idx, columns=["value"])
new_col = df[df.index.get_level_values(1) != pd.to_datetime('2020-01-01')] # drop first record of each id
print(df.shape, new_col.shape)

profile performance of set_item

import cProfile
pr = cProfile.Profile()
pr.enable()
df['new_col'] = new_col['value']
pr.disable()
pr.print_stats(sort=2)

@phofl
Copy link
Member

phofl commented Jun 2, 2020

Hey, thanks for your report.

Seems to persist on master. Took me 27 seconds to run this.

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   27.246   27.246 frame.py:2965(__setitem__)
        1    0.000    0.000   27.246   27.246 frame.py:3046(_set_item)
        1    0.000    0.000   27.245   27.245 frame.py:3636(_sanitize_column)
        1    0.000    0.000   27.245   27.245 frame.py:3656(reindexer)
        1    0.000    0.000   27.245   27.245 series.py:4286(reindex)
        1    0.002    0.002   27.245   27.245 generic.py:4236(reindex)
        1    0.000    0.000   27.243   27.243 generic.py:4481(_reindex_axes)
        1    0.000    0.000   27.239   27.239 multi.py:2272(reindex)
        1    0.000    0.000   27.073   27.073 multi.py:2429(get_indexer)
        1    2.044    2.044   27.073   27.073 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.000    0.000   22.136   22.136 base.py:1034(__iter__)
        3    0.017    0.006   22.136    7.379 multi.py:640(_values)
        3    0.000    0.000   22.120    7.373 multi.py:645(values)
        1    0.000    0.000   20.054   20.054 extension.py:269(astype)
        1    0.000    0.000   20.054   20.054 datetimes.py:577(astype)
        1    0.000    0.000   20.054   20.054 datetimelike.py:613(astype)
        1    0.000    0.000   20.054   20.054 datetimelike.py:473(_box_values)
        1    2.044    2.044   20.054   20.054 {pandas._libs.lib.map_infer}
  2000000   14.756    0.000   18.009    0.000 datetimes.py:474(<lambda>)
        2    0.000    0.000    2.884    1.442 base.py:2938(get_indexer)
        5    0.024    0.005    2.867    0.573 base.py:5533(ensure_index)
     11/5    0.000    0.000    2.844    0.569 base.py:283(__new__)
  2000001    1.610    0.000    2.592    0.000 datetimes.py:498(tz)
        3    0.000    0.000    2.429    0.810 base.py:5700(_maybe_cast_data_without_dtype)
        1    2.047    2.047    2.047    2.047 {pandas._libs.lib.fast_zip}
        6    1.323    0.221    1.323    0.221 {pandas._libs.lib.infer_dtype}
        2    0.000    0.000    1.272    0.636 datetimes.py:297(_from_sequence)
        2    0.000    0.000    1.272    0.636 datetimes.py:1873(sequence_to_dt64ns)
        1    0.000    0.000    0.729    0.729 datetimes.py:2003(objects_to_datetime64ns)
        1    0.729    0.729    0.729    0.729 {built-in method pandas._libs.tslib.array_to_datetime}

Rest is omitted

@jreback
Copy link
Contributor

jreback commented Jun 2, 2020

there looks to be some unwanted conversions going on here. welcome for folks to have a look.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jun 2, 2020
@jreback
Copy link
Contributor

jreback commented Jun 2, 2020

you can work around this for now by:

In [40]: %time pd.merge(df.reset_index(), new_col.reset_index(), on=['id', 'date'], how='outer').set_index(['id', 'date'])                                 
CPU times: user 2.37 s, sys: 97.9 ms, total: 2.47 s
Wall time: 658 ms
Out[40]: 
                  value_x   value_y
id   date                          
0    2020-01-01  0.295513       NaN
     2020-01-02  0.274984  0.274984
     2020-01-03 -0.392445 -0.392445
     2020-01-04 -1.443801 -1.443801
     2020-01-05  0.897258  0.897258
...                   ...       ...
9999 2020-07-14  0.614013  0.614013
     2020-07-15  0.209266  0.209266
     2020-07-16  0.042773  0.042773
     2020-07-17 -0.657113 -0.657113
     2020-07-18 -1.629856 -1.629856

[2000000 rows x 2 columns]

@jreback jreback added this to the Contributions Welcome milestone Jun 2, 2020
@qiuwei
Copy link

qiuwei commented Apr 20, 2021

Also related to #23735

@qiuwei
Copy link

qiuwei commented Apr 20, 2021

@jreback
is there any way to get around this for reindex as well?
Thanks in advance!

EDIT:
I realized that the same trick could be applied to reindex operation:

pd.merge(df.to_frame(names=["id", "date"]), new_col.reset_index(), on=['id', 'date'], how='left').set_index(['id', 'date'])

@jreback
Copy link
Contributor

jreback commented Apr 20, 2021

@qiuwei it would be helpful to look into actually what is happening and propose a patch

@mroeschke mroeschke added Performance Memory or execution speed performance Datetime Datetime data dtype labels Aug 7, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Copy link
Member

This issue doesn't look to persist on main anymore. Could use an asv benchmark

@mroeschke mroeschke added good first issue Benchmark Performance (ASV) benchmarks and removed Performance Memory or execution speed performance labels Feb 2, 2024
@dshemetov
Copy link
Contributor

I'd like to take a stab at making an asv benchmark. Sometime this week.

@sumwun1
Copy link

sumwun1 commented May 31, 2024

First, does this issue still need work? It looks pretty old.

I've never contributed to an open source project before (besides fixing typos in documentation), but I do have 1 or 2 years of Python experience and I've read the "Contributing to pandas" page on the website. Is there anything I still need to do before I assign this to myself and start working?

@shriyase
Copy link

Hi, can i contribute to this issue if it's still open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks Datetime Datetime data dtype good first issue Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

No branches or pull requests

8 participants