Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

derHeinzer · 2020-06-02T14:24:25Z

When adding a column to a DataFrame with one level having a DateTime-like dtype, the dtype of the values to be added is explicitly casted to object type in multi.py if the indexes of the values to be setted and the frames index are not identical in pandas version 1.0.3. Those object typed values are beiing transformed to Timestamps later on. This consumes a lot of time for big dataframes.

Comparing Pandas version 0.22.0 and 1.0.3 yields 0.124 seconds vs. 35.274 seconds on my machine on following reproducable setup:

build reproducable setup

iterables = [range(10000), pd.date_range('2020-01-01', periods=200)]
idx = pd.MultiIndex.from_product(iterables, names=['id', 'date'])
df = pd.DataFrame(data=np.random.randn(10000 * 200), index=idx, columns=["value"])
new_col = df[df.index.get_level_values(1) != pd.to_datetime('2020-01-01')] # drop first record of each id
print(df.shape, new_col.shape)

profile performance of set_item

import cProfile
pr = cProfile.Profile()
pr.enable()
df['new_col'] = new_col['value']
pr.disable()
pr.print_stats(sort=2)

phofl · 2020-06-02T16:14:19Z

Hey, thanks for your report.

Seems to persist on master. Took me 27 seconds to run this.

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   27.246   27.246 frame.py:2965(__setitem__)
        1    0.000    0.000   27.246   27.246 frame.py:3046(_set_item)
        1    0.000    0.000   27.245   27.245 frame.py:3636(_sanitize_column)
        1    0.000    0.000   27.245   27.245 frame.py:3656(reindexer)
        1    0.000    0.000   27.245   27.245 series.py:4286(reindex)
        1    0.002    0.002   27.245   27.245 generic.py:4236(reindex)
        1    0.000    0.000   27.243   27.243 generic.py:4481(_reindex_axes)
        1    0.000    0.000   27.239   27.239 multi.py:2272(reindex)
        1    0.000    0.000   27.073   27.073 multi.py:2429(get_indexer)
        1    2.044    2.044   27.073   27.073 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.000    0.000   22.136   22.136 base.py:1034(__iter__)
        3    0.017    0.006   22.136    7.379 multi.py:640(_values)
        3    0.000    0.000   22.120    7.373 multi.py:645(values)
        1    0.000    0.000   20.054   20.054 extension.py:269(astype)
        1    0.000    0.000   20.054   20.054 datetimes.py:577(astype)
        1    0.000    0.000   20.054   20.054 datetimelike.py:613(astype)
        1    0.000    0.000   20.054   20.054 datetimelike.py:473(_box_values)
        1    2.044    2.044   20.054   20.054 {pandas._libs.lib.map_infer}
  2000000   14.756    0.000   18.009    0.000 datetimes.py:474(<lambda>)
        2    0.000    0.000    2.884    1.442 base.py:2938(get_indexer)
        5    0.024    0.005    2.867    0.573 base.py:5533(ensure_index)
     11/5    0.000    0.000    2.844    0.569 base.py:283(__new__)
  2000001    1.610    0.000    2.592    0.000 datetimes.py:498(tz)
        3    0.000    0.000    2.429    0.810 base.py:5700(_maybe_cast_data_without_dtype)
        1    2.047    2.047    2.047    2.047 {pandas._libs.lib.fast_zip}
        6    1.323    0.221    1.323    0.221 {pandas._libs.lib.infer_dtype}
        2    0.000    0.000    1.272    0.636 datetimes.py:297(_from_sequence)
        2    0.000    0.000    1.272    0.636 datetimes.py:1873(sequence_to_dt64ns)
        1    0.000    0.000    0.729    0.729 datetimes.py:2003(objects_to_datetime64ns)
        1    0.729    0.729    0.729    0.729 {built-in method pandas._libs.tslib.array_to_datetime}

Rest is omitted

jreback · 2020-06-02T23:20:01Z

there looks to be some unwanted conversions going on here. welcome for folks to have a look.

jreback · 2020-06-02T23:22:14Z

you can work around this for now by:

In [40]: %time pd.merge(df.reset_index(), new_col.reset_index(), on=['id', 'date'], how='outer').set_index(['id', 'date'])                                 
CPU times: user 2.37 s, sys: 97.9 ms, total: 2.47 s
Wall time: 658 ms
Out[40]: 
                  value_x   value_y
id   date                          
0    2020-01-01  0.295513       NaN
     2020-01-02  0.274984  0.274984
     2020-01-03 -0.392445 -0.392445
     2020-01-04 -1.443801 -1.443801
     2020-01-05  0.897258  0.897258
...                   ...       ...
9999 2020-07-14  0.614013  0.614013
     2020-07-15  0.209266  0.209266
     2020-07-16  0.042773  0.042773
     2020-07-17 -0.657113 -0.657113
     2020-07-18 -1.629856 -1.629856

[2000000 rows x 2 columns]

qiuwei · 2021-04-20T07:12:21Z

Also related to #23735

qiuwei · 2021-04-20T07:13:40Z

@jreback
is there any way to get around this for reindex as well?
Thanks in advance!

EDIT:
I realized that the same trick could be applied to reindex operation:

pd.merge(df.to_frame(names=["id", "date"]), new_col.reset_index(), on=['id', 'date'], how='left').set_index(['id', 'date'])

jreback · 2021-04-20T10:51:17Z

@qiuwei it would be helpful to look into actually what is happening and propose a patch

mroeschke · 2024-02-02T20:09:03Z

This issue doesn't look to persist on main anymore. Could use an asv benchmark

dshemetov · 2024-05-06T18:14:26Z

I'd like to take a stab at making an asv benchmark. Sometime this week.

sumwun1 · 2024-05-31T22:47:31Z

First, does this issue still need work? It looks pretty old.

I've never contributed to an open source project before (besides fixing typos in documentation), but I do have 1 or 2 years of Python experience and I've read the "Contributing to pandas" page on the website. Is there anything I still need to do before I assign this to myself and start working?

shriyase · 2024-06-21T20:06:00Z

Hi, can i contribute to this issue if it's still open?

jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jun 2, 2020

jreback added this to the Contributions Welcome milestone Jun 2, 2020

This was referenced Apr 21, 2021

New engine for MultiIndex? #18519

Closed

CLN: make DTA/TDA check clearer #33665

Merged

WIP: avoid zip and unzip to improve the speed of get_indexer #41177

Closed

mroeschke added Performance Memory or execution speed performance Datetime Datetime data dtype labels Aug 7, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

mroeschke added good first issue Benchmark Performance (ASV) benchmarks and removed Performance Memory or execution speed performance labels Feb 2, 2024

mroeschke closed this as completed Aug 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

derHeinzer commented Jun 2, 2020 •

edited

Loading

phofl commented Jun 2, 2020 •

edited

Loading

jreback commented Jun 2, 2020

jreback commented Jun 2, 2020

qiuwei commented Apr 20, 2021

qiuwei commented Apr 20, 2021 •

edited

Loading

jreback commented Apr 20, 2021

mroeschke commented Feb 2, 2024

dshemetov commented May 6, 2024

sumwun1 commented May 31, 2024 •

edited

Loading

shriyase commented Jun 21, 2024

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

Comments

derHeinzer commented Jun 2, 2020 • edited Loading

build reproducable setup

profile performance of set_item

phofl commented Jun 2, 2020 • edited Loading

jreback commented Jun 2, 2020

jreback commented Jun 2, 2020

qiuwei commented Apr 20, 2021

qiuwei commented Apr 20, 2021 • edited Loading

jreback commented Apr 20, 2021

mroeschke commented Feb 2, 2024

dshemetov commented May 6, 2024

sumwun1 commented May 31, 2024 • edited Loading

shriyase commented Jun 21, 2024

derHeinzer commented Jun 2, 2020 •

edited

Loading

phofl commented Jun 2, 2020 •

edited

Loading

qiuwei commented Apr 20, 2021 •

edited

Loading

sumwun1 commented May 31, 2024 •

edited

Loading