BUG: df.to_dict(orient="records") significantly slower in Pandas 1.3.0 #42352

Closed
kyri-petrou opened this issue Jul 3, 2021 · 2 comments · Fixed by #42486
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Performance (memory or execution speed performance), Regression (functionality that used to work in a prior pandas version)


@kyri-petrou

Using df.to_dict(orient="records") with large dataframes is significantly slower in pandas 1.3.0 vs 1.2.5.

Could you please advise on what might be the cause of this issue?

Test dataframe

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100823 entries, 0 to 262141
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Class            100823 non-null  object 
 1   x                100823 non-null  float64
 2   y                100823 non-null  float64
 3   z                100823 non-null  float64
 4   rgb              100823 non-null  object 
 5   distance         100823 non-null  float64
 6   treecluster      100820 non-null  float64
 7   normal           100823 non-null  object 
 8   color            100823 non-null  object 
 9   rgb_distance     100823 non-null  object 
 10  responsibility   100820 non-null  object 
 11  vp_codes         100820 non-null  float64
 12  rgb_treecluster  100823 non-null  object 
dtypes: float64(6), object(7)
memory usage: 10.8+ MB
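
The original data is not attached to the issue, so the following is only a hypothetical stand-in: a minimal sketch that builds a frame with the same column names, row count, and dtype mix (6 float64, 7 object) as the `info()` output above. The helper name `make_test_frame`, the random floats, and the tuple contents of the object columns are all assumptions made for illustration; the missing values and the non-contiguous index of the real frame are not reproduced.

```python
import numpy as np
import pandas as pd

# Column names and dtype split copied from the info() output in the report.
COLUMNS = [
    "Class", "x", "y", "z", "rgb", "distance", "treecluster", "normal",
    "color", "rgb_distance", "responsibility", "vp_codes", "rgb_treecluster",
]
FLOAT_COLUMNS = {"x", "y", "z", "distance", "treecluster", "vp_codes"}


def make_test_frame(n_rows: int = 100_823, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic frame; all values are made up (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = {}
    for name in COLUMNS:
        if name in FLOAT_COLUMNS:
            data[name] = rng.random(n_rows)  # float64 column
        else:
            # Placeholder tuples, stored by pandas as an object column.
            data[name] = [(name, i % 10) for i in range(n_rows)]
    return pd.DataFrame(data)


df = make_test_frame()
print(df.dtypes.value_counts())
```

Passing such a frame to `df.to_dict(orient="records")` under each pandas version should exercise the same mixed float/object code path, although absolute timings will differ from the ones reported here.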

Simple timing test

[image: screenshot of the timing results, not captured in the text]

Profiling

Pandas 1.2.5

         5547864 function calls (5547672 primitive calls) in 1.791 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1310699    0.568    0.000    0.831    0.000 cast.py:137(maybe_box_datetimelike)
  1411522    0.351    0.000    1.181    0.000 frame.py:1601(<genexpr>)
   100824    0.288    0.000    0.288    0.000 frame.py:1596(<genexpr>)
        1    0.280    0.280    1.759    1.759 frame.py:1600(<listcomp>)
  2621687    0.263    0.000    0.263    0.000 {built-in method builtins.isinstance}
        1    0.028    0.028    1.791    1.791 <string>:2(<module>)
   100823    0.010    0.000    0.010    0.000 {method 'items' of 'dict' objects}
     83/3    0.002    0.000    0.002    0.001 {built-in method _abc._abc_subclasscheck}
       13    0.000    0.000    0.001    0.000 indexing.py:782(_getitem_lowerdim)
       26    0.000    0.000    0.000    0.000 generic.py:5467(__setattr__)

Pandas 1.3.0

         35794233 function calls (35794206 primitive calls) in 15.844 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1310699    2.127    0.000   11.738    0.000 common.py:1578(_is_dtype_type)
   806581    2.013    0.000    5.492    0.000 base.py:425(find)
  1310696    1.942    0.000    8.020    0.000 common.py:1744(pandas_dtype)
  2218073    1.415    0.000    1.740    0.000 base.py:208(construct_from_string)
 12703947    1.410    0.000    1.410    0.000 {built-in method builtins.isinstance}
  1310699    1.066    0.000   14.387    0.000 cast.py:173(maybe_box_native)
  1310699    0.962    0.000   12.927    0.000 common.py:996(is_datetime_or_timedelta_dtype)
  1411522    0.598    0.000   14.985    0.000 frame.py:1823(<genexpr>)
        1    0.498    0.498   15.815   15.815 frame.py:1822(<listcomp>)
  1310699    0.344    0.000    0.535    0.000 common.py:146(<lambda>)
   100824    0.313    0.000    0.313    0.000 frame.py:1818(<genexpr>)
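
Reports in the format shown above can be produced with the standard-library `cProfile` and `pstats` modules. The sketch below is one way to do it, not necessarily how the reporter generated the listings; the helper name `profile_to_dict` and the small integer frame are assumptions made for illustration. Sorting by `"tottime"` is what yields the "Ordered by: internal time" header.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_to_dict(df: pd.DataFrame, top: int = 10) -> str:
    """Profile df.to_dict(orient="records") and return the report as text."""
    profiler = cProfile.Profile()
    # runcall profiles exactly one call to the given function.
    profiler.runcall(df.to_dict, orient="records")
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is time spent in the function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


# Small illustrative frame; the real report used a 100823 x 13 frame.
rng = np.random.default_rng(0)
small_df = pd.DataFrame(rng.integers(0, 1000, size=(1000, 10)))
report = profile_to_dict(small_df)
print(report)
```

Running the same script under pandas 1.2.5 and 1.3.0 and comparing the top entries is enough to see whether the dtype-checking helpers (`_is_dtype_type`, `pandas_dtype`, `is_datetime_or_timedelta_dtype`) dominate, as they do in the 1.3.0 listing above.
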
@kyri-petrou added the Bug and Needs Triage labels on Jul 3, 2021
@mzeitlin11
Member

Thanks for reporting and taking the time to profile this @kyri-petrou! Confirmed on current master with a simple reproducer like

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

data = rng.integers(0, 1000, size=(10000, 10))
df = pd.DataFrame(data)
df.to_dict(orient="records")  # 0.73s on master, 0.12s on 1.2.5 (under profiling conditions)

Investigations welcome!
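
To compare versions without profiler overhead, the reproducer above can be wrapped in a small `timeit` harness. This is a minimal sketch; the helper name `time_to_dict` and the choice of three repeats are assumptions, and the numbers it prints will depend on the machine and the installed pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_to_dict(df: pd.DataFrame, repeat: int = 3) -> float:
    """Return the best wall-clock time in seconds for one to_dict call."""
    timings = timeit.repeat(
        lambda: df.to_dict(orient="records"), number=1, repeat=repeat
    )
    # The minimum is the measurement least disturbed by other processes.
    return min(timings)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(10000, 10)))
best = time_to_dict(df)
print(f"pandas {pd.__version__}: {best:.3f}s")
```

Running the script once in an environment with pandas 1.2.5 and once with 1.3.0 gives two directly comparable lines of output.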

@mzeitlin11 added the Dtype Conversions, Performance, and Regression labels and removed the Bug and Needs Triage labels on Jul 11, 2021
@mzeitlin11 added this to the 1.3.1 milestone on Jul 11, 2021
@mzeitlin11
Member

Looks due to #37648, will look further soon
