Skip to content

REGR: to_dict(orient={"records", "split", "dict"}) slowdown #42486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 12, 2021

Conversation

mzeitlin11
Copy link
Member

Regression was due to each scalar value hitting is_datetime_or_timedelta_dtype -> _is_dtype_type where it falls most of the way through.

@mzeitlin11 mzeitlin11 added Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Jul 11, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 11, 2021
@jreback
Copy link
Contributor

jreback commented Jul 11, 2021

pls
run a lot of asvs

this is a very hot path and don't want to affect other things

@@ -196,6 +193,8 @@ def maybe_box_native(value: Scalar) -> Scalar:
value = int(value) # type: ignore[arg-type]
elif is_bool(value):
value = bool(value)
elif isinstance(value, (np.datetime64, np.timedelta64)):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this condition should be equivalent to previous. There was this comment and followup https://github.com/pandas-dev/pandas/pull/37648/files#r576461677, but the tests seem to contradict by mapping datetime -> datetime (behavior which isn't changed here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont you need datetime, timedelta here as well (e.g. python scalar types)? or i suppose these are already boxed right so we don't need them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep exactly, these are already boxed correctly (and we look to have some test coverage for this, for example in pandas.tests.frame.methods.test_to_dict::TestDataFrameToDict::test_to_dict_tz)

@mzeitlin11
Copy link
Member Author

mzeitlin11 commented Jul 11, 2021

this is a very hot path and don't want to affect other things

EDIT: The function modified is maybe_box_native (added in #37648), and I can only find it hit in to_dict paths

to_dict asvs:

       before           after         ratio
     [87d78559]       [614a926a]
     <master>         <regr_to_dict_perf>
-        484±10ms          168±2ms     0.35  frame_methods.ToDict.time_to_dict_datetimelike('dict')
-         479±3ms          164±3ms     0.34  frame_methods.ToDict.time_to_dict_datetimelike('records')
-         465±4ms          159±9ms     0.34  frame_methods.ToDict.time_to_dict_datetimelike('split')
-         149±2ms         21.6±2ms     0.15  frame_methods.ToDict.time_to_dict_ints('records')
-         141±6ms       13.7±0.4ms     0.10  frame_methods.ToDict.time_to_dict_ints('dict')
-         135±6ms       12.9±0.4ms     0.10  frame_methods.ToDict.time_to_dict_ints('split')

@jreback
Copy link
Contributor

jreback commented Jul 11, 2021

no run all the asvs

or at the very least casting and construction ones

@mzeitlin11
Copy link
Member Author

no run all the asvs

or at the very least casting and construction ones

Sorry, please see comment above, just edited

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk perf looks fine (i see we dont' use this routine except where you are testing)

@@ -196,6 +193,8 @@ def maybe_box_native(value: Scalar) -> Scalar:
value = int(value) # type: ignore[arg-type]
elif is_bool(value):
value = bool(value)
elif isinstance(value, (np.datetime64, np.timedelta64)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont you need datetime, timedelta here as well (e.g. python scalar types)? or i suppose these are already boxed right so we don't need them?

@jreback jreback merged commit 5831b06 into pandas-dev:master Jul 12, 2021
@jreback
Copy link
Contributor

jreback commented Jul 12, 2021

thanks @mzeitlin11

@jreback
Copy link
Contributor

jreback commented Jul 12, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Jul 12, 2021

Something went wrong ... Please have a look at my logs.

@mzeitlin11 mzeitlin11 deleted the regr_to_dict_perf branch July 12, 2021 12:46
jreback pushed a commit that referenced this pull request Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: df.to_dict(orient="records") significantly slower in Pandas 1.3.0
2 participants