-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: de-duplicate nested-dict handling in DataFrame construction #41785
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
oh i like this! |
This might have caused a big slowdown in one of the benchmarks: https://pandas.pydata.org/speed/pandas/#frame_ctor.FromDicts.time_nested_dict_int64?python=3.8&Cython=0.29.21 @jbrockmendel can you verify this? |
That looks plausible; noted in the OP we'd take a perf hit in these cases |
Yes, but "some overhead" turns out to be a 6x slowdown. |
do you have anything in particular in mind? could revert until after the release |
I didn't follow the original context, but IIUC, this PR didn't "fix" anything, correct? (like a bug) But it did remove a faster cython implementation to reduce maintenance cost in the idea the added overhead would only be small? If that's the case, and the overhead actually turns out to be not so small (6x according to the benchmark), then it's a tradeoff we need to make: do we find the speed worth it to maintain the cython implementation. |
This context is that this follows #41707, which did fix a bug in the Series constructor that didn't affect DataFrame construction because they had separate implementations. |
I opened #42248 to keep track of this regression. |
Co-authored-by: jbrockmendel <[email protected]>
This was the motivation for #41707. This will introduce some overhead, but will also prevent the behaviors from getting out of sync again.