Skip to content

PERF: Cythonize from_nested_dict #33485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
19 changes: 19 additions & 0 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import collections
from collections import abc
from decimal import Decimal
from fractions import Fraction
Expand Down Expand Up @@ -2526,3 +2527,21 @@ def fast_multiget(dict mapping, ndarray keys, default=np.nan):
output[i] = default

return maybe_convert_objects(output)


@cython.wraparound(False)
@cython.boundscheck(False)
def from_nested_dict(object data) -> dict:
cdef:
object new_data = collections.defaultdict(dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you set new_data to type dict new_data and make related changes below (use dict.setdefault etc.)? I think that should make fewer calls into Python, making this faster.

object index, column, value, dict_iterator
dict data_dct, nested_dict

data_dct = dict(data)

for index, dict_iterator in data_dct.items():
nested_dict = dict(dict_iterator)
Copy link
Contributor

@topper-123 topper-123 Apr 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dict_iterator is a Series, this conversion will be slow. Probably best to just accept the user's choice IMO and accept a slowdown if dict_iterator is not a dict...

for column, value in nested_dict.items():
new_data[column][index] = value

return dict(new_data)
11 changes: 1 addition & 10 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra
if len(data) > 0:
# TODO speed up Series case
if isinstance(list(data.values())[0], (Series, dict)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your PR, but this line above is potentially very expensive. Can you change it to:

first_val = next(iter((data.values())), None)
if isinstance(first_val, (Series, dict)):

to avoid creating that list. Does this make a difference in your ASVs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! the ASV results are even better:

       before           after         ratio
     [c6c53671]       [bcb25b91]
     <master>         <TODO-cythonize>
-      1.03±0.1ms          748±6μs     0.72  series_methods.NanOps.time_func('sum', 1000000, 'int8')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

data = _from_nested_dict(data)
data = lib.from_nested_dict(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the above to

data = dict(data) if not type(data) is dict else data  # convert OrderedDict
data = lib.from_nested_dict(data)

you can change the function interface in lib to def from_nested_dict(dict data) -> dict:, which will simplify things and make it faster too.

else:
data, index = list(data.values()), list(data.keys())
elif orient == "columns":
Expand Down Expand Up @@ -8817,12 +8817,3 @@ def isin(self, values) -> "DataFrame":

ops.add_flex_arithmetic_methods(DataFrame)
ops.add_special_arithmetic_methods(DataFrame)


def _from_nested_dict(data):
# TODO: this should be seriously cythonized
new_data = collections.defaultdict(dict)
for index, s in data.items():
for col, v in s.items():
new_data[col][index] = v
return new_data