PERF: avoid creating many Series in apply_standard #34909


Merged (9 commits, Jun 25, 2020)
65 changes: 13 additions & 52 deletions pandas/core/apply.py
@@ -266,53 +266,6 @@ def apply_standard(self):
# partial result that may be returned from reduction
partial_result = None

# try to reduce first (by default)
# this only matters if the reduction in values is of different dtype
# e.g. if we want to apply to a SparseFrame, then can't directly reduce

# we cannot reduce using non-numpy dtypes,
# as demonstrated in gh-12244
if (
self.result_type in ["reduce", None]
and not self.dtypes.apply(is_extension_array_dtype).any()
# Disallow dtypes where setting _index_data will break
# ExtensionArray values, see GH#31182
and not self.dtypes.apply(lambda x: x.kind in ["m", "M"]).any()
# Disallow complex_internals since libreduction shortcut raises a TypeError
and not self.agg_axis._has_complex_internals
):

values = self.values
index = self.obj._get_axis(self.axis)
labels = self.agg_axis
empty_arr = np.empty(len(index), dtype=values.dtype)

# Preserve subclass for e.g. test_subclassed_apply
dummy = self.obj._constructor_sliced(
empty_arr, index=index, dtype=values.dtype
)

try:
result, reduction_success = libreduction.compute_reduction(
values, self.f, axis=self.axis, dummy=dummy, labels=labels
)
except TypeError:
# e.g. test_apply_ignore_failures we just ignore
if not self.ignore_failures:
raise
except ZeroDivisionError:
# reached via numexpr; fall back to python implementation
pass
else:
if reduction_success:
return self.obj._constructor_sliced(result, index=labels)

# no exceptions - however reduction was unsuccessful,
# use the computed function result for first element
partial_result = result[0]
if isinstance(partial_result, ABCSeries):
partial_result = partial_result.infer_objects()

# compute the result using the series generator,
# use the result computed while trying to reduce if available.
results, res_index = self.apply_series_generator(partial_result)
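The deleted block above implemented a common fast-path-with-fallback shape: attempt an optimized reduction, and on known failure modes fall back to the row-by-row Python path. A simplified, generic sketch of that control flow (the function names `apply_with_fallback`, `fast_reduce`, and `py_apply` are illustrative stand-ins, not pandas API, and the `ignore_failures` re-raise logic is omitted):

```python
def apply_with_fallback(values, func, fast_reduce, py_apply):
    # Try the optimized reduction first; on failure modes the
    # deleted code tolerated (TypeError, ZeroDivisionError via
    # numexpr), fall back to the Python implementation.
    try:
        result, ok = fast_reduce(values, func)
    except (TypeError, ZeroDivisionError):
        ok, result = False, None
    if ok:
        return result
    return py_apply(values, func)

# Trivial stand-ins for the fast and slow paths:
fast = lambda v, f: (sum(f(row) for row in v), True)
slow = lambda v, f: [f(row) for row in v]
print(apply_with_fallback([[1, 2], [3, 4]], sum, fast, slow))  # 10
```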
@@ -424,11 +377,19 @@ def apply_broadcast(self, target: "DataFrame") -> "DataFrame":

@property
def series_generator(self):
constructor = self.obj._constructor_sliced
return (
constructor(arr, index=self.columns, name=name)
for i, (arr, name) in enumerate(zip(self.values, self.index))
)
values = self.values
assert len(values) > 0

# We create one Series object, and will swap out the data inside
# of it. Kids: don't do this at home.
ser = self.obj._ixs(0, axis=0)
mgr = ser._mgr
blk = mgr.blocks[0]

for (arr, name) in zip(values, self.index):
Contributor: can you push this to an internals method instead?

Member (author): I'm looking at that now. The other place where this pattern could be really useful is in groupby.ops, but it's tougher there.

Contributor: Sure, exposing an API for this would also be OK (e.g. another internals method).

Member (author): I'm still troubleshooting the groupby.ops usage; I'd like to punt on making this an internals method for the time being.
blk.values = arr
ser.name = name
yield ser
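The reuse trick in the new `series_generator` can be sketched outside pandas internals. This is a minimal, hypothetical illustration (the names `RowView` and `row_generator` are invented for this example): one mutable view object is created up front and its backing data is swapped each iteration, instead of constructing a fresh object per row.

```python
import numpy as np

class RowView:
    """Invented stand-in for a Series whose backing data can be swapped."""
    def __init__(self, data, name=None):
        self.data = data
        self.name = name

    def sum(self):
        return int(self.data.sum())

def row_generator(values, index):
    # Create ONE view object and mutate it per row, mirroring the
    # single-Series reuse in series_generator above.
    view = RowView(values[0], name=index[0])
    for arr, name in zip(values, index):
        view.data = arr  # swap the backing data in place
        view.name = name
        yield view

values = np.arange(6).reshape(3, 2)
results = [row.sum() for row in row_generator(values, ["a", "b", "c"])]
print(results)  # [1, 5, 9]
```

Because every yielded object is the same instance, a caller must consume each row before advancing the generator; holding references to earlier rows would silently observe the last row's data. That is the caveat the "Kids: don't do this at home" comment is flagging.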

@property
def result_index(self) -> "Index":