Skip to content

PERF: operate on arrays instead of Series in DataFrame/DataFrame ops #33561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
34 changes: 16 additions & 18 deletions pandas/core/ops/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -310,23 +310,20 @@ def dispatch_to_series(left, right, func, str_rep=None, axis=None):
-------
DataFrame
"""
# Note: we use iloc to access columns for compat with cases
# with non-unique columns.
import pandas.core.computation.expressions as expressions
# Get the appropriate array-op to apply to each column/block's values.
array_op = get_array_op(func, str_rep=str_rep)

right = lib.item_from_zerodim(right)
if lib.is_scalar(right) or np.ndim(right) == 0:

# Get the appropriate array-op to apply to each block's values.
array_op = get_array_op(func, str_rep=str_rep)
bm = left._mgr.apply(array_op, right=right)
return type(left)(bm)

elif isinstance(right, ABCDataFrame):
assert right._indexed_same(left)

def column_op(a, b):
return {i: func(a.iloc[:, i], b.iloc[:, i]) for i in range(len(a.columns))}
arrays = []
for l, r in zip(left._iter_column_arrays(), right._iter_column_arrays()):
arrays.append(array_op(l, r))

elif isinstance(right, ABCSeries) and axis == "columns":
# We only get here if called via _combine_series_frame,
Expand All @@ -338,27 +335,28 @@ def column_op(a, b):
# Note: we do not do this unconditionally as it may be lossy or
# expensive for EA dtypes.
right = np.asarray(right)

def column_op(a, b):
return {i: func(a.iloc[:, i], b[i]) for i in range(len(a.columns))}

else:
right = right._values

def column_op(a, b):
return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt need to be in this PR, but since you're attention is already focused on this line: this can be incorrect if b is e.g. Categorical, since b.iloc[i] will have the scalar behavior instead of the Categorical (usually raising) behavior. (This is also a place where NaT causes issues, and the idea of only-one-NA causes me nightmares)

The main idea I've had here is to use b.iloc[[i]] instead of b.iloc[i], haven't gotten around to implementing it. Does this change make it easier to address the issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when using something like b.iloc[[i]], you would need to handle broadcasting here as well (since the array ops expect either same length array or scalar). So since that is a pre-existing issue, I would leave that for a dedicated PR.
Is there an issue for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think there's an issue for this. If I had found one, i would have added the Numeric label

arrays = []
for l, r in zip(left._iter_column_arrays(), right):
arrays.append(array_op(l, r))

elif isinstance(right, ABCSeries):
assert right.index.equals(left.index) # Handle other cases later
right = right._values

def column_op(a, b):
return {i: func(a.iloc[:, i], b) for i in range(len(a.columns))}
arrays = []
for l in left._iter_column_arrays():
arrays.append(array_op(l, right))

else:
# Remaining cases have less-obvious dispatch rules
raise NotImplementedError(right)

new_data = expressions.evaluate(column_op, str_rep, left, right)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have a perf impact? IIRC a while back I tried removing it because I thought it shouldn't, but found that it did

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, the get_array_op should take care of this?

For example, the arrays_ops.py::na_arithmetic_op also calls expressions.evaluate to perform the actual op.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was what i thought too, and I could be remembering incorrectly. But this is worth double-checking.

return new_data
return type(left)._from_arrays(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you can just return arrays here, since that will be passed to left._construct_result (though the docstring does say DataFrame, i guess thats wrong ATM)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to explicitly use _from_arrays, as that is part of the reason that makes this PR a perf improvement (the _from_arrays constructor specifically handles the case of a list of arrays corresponding to columns). Also, the default DataFrame constructor handles a list of arrays differently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

arrays, left.columns, left.index, verify_integrity=False
)


# -----------------------------------------------------------------------------
Expand Down