Skip to content

PERF/REF: Check use of numexpr earlier in the DataFrame operation #41122

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

This PR tries to gather some checks about whether numexpr can be used or not which can be done upfront on the DataFrame level, before dispatching to the array_ops.
For example, for a scalar right operand, we can check it only once if it's compatible with numexpr usage. Or for ArrayManager, we can check the size constraint once upfront (since each array has the same size). For those cases, this avoids overhead in expressions.evaluate.

To this end, I created a new can_use_numexpr that starts to gather some checks (there might be more that can be added), call this up front in DataFrame._dispatch_frame_op (this could IMO be bit cleaner with #39772, but the if/else checks inside _dispatch_frame_op work as well). And then the result of can_use_numexpr is passed through to the array_ops.

@jorisvandenbossche jorisvandenbossche added Refactor Internal refactoring of code Numeric Operations Arithmetic, Comparison, and Logical operations labels Apr 23, 2021
@jorisvandenbossche
Copy link
Member Author

There failing tests related to warning/error not being raised for boolean arithmetic ops, but (more easily) fixing those tests depends on #41161.
And I would also first like to see #40463 merged for better test coverage.

----------
op : operator
size : int
dtypes : list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list[DtypeObj]?

else:
use_numexpr = expressions.can_use_numexpr(func, None, None, right)

array_op = ops.get_array_op(func, use_numexpr=use_numexpr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the get_array_op on L6832 still needed?

@@ -6866,6 +6886,17 @@ def _dispatch_frame_op(self, right, func: Callable, axis: int | None = None):
# maybe_align_as_frame ensures we do not have an ndarray here
assert not isinstance(right, np.ndarray)

if isinstance(self._mgr, ArrayManager):
use_numexpr = expressions.can_use_numexpr(
func, self.shape[1], (right.dtype,), None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does self.shape[1] correspond to the size of some array?

else:
use_numexpr = expressions.can_use_numexpr(
func, None, (right.dtype,), None
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these ~8 lines (x4) seem like a reasonable cope for an AM/BM method

@@ -135,7 +135,7 @@ def _masked_arith_op(x: np.ndarray, y, op):
return result


def _na_arithmetic_op(left, right, op, is_cmp: bool = False):
def _na_arithmetic_op(left, right, op, is_cmp: bool = False, use_numexpr=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annotate, docstring

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use_numexpr be keyword-only

@jbrockmendel
Copy link
Member

any estimates for perf impact?

@simonjayhawkins simonjayhawkins added the Performance Memory or execution speed performance label May 25, 2021
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 25, 2021
@simonjayhawkins
Copy link
Member

There failing tests related to warning/error not being raised for boolean arithmetic ops, but (more easily) fixing those tests depends on #41161.
And I would also first like to see #40463 merged for better test coverage.

#41161 and #40463 have been merged. can you address @jbrockmendel comments.

Parameters
----------
op : operator
size : int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int or None?

def _can_use_numexpr(op, op_str, a, b, dtype_check):
""" return a boolean if we WILL be using numexpr """
if op_str is not None:

# required min elements (otherwise we are adding overhead)
if a.size > _MIN_ELEMENTS:
if isinstance(b, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the place where this is moved from has a comment # can never use numexpr. not a huge loss since the comment isn't that informative, but at some point we should get a helpful comment about why it cant be used

@mroeschke
Copy link
Member

Thanks for the PR but appears to have gone stale. Closing to clear the queue, but feel free to reopen if you find the time to revisit this PR and the comments.

@mroeschke mroeschke closed this Nov 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance Refactor Internal refactoring of code Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants