Skip to content

REF: avoid broadcasting of Series to DataFrame in ops for ArrayManager #40482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

xref #39772

Currently, we broadcast a Series to a DataFrame for elementwise operations in the "op(df, series)" case. This is done (I think) to be able to perform the operation block-wise (which is only implemented for block/block case, and not for block/array).
But, when using the ArrayManager, this broadcasting is 1) not necessary (since we perform column-by-column anyway) and 2) costly.

@jorisvandenbossche jorisvandenbossche added Refactor Internal refactoring of code Numeric Operations Arithmetic, Comparison, and Logical operations Internals Related to non-user accessible pandas implementation labels Mar 17, 2021
@jbrockmendel
Copy link
Member

BTW i think the array manager build can be simplified with something like pytest pandas/tests/[abcdfglpstuw]*. Getting close!

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 31, 2021

I am running into the following issue:

In [10]: np.array([1, 2]) * pd.NaT
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-0274fc287e43> in <module>
----> 1 np.array([1, 2]) * pd.NaT

TypeError: unsupported operand type(s) for *: 'numpy.ndarray' and 'NaTType'

while

In [12]: np.array([1, 2]) * np.timedelta64("NaT", "ns")
Out[12]: array(['NaT', 'NaT'], dtype='timedelta64[ns]')

This happens if you do eg df * ser where the DataFrame has numerical columns, and the Series has timedelta64 dtype with NaTs.
I suppose this will also be a problem for the untyped pd.NA

(from pandas/tests/frame/test_arithmetic.py::TestFrameArithmetic::test_td64_op_nat_casting)

@@ -6675,6 +6678,9 @@ def _dispatch_frame_op(self, right, func: Callable, axis: Optional[int] = None):

right = right._values

if isinstance(right, TimedeltaArray):
right = right._data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this is AM-only right?
  2. wont the array_op call just re-wrap this?
  3. if not, will something like DataFrame[int] + Series[timedelta64ns] fail to raise because of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. if this is a hack as the commit message suggests, then a comment would be worthwhile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #40482 (comment) for the general explanation of in which case this happens. It was not my intention to keep this, I am hoping to find a better solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah I saw that, thought about commenting, but decided that "last time i handled that by reshaping to 2D" wouldn't be something you'd find that helpful.

if isna(y):
mask = np.zeros(x.size, dtype=bool)
else:
mask = notna(xrav)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just an optimization, independent of the broadcasting thing?

@@ -291,6 +294,7 @@ def na_logical_op(x: np.ndarray, y, op):
y = ensure_object(y)
result = libops.vec_binop(x.ravel(), y.ravel(), op)
else:
x = ensure_object(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're doing this both here and on L293, then might as well do it just once after L289

Do we have cases that get here with x not object dtype?

@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 9, 2021
@mroeschke
Copy link
Member

Appears this PR has been dormant for a while so closing. Feel free to reopen when you have time to work on this PR.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

@jorisvandenbossche if you can rebase

@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Numeric Operations Arithmetic, Comparison, and Logical operations Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants