-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add prod to masked_reductions #33442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Added a small comment
else: | ||
subset = values[~mask] | ||
if subset.size: | ||
if not mask.all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a mask.all()
might on average be expensive here, since in most cases you will actually need the subset afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I see what you're saying. I'll switch that back
Can you also add prod to the existing whatsnew note about this? |
Hmm, looks like we are hitting an overflow issue in /tests/extension/base/reduce.test_reduce_series. This is interesting because I think the current implementation is only passing this test "by accident" since the input gets cast to float64 whenever it contains NA values (without this casting we would still get an overflow, which would be the case if the test input happened not to have any missing data): https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/integer.py#L571 Is this test actually correct (i.e., if an integer input would overflow, do we "expect" the output we would get if it was first cast as float)? https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py#L19 |
@dsaxton shot in the dark maybe algorithms.checked_add_with_arr has something useful? |
I suppose this is just the test that was not correct. Also for int64 dtype (default numpy one), we let it silently overflow instead of casting to float: In [1]: a = np.arange(1, 101)
In [2]: pd.Series(a).prod()
Out[2]: 0
# no missing values -> use int algo from numpy
In [3]: pd.Series(a, dtype="Int64").prod()
Out[3]: 0
# on this branch, also this returns 0
In [4]: pd.Series(list(a) + [None], dtype="Int64").prod()
Out[4]: 93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248 So certainly having it differ depending on the presence of missing values or not as the current implementation gives is clearly a bug. |
pandas/core/arrays/boolean.py
Outdated
result = op(data, mask, skipna=skipna, **kwargs) | ||
|
||
# if we have numeric op that would result in an int, coerce to int if possible | ||
if name == "prod" and notna(result): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you need to add this back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was getting a failure on some builds for \tests\arrays\boolean\test_reduction.py where we're checking that the product has int dtype so it might be needed:
elif op == "prod":
> assert isinstance(getattr(s, op)(), np.int64)
E AssertionError: assert False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, then maybe the test needs to be edited. In any case, we should investigate why it is failing / what we should be expecting.
From a quick test, it seems that numpy returns int64:
In [6]: np.array([], dtype="bool").prod()
Out[6]: 1
In [7]: type(_)
Out[7]: numpy.int64
In [8]: np.array([True, False], dtype="bool").prod()
Out[8]: 0
In [9]: type(_)
Out[9]: numpy.int64
So I think we should follow that behaviour here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell it was something that was only affecting Windows builds: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=32986&view=logs&jobId=077026cf-93c0-54aa-45e0-9996ba75f6f7&j=077026cf-93c0-54aa-45e0-9996ba75f6f7&t=e95cf409-86ae-5b4d-6c5f-79395ef75e8f
pandas/core/arrays/boolean.py
Outdated
op = getattr(masked_reductions, name) | ||
return op(data, mask, skipna=skipna, **kwargs) | ||
result = op(data, mask, skipna=skipna, **kwargs) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you try using maybe_cast_result_dtype
here (you will have to add the prod operation in side that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, that should not be needed. Numpy should give the desired result for booleans.
pandas/core/arrays/boolean.py
Outdated
result = op(data, mask, skipna=skipna, **kwargs) | ||
dtype = maybe_cast_result_dtype(dtype=data.dtype, how=name) | ||
if notna(result) and (dtype != result.dtype): | ||
result = result.astype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not sure we actually need to do this. We could also choose to follow numpy's behaviour to return platform int (any idea what we do for "bool" dtype?)
BTW, this should also change the result for sum
, so this was not tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I think you're right and it was actually the test that needed to change here (np.int64 -> np.int_)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in any case that's what I did for sum, I see now (so if we decide on following numpy vs always returning int64, we should do it for both sum and prod)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(but I am fine with following numpy)
thanks @dsaxton |
Adding prod to /core/array_algos/masked_reductions.py and using them for IntegerArray and BooleanArray. This also seems to offer a decent speedup over nanops like the other reductions: