ENH: Add prod to masked_reductions #33442

dsaxton · 2020-04-09T20:14:40Z

Adding prod to /core/array_algos/masked_reductions.py and using it for IntegerArray and BooleanArray. As with the other reductions, this also seems to offer a decent speedup over nanops:

# Branch

[ins] In [3]: %timeit arr.prod()  # arr = pd.Series([None, 0, 1, 2] * 10_000, dtype="Int64")
102 µs ± 379 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

[ins] In [5]: %timeit arr.prod()  # arr = pd.Series([0, 0, 1, 2] * 10_000, dtype="Int64")
82.6 µs ± 752 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Master

[ins] In [4]: %timeit arr.prod()  # arr = pd.Series([None, 0, 1, 2] * 10_000, dtype="Int64")
291 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

[ins] In [6]: %timeit arr.prod()  # arr = pd.Series([0, 0, 1, 2] * 10_000, dtype="Int64")
78.6 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
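For readers unfamiliar with the masked approach, here is a minimal sketch of the idea, not the code in this PR: the values and a boolean mask are kept as separate NumPy arrays, and the product is taken over the unmasked subset. The function name `masked_prod` and the use of `None` as a stand-in for `pd.NA` are assumptions made for illustration.

```python
import numpy as np


def masked_prod(values, mask, skipna=True):
    """Product over entries where mask is False (True marks a missing value).

    Hypothetical helper for illustration; None stands in for pd.NA.
    """
    if not skipna and mask.any():
        # A missing value propagates when it is not skipped.
        return None
    # Boolean indexing keeps only the valid entries before reducing.
    return values[~mask].prod()


values = np.array([7, 3, 1, 2], dtype=np.int64)
mask = np.array([True, False, False, False])

skipped = masked_prod(values, mask)                 # product of 3, 1, 2
propagated = masked_prod(values, mask, skipna=False)
all_masked = masked_prod(values, np.ones(4, dtype=bool))
```

Note that NumPy's product of an empty selection is 1, so `all_masked` is 1 in this sketch; how the real reduction treats that case (for example via `min_count`) is not covered here.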

jorisvandenbossche

Thanks!
Added a small comment

jorisvandenbossche · 2020-04-09T20:23:04Z

pandas/core/array_algos/masked_reductions.py

    else:
-        subset = values[~mask]
-        if subset.size:
+        if not mask.all():


Doing a mask.all() might on average be expensive here, since in most cases you will actually need the subset afterwards

Ah yes, I see what you're saying. I'll switch that back
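A small sketch of the trade-off discussed above, with hypothetical helper names: both checks answer "is there at least one valid value?", but the subset-based check produces the array that the reduction needs anyway, whereas `mask.all()` is an extra pass over the mask.

```python
import numpy as np


def has_valid_via_mask(mask):
    # Extra full pass over the mask; the subset is still needed afterwards.
    return not mask.all()


def has_valid_via_subset(values, mask):
    # The subset is computed once and can be reused for the reduction.
    subset = values[~mask]
    return subset.size > 0, subset


values = np.array([0, 1, 2, 3], dtype=np.int64)
partly_masked = np.array([True, False, False, True])
fully_masked = np.ones(4, dtype=bool)

ok_partial, subset_partial = has_valid_via_subset(values, partly_masked)
ok_full, subset_full = has_valid_via_subset(values, fully_masked)
```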

jorisvandenbossche · 2020-04-09T20:26:15Z

Can you also add prod to the existing whatsnew note about this?

dsaxton · 2020-04-09T23:40:30Z

Hmm, looks like we are hitting an overflow issue in /tests/extension/base/reduce.test_reduce_series. This is interesting because I think the current implementation only passes this test "by accident": the input gets cast to float64 whenever it contains NA values. Without this cast we would still get an overflow, which is what would happen if the test input happened not to have any missing data: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/integer.py#L571

Is this test actually correct (i.e., if an integer input would overflow, do we "expect" the output we would get if it were first cast to float)? https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py#L19
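To make the question concrete, here is a small NumPy-only sketch (an illustration, not the test in question): the product 1 * 2 * ... * 21 no longer fits in int64, so the integer reduction wraps around silently, while the same data cast to float64 gives the mathematically expected value.

```python
import math

import numpy as np

values = np.arange(1, 22, dtype=np.int64)  # 1..21; 21! exceeds the int64 range

wrapped = values.prod()                       # integer reduction, wraps silently
as_float = values.astype(np.float64).prod()   # float reduction, no wraparound
exact = math.factorial(21)                    # exact Python integer for reference

still_fits = values[:20].prod()               # 20! still fits in int64
```

So a test that compares the integer result against a float-based expectation can only pass if something casts the input to float first.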

jbrockmendel · 2020-04-09T23:55:53Z

@dsaxton shot in the dark maybe algorithms.checked_add_with_arr has something useful?

jorisvandenbossche · 2020-04-10T07:42:25Z

I suppose this is just the test that was not correct. Also for int64 dtype (default numpy one), we let it silently overflow instead of casting to float:

In [1]: a = np.arange(1, 101)  

In [2]: pd.Series(a).prod()   
Out[2]: 0

# no missing values -> use int algo from numpy
In [3]: pd.Series(a, dtype="Int64").prod()     
Out[3]: 0

# on this branch, also this returns 0
In [4]: pd.Series(list(a) + [None], dtype="Int64").prod()     
Out[4]: 93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248

So having the result differ depending on whether missing values are present, as the current implementation does, is clearly a bug.

doc/source/whatsnew/v1.1.0.rst

pandas/core/array_algos/masked_reductions.py

pep8speaks · 2020-04-10T13:54:26Z

Hello @dsaxton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-11 16:42:43 UTC

pandas/tests/extension/test_integer.py

jorisvandenbossche · 2020-04-10T13:58:04Z

pandas/core/arrays/boolean.py

+            result = op(data, mask, skipna=skipna, **kwargs)
+
+            # if we have numeric op that would result in an int, coerce to int if possible
+            if name == "prod" and notna(result):


why did you need to add this back?

I was getting a failure on some builds for \tests\arrays\boolean\test_reduction.py where we're checking that the product has int dtype so it might be needed:

        elif op == "prod":
>           assert isinstance(getattr(s, op)(), np.int64)
E           AssertionError: assert False

Hmm, then maybe the test needs to be edited. In any case, we should investigate why it is failing / what we should be expecting.

From a quick test, it seems that numpy returns int64:

In [6]: np.array([], dtype="bool").prod()
Out[6]: 1

In [7]: type(_)
Out[7]: numpy.int64

In [8]: np.array([True, False], dtype="bool").prod()
Out[8]: 0

In [9]: type(_)
Out[9]: numpy.int64

So I think we should follow that behaviour here

From what I can tell it was something that was only affecting Windows builds: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=32986&view=logs&jobId=077026cf-93c0-54aa-45e0-9996ba75f6f7&j=077026cf-93c0-54aa-45e0-9996ba75f6f7&t=e95cf409-86ae-5b4d-6c5f-79395ef75e8f

jreback · 2020-04-10T17:11:13Z

pandas/core/arrays/boolean.py

            op = getattr(masked_reductions, name)
-            return op(data, mask, skipna=skipna, **kwargs)
+            result = op(data, mask, skipna=skipna, **kwargs)
+


can you try using maybe_cast_result_dtype here (you will have to add the prod operation inside that)

As far as I understand, that should not be needed. Numpy should give the desired result for booleans.

jorisvandenbossche · 2020-04-10T19:10:16Z

pandas/core/arrays/boolean.py

+            result = op(data, mask, skipna=skipna, **kwargs)
+            dtype = maybe_cast_result_dtype(dtype=data.dtype, how=name)
+            if notna(result) and (dtype != result.dtype):
+                result = result.astype(dtype)


I am still not sure we actually need to do this. We could also choose to follow numpy's behaviour to return platform int (any idea what we do for "bool" dtype?)

BTW, this should also change the result for sum, so this was not tested?

@jorisvandenbossche I think you're right and it was actually the test that needed to change here (np.int64 -> np.int_)

Yeah, in any case that's what I did for sum, I see now (so if we decide on following numpy vs always returning int64, we should do it for both sum and prod)

(but I am fine with following numpy)
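As a quick illustration of what "following numpy" means for the test change (np.int64 -> np.int_), the sketch below checks the result type of a boolean product against NumPy's default platform integer. It only uses NumPy and makes no claim about the pandas test itself.

```python
import numpy as np

result = np.array([True, False], dtype=bool).prod()
empty = np.array([], dtype=bool).prod()

# np.int_ is NumPy's default integer type; its width depends on the platform
# and NumPy version, which is why a hard-coded np.int64 check can fail on
# some builds.
is_platform_int = isinstance(result, np.int_)
```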

jreback · 2020-04-12T22:59:06Z

thanks @dsaxton

REF: Add prod to masked_reductions

3e1f06c

dsaxton requested a review from jorisvandenbossche April 9, 2020 20:14

jorisvandenbossche changed the title ~~REF: Add prod to masked_reductions~~ ENH: Add prod to masked_reductions Apr 9, 2020

jorisvandenbossche reviewed Apr 9, 2020

jorisvandenbossche added Enhancement, NA - MaskedArrays, Performance and removed Enhancement labels Apr 9, 2020

jorisvandenbossche added this to the 1.1 milestone Apr 9, 2020

Update

a7eb301

Change test

48c1432

jorisvandenbossche reviewed Apr 10, 2020

doc/source/whatsnew/v1.1.0.rst (outdated, resolved)

dsaxton added 2 commits April 10, 2020 08:53

Keep

c73d7b9

Merge remote-tracking branch 'upstream/master' into masked-prod

a429ced

jorisvandenbossche reviewed Apr 10, 2020

pandas/core/array_algos/masked_reductions.py (outdated, resolved)

jorisvandenbossche reviewed Apr 10, 2020

pandas/tests/extension/test_integer.py (outdated, resolved)

jorisvandenbossche reviewed Apr 10, 2020

dsaxton added 3 commits April 10, 2020 09:08

Update

b0c95fc

Move functions

c3e1763

Lint

6b66756

jreback requested changes Apr 10, 2020

dsaxton added 3 commits April 10, 2020 13:08

maybe_cast_result_dtype

58f7bd0

Lint

a2574df

Merge remote-tracking branch 'upstream/master' into masked-prod

474506b

jorisvandenbossche reviewed Apr 10, 2020

dsaxton added 5 commits April 11, 2020 11:16

Revert and change test

57551b8

Lint

1d25569

Merge remote-tracking branch 'upstream/master' into masked-prod

de5954a

Remove

2701fff

Lint

8321945

jorisvandenbossche approved these changes Apr 11, 2020

jreback approved these changes Apr 12, 2020

jreback merged commit 6658d89 into pandas-dev:master Apr 12, 2020

dsaxton deleted the masked-prod branch April 12, 2020 23:21
