Skip to content

ENH: Support mask in groupby cumprod #48138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 19, 2022
31 changes: 28 additions & 3 deletions doc/source/whatsnew/v1.6.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,32 @@ These are bug fixes that might have notable behavior changes.

.. _whatsnew_160.notable_bug_fixes.notable_bug_fix1:

notable_bug_fix1
^^^^^^^^^^^^^^^^
:meth:`.GroupBy.cumsum` and :meth:`.GroupBy.cumprod` overflow instead of lossy casting to float
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions we cast to float when applying ``cumsum`` and ``cumprod`` which
lead to incorrect results even if the result could be hold by ``int64`` dtype.
Additionally, the aggregation overflows consistent with numpy when the limit of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe mention that it is making it consistent with the DataFrame method as well? (without groupby)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a reference to the methods

``int64`` is reached.

*Old Behavior*

.. code-block:: ipython

In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16

We return incorrect results with the 6th value.

*New Behavior*

.. ipython:: python

df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
df.groupby("key")["value"].cumprod()

We overflow with the 7th value, but the 6th value is still correct.

.. _whatsnew_160.notable_bug_fixes.notable_bug_fix2:

Expand Down Expand Up @@ -102,7 +126,8 @@ Deprecations

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
- Performance improvement in :meth:`.GroupBy.cumprod` for extension array dtypes (:issue:`37493`)
- Performance improvement in :meth:`.GroupBy.median` and :meth:`.GroupBy.cumprod` for nullable dtypes (:issue:`37493`)
- Performance improvement in :meth:`MultiIndex.argsort` and :meth:`MultiIndex.sort_values` (:issue:`48406`)
- Performance improvement in :meth:`.GroupBy.mean` and :meth:`.GroupBy.var` for extension array dtypes (:issue:`37493`)
- Performance improvement for :meth:`Series.value_counts` with nullable dtype (:issue:`48338`)
- Performance improvement for :class:`Series` constructor passing integer numpy array with nullable dtype (:issue:`48338`)
Expand Down
7 changes: 0 additions & 7 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -226,13 +226,6 @@ def group_median_float64(
ptr += size


ctypedef fused int64float_t:
int64_t
uint64_t
float32_t
float64_t


@cython.boundscheck(False)
@cython.wraparound(False)
def group_cumprod(
Expand Down
19 changes: 17 additions & 2 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -643,13 +643,28 @@ def test_groupby_cumprod():
df = DataFrame({"key": ["b"] * 100, "value": 2})
df["value"] = df["value"].astype(float)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe keep this with as int (or test both in addition), so we have a test for the silent overflow behaviour?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new test explicitly testing that overflow is consistent with numpy

actual = df.groupby("key")["value"].cumprod()
# if overflows, groupby product casts to float
# while numpy passes back invalid values
expected = df.groupby("key", group_keys=False)["value"].apply(lambda x: x.cumprod())
expected.name = "value"
tm.assert_series_equal(actual, expected)


def test_groupby_cumprod_overflow():
# GH#37493 if we overflow we return garbage consistent with numpy
df = DataFrame({"key": ["b"] * 4, "value": 100_000})
actual = df.groupby("key")["value"].cumprod()
expected = Series(
[100_000, 10_000_000_000, 1_000_000_000_000_000, 7766279631452241920],
name="value",
)
tm.assert_series_equal(actual, expected)

numpy_result = df.groupby("key", group_keys=False)["value"].apply(
lambda x: x.cumprod()
)
numpy_result.name = "value"
tm.assert_series_equal(actual, numpy_result)


def test_groupby_cumprod_nan_influences_other_columns():
# GH#48064
df = DataFrame(
Expand Down