Skip to content

ENH: Implement IntegerArray.sum #33538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 25, 2020
Merged

ENH: Implement IntegerArray.sum #33538

merged 8 commits into from
Apr 25, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Apr 14, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I think this is mostly interesting in that it allows normalize=True for value_counts on an IntegerArray backed Series, which currently doesn't work:

[ins] In [1]: s = pd.Series([1, 2, 3], dtype="Int64")

[ins] In [2]: s.value_counts(normalize=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-2bf1a78353e5> in <module>
----> 1 s.value_counts(normalize=True)

~/pandas/pandas/core/base.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1252             normalize=normalize,
   1253             bins=bins,
-> 1254             dropna=dropna,
   1255         )
   1256         return result

~/pandas/pandas/core/algorithms.py in value_counts(values, sort, ascending, normalize, bins, dropna)
    725
    726     if normalize:
--> 727         result = result / float(counts.sum())
    728
    729     return result

AttributeError: 'IntegerArray' object has no attribute 'sum'

@@ -573,6 +573,12 @@ def _reduce(self, name: str, skipna: bool = True, **kwargs):

return result

def sum(self, skipna: bool = True, min_count: int = 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should make the signature match PandasArray.sum etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, and we still want to keep axis (even though it wouldn't be used here)?

@@ -113,6 +113,17 @@ def test_value_counts_empty():
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("skipna", [True, False])
@pytest.mark.parametrize("min_count", [0, 4])
def test_integer_array_sum(skipna, min_count):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is e.g. Series[Int64].sum() or DataFrame[Int64].sum() fixed by this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are actually already working, just not for IntegerArray specifically

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the correct way to fix this. series/frames already dispatch to EAs via _reduce, which DO implement sum/prod/min/max. The issue is these functions need to be exposed via that dispatch mechanism, rather than writing more code to actually do the reduction.

@jreback
Copy link
Contributor

jreback commented Apr 15, 2020

furthermore the numpy_ EA do this in an opposite way, meaning that the ops themselves are defined, and _reduce dispatches TO them.

So we should decide on a single way forward here. I think the way integer array does this is better, e.g. in _reduce.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations labels Apr 15, 2020
@jorisvandenbossche
Copy link
Member

As mentioned in #33351 (comment), we should probably have a general discussion about how to deal with those reduction methods in our EAs, otherwise those comments like Jeff's concern above (#33538) are going to keep coming up in each PR that adds something like this.

dtype=None,
out=None,
keepdims=False,
initial=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove all those unnecessary keywords and put them in a **kwargs ?

@jreback jreback added this to the 1.1 milestone Apr 24, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @jorisvandenbossche agree we agree on the issues you mentioned, but happy with this for now.

@jorisvandenbossche jorisvandenbossche merged commit 77a0f19 into pandas-dev:master Apr 25, 2020
@jorisvandenbossche
Copy link
Member

thanks @dsaxton !

@jorisvandenbossche
Copy link
Member

@dsaxton would you like to open a general issue about this? (#33538 (comment))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants