Skip to content

BUG: value_counts not working correctly on ExtensionArrays #33674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 1, 2020

Conversation

kotamatsuoka
Copy link
Contributor

@kotamatsuoka kotamatsuoka commented Apr 20, 2020

@simonjayhawkins simonjayhawkins added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 20, 2020
@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from 949ef4f to ea5f7c4 Compare April 20, 2020 14:56
@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from ea5f7c4 to b025232 Compare April 20, 2020 15:35
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note as well, you can put it in numeric bug fix section

@simonjayhawkins
Copy link
Member

This could be an opportunity to use the more performant sum function from core\array_algos\masked_reductions.py and add a sum method to IntegerArray (or maybe BaseMaskedArray) There is a precedent for adding sum methods to extension arrays (timedeltas and sparse) but not sure how this fits in with the EA interface. @jorisvandenbossche?

@jreback jreback added this to the 1.1 milestone Apr 21, 2020
@pep8speaks
Copy link

pep8speaks commented Apr 21, 2020

Hello @kotamatsuoka! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-30 23:45:49 UTC

@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from c6c4751 to 67c83e9 Compare April 21, 2020 13:57
@kotamatsuoka kotamatsuoka requested a review from jreback April 21, 2020 13:58
@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch 2 times, most recently from 957d105 to 4235020 Compare April 21, 2020 16:32
@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from 4235020 to 7007205 Compare April 22, 2020 14:07
@dsaxton
Copy link
Member

dsaxton commented Apr 23, 2020

This could be an opportunity to use the more performant sum function from core\array_algos\masked_reductions.py and add a sum method to IntegerArray (or maybe BaseMaskedArray) There is a precedent for adding sum methods to extension arrays (timedeltas and sparse) but not sure how this fits in with the EA interface. @jorisvandenbossche?

@simonjayhawkins That's the approach I was taking here although am not sure at the moment how best to implement this: #33538

@simonjayhawkins
Copy link
Member

hmm, didn't realise that this PR overlaps with #33538.

@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from 2d857d1 to dac1af5 Compare April 24, 2020 21:53
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kotamatsuoka for working on this.

the bug is fixed on master, see #33172 (comment)

just needs tests.

@simonjayhawkins simonjayhawkins added Testing pandas testing functions or related to the test suite and removed Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 25, 2020
@kotamatsuoka kotamatsuoka requested a review from jreback April 26, 2020 06:42
@@ -290,7 +290,7 @@ def _reduce(self, name, skipna=True, **kwargs):
def value_counts(self, dropna=False):
from pandas import value_counts

return value_counts(self._ndarray, dropna=dropna).astype("Int64")
return value_counts(self._ndarray, dropna=dropna)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this back to what it was on master (will need to change the string tests back as well)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If revert, value_counts method return different dtype and need to change many tests. (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsaxton why/where were these changed to Int64 in the first place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it was added here #30824

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback this is converted to Int64 on purpose, in the idea that operations on nullable columns should, to the extent possible, also result in new nullable dtypes (similarly as doing a comparison operation with string dtype will give a nullable boolean dtype)

@@ -290,7 +290,7 @@ def _reduce(self, name, skipna=True, **kwargs):
def value_counts(self, dropna=False):
from pandas import value_counts

return value_counts(self._ndarray, dropna=dropna).astype("Int64")
return value_counts(self._ndarray, dropna=dropna)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsaxton why/where were these changed to Int64 in the first place?

@jreback
Copy link
Contributor

jreback commented Apr 26, 2020

this also needs a whatsnew note, add in bug fixes, extension array section in 1.1

@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from 4e283ff to badbaa4 Compare April 27, 2020 13:59
@@ -199,6 +199,10 @@ class TestMethods(BaseNumPyTests, base.BaseMethodsTests):
def test_value_counts(self, all_data, dropna):
pass

@pytest.mark.skip(reason="value_counts not implemented")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be skip or xfail?

@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from badbaa4 to bf91882 Compare April 27, 2020 15:16
@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from bf91882 to 67ee980 Compare April 28, 2020 11:20
@jorisvandenbossche
Copy link
Member

@kotamatsuoka the base test that you added is now being skipped for several ExtensionArray classes (which is fine, to be clear), but I think that we should then test it specifically for at least some of those classes with a custom test (I would at least test it for IntegerArray and BooleanArray).

So instead of xfailing or skipping the test in test_boolean.py and test_integer.py, I would actually have a custom version of the test there (that has a different expected value, to match the behaviour).

(or is this already tested elsewhere?)

@kotamatsuoka
Copy link
Contributor Author

Thanks @jorisvandenbossche .
I updated test_value_counts_with_normalize(), so xfailing or skipping the test in test_boolean.py , test_integer.py and test_sparse.py are no longer needed。

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ping on green.

@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch 5 times, most recently from cdc2f09 to 3473a42 Compare April 30, 2020 23:11
@kotamatsuoka kotamatsuoka force-pushed the normalize-value-counts branch from 3473a42 to 5d54a05 Compare April 30, 2020 23:45
@jreback jreback merged commit c00293d into pandas-dev:master May 1, 2020
@jreback
Copy link
Contributor

jreback commented May 1, 2020

thanks @kotamatsuoka

@kotamatsuoka kotamatsuoka deleted the normalize-value-counts branch May 1, 2020 00:58
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

value_counts not working correctly on (some?) ExtensionArrays
7 participants