BUG: BooleanArray.value_counts dropna #30824

TomAugspurger · 2020-01-08T21:49:10Z

jbrockmendel · 2020-01-08T21:55:07Z

LGTM pending green

jreback · 2020-01-09T02:31:43Z

pandas/tests/arrays/test_boolean.py

@@ -856,3 +856,10 @@ def test_arrow_roundtrip():
    result = table.to_pandas()
    assert isinstance(result["a"].dtype, pd.BooleanDtype)
    tm.assert_frame_equal(result, df)
+
+
+def test_value_counts_na():


is the result an object dtyped Series when dropna=True? (add a test as well)

In neither case (dropna True or False) is the result an object-dtype Series; it is always integer (they are just counts).

That said, should the result here rather be a nullable integer type? Not that there are nulls here, but in the light of "trying to return nullable types as much as possible from operations involving nullable types".
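To make the point about the counts dtype concrete, here is a minimal sketch, assuming a pandas version that includes the `dropna` fix from this PR. The sample values are made up for illustration. It checks only that the counts are an integer dtype for either `dropna` setting, without assuming whether that dtype is NumPy `int64` or nullable `Int64`.

```python
import pandas as pd

# Hypothetical sample data: two True, one False, one missing value.
values = pd.array([True, False, None, True], dtype="boolean")

# Counts with missing values excluded (the default) and included.
dropped = values.value_counts(dropna=True)
kept = values.value_counts(dropna=False)

# In both cases the counts are integers, never object dtype.
dropped_is_int = pd.api.types.is_integer_dtype(dropped.dtype)
kept_is_int = pd.api.types.is_integer_dtype(kept.dtype)

# dropna=False adds one extra entry for the missing value.
total_dropped = int(dropped.sum())
total_kept = int(kept.sum())

print(dropped.dtype, kept.dtype)
print(total_dropped, total_kept)
```

The only difference between the two calls is whether the missing value gets its own entry in the result.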

hmm, yeah i think we should just move this to return a nullable integer (as this is new api). will promote consistency in the future.

Hmm, OK, will update these.

And so an API-breaking change for IntegerArray.value_counts to return a nullable int dtype too?

Can you expand on why you find it weird?

It's true that the result of a value_counts will never contain NAs, but returning a nullable int type means that if NAs are reintroduced by a subsequent operation, the result is not converted to float.
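A small illustration of that argument, as a sketch rather than anything taken from the PR: the two Series below stand in for count results with a NumPy and a nullable dtype, and `reindex` is just one example of an operation that can reintroduce missing values. The labels are hypothetical.

```python
import pandas as pd

# Stand-ins for value_counts results: same counts, different dtypes.
numpy_counts = pd.Series([2, 1], index=["a", "b"], dtype="int64")
nullable_counts = pd.Series([2, 1], index=["a", "b"], dtype="Int64")

# Reindexing with a label that has no count introduces a missing value.
labels = ["a", "b", "c"]
numpy_reindexed = numpy_counts.reindex(labels)
nullable_reindexed = nullable_counts.reindex(labels)

# The NumPy-backed counts are upcast to float to hold NaN,
# while the nullable counts stay integer and hold pd.NA.
print(numpy_reindexed.dtype)
print(nullable_reindexed.dtype)
```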

Can you expand on why you find it weird?

The motivation is to maintain consistency of "operations with nullable types return nullable types". But making value_counts().values return IntNA breaks the consistency of "value_counts().values is always np.int64". So it's a wash on "maintaining consistency".

Ideally we'd retain the dtype in the value_counts().index, and it seems like we're saying here "well we can't do that, so let's shoehorn the dtype into the values"

it seems like we're saying here "well we can't do that, so let's shoehorn the dtype into the values"

No, I don't think we're saying that. I think we're saying we find a nullable integer dtype to be more useful.

Not a hill I want to die on.

commented below

TomAugspurger · 2020-01-09T14:54:59Z

OK, updated {String,Integer,Boolean}Array.value_counts to return a nullable int64 dtype.

I moved BooleanArray.value_counts and IntegerArray.value_counts to the base class.
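For readers who want to check the updated behavior, a rough sketch follows. It assumes a pandas version that includes this change, and the sample data is invented. It calls `value_counts` on each of the three array types and records the dtype of the resulting counts.

```python
import pandas as pd

# Hypothetical sample arrays, one per nullable array type touched by the PR.
arrays = {
    "Int64": pd.array([2, 1, 1, None], dtype="Int64"),
    "boolean": pd.array([True, False, None, True], dtype="boolean"),
    "string": pd.array(["a", "b", "a", None], dtype="string"),
}

# Record the dtype of the counts for each array type.
count_dtypes = {
    name: str(arr.value_counts().dtype) for name, arr in arrays.items()
}

for name, dtype in count_dtypes.items():
    print(f"{name} -> {dtype}")
```

With this change applied, the integer and boolean cases are expected to report `Int64` rather than `int64`. The string case is printed for comparison only, since its result may depend on the configured string storage.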

jorisvandenbossche

Minor comment, looks good for the rest

jorisvandenbossche · 2020-01-09T15:22:29Z

doc/source/whatsnew/v1.0.0.rst

+.. ipython:: python
+
+   >>> pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
+   Int64Dtype()


Don't show the output (or >>> prompt), or otherwise also make it a code-block

Whoops, thanks.

jreback · 2020-01-09T17:39:04Z

let me clarify my comments

i think the index should be nullable ints as that preserves the intent here
but agree the values should be int64 - this is a count after all

TomAugspurger · 2020-01-09T17:47:41Z

Thanks for clarifying.

The index can't be nullable yet. @jorisvandenbossche thoughts on the values dtype?

jreback · 2020-01-09T17:52:15Z

doc/source/whatsnew/v1.0.0.rst

@@ -411,6 +411,24 @@ Use :meth:`arrays.IntegerArray.to_numpy` with an explicit ``na_value`` instead.

   a.to_numpy(dtype="float", na_value=np.nan)

+**value_counts returns a nullable integer dtype**
+


actually, even though it is a breaking change, it puts nullable dtypes in more general usage. so actually i think this is a good change.

jorisvandenbossche · 2020-01-09T19:01:37Z

Ideally we'd retain the dtype in the value_counts().index, and it seems like we're saying here "well we can't do that, so let's shoehorn the dtype into the values"

IMO, ideally we retain the nullable dtype in both the index and the values. But indeed, for index we can't do that yet, so for now it's only the values.

It seems weird that the values would be anything other than np.int64

We are still returning an int dtype. But in the "nullable dtype universe"-subsystem, the "int64" dtype is the nullable int64 dtype.

TomAugspurger · 2020-01-09T19:05:35Z

So I think that's some +1s and a -0 (#30824 (comment)). Merging in an hour or so unless @jbrockmendel objects.

jbrockmendel · 2020-01-09T19:12:34Z

No objection here.

TomAugspurger · 2020-01-09T19:19:37Z

Thanks.

BUG: BooleanArray.value_counts dropna

4d1abac

Closes pandas-dev#30685

TomAugspurger added this to the 1.0 milestone Jan 8, 2020

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 8, 2020

jreback reviewed Jan 9, 2020

TomAugspurger added 2 commits January 9, 2020 07:57

Merge remote-tracking branch 'upstream/master' into 30685-value_counts

b3796a5

fixup

604f170

remove unused numpy_dtype

12cfbeb

jorisvandenbossche reviewed Jan 9, 2020

doc fixup

67c0c06

jreback reviewed Jan 9, 2020

TomAugspurger merged commit 8bdd7b1 into pandas-dev:master Jan 9, 2020

TomAugspurger deleted the 30685-value_counts branch January 9, 2020 19:19

This was referenced Apr 6, 2020

BUG: value_counts Int64 zero-size array to reduction #33317

Closed

value_counts not working correctly on (some?) ExtensionArrays #33172

Closed

dsaxton mentioned this pull request Apr 26, 2020

BUG: value_counts not working correctly on ExtensionArrays #33674

Merged

mroeschke mentioned this pull request Mar 29, 2022

API: value_counts with nullable dtype should return np.int64 like everything else #44679

Closed
