API: value_counts with nullable dtype should return np.int64 like everything else #44679

jbrockmendel · 2021-11-30T00:39:10Z

ATM these are special-cased to return Int64 (i.e. nullable) instead of np.int64. But the result of value_counts will never have any NAs, so there is no benefit. It complicates the code, complicates the API, and prevents us from sharing tests.

These should return np.int64 dtype like everything else.
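For illustration, a minimal sketch of the special case being described, assuming a pandas version in which nullable input still yields nullable counts. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

values = [1, 2, 1, 2, 4]

# Same data twice: once with the nullable extension dtype,
# once with a plain NumPy int64 dtype.
nullable_counts = pd.Series(values, dtype="Int64").value_counts()
numpy_counts = pd.Series(values, dtype="int64").value_counts()

# The counts are identical; only the dtype of the result differs.
print(nullable_counts.dtype)
print(numpy_counts.dtype)
```

Neither result contains any NA, which is the basis for the argument that the nullable return type brings no benefit here.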

jorisvandenbossche · 2021-11-30T13:09:46Z

The return value of value_counts indeed never contains NAs, but further operations on this result can still introduce missing values. And then the return type of value_counts does matter.

The nullable dtypes are optional, but once a user has opted in, we should IMO keep using them as much as possible for results (e.g. fillna also keeps the nullable dtype, although the result has no NAs).
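The fillna comparison can be checked with a short sketch (the example data is invented for illustration):

```python
import pandas as pd

# A nullable integer Series with one missing value.
s = pd.Series([1, pd.NA, 3], dtype="Int64")

# After fillna there are no NAs left, yet the dtype stays nullable.
filled = s.fillna(0)
print(filled.dtype)
print(filled.isna().any())
```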

jreback · 2021-12-22T22:39:58Z

i agree with @jbrockmendel here. if we know that a type is int64 by definition / always, I don't think we should just return it for this operation. The simplification argument is persuasive.

jorisvandenbossche · 2021-12-23T19:40:38Z

It is by definition an integer dtype with 64 bitwidth, but whether we choose Int64 or int64 is just an API choice (Int64 is also an int64 dtype).
There are quite a few other operations where we preserve nullable dtypes, even if there are no missing values.

As mentioned above, while the return value of value_counts itself doesn't have missing values, follow-up operations in your pipeline could introduce them in which case the actual dtype makes a difference. A small example:

>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> s
0    1
1    2
2    1
3    2
4    4
dtype: Int64
>>> s.value_counts().reindex(list(range(5)))
0    <NA>
1       2
2       2
3    <NA>
4       1
dtype: Int64
>>> s.value_counts().reindex(list(range(5))).fillna(0)
0    0
1    2
2    2
3    0
4    1
dtype: Int64

If we return numpy.int64, the last two examples would be float data.
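For comparison, a sketch of the same pipeline starting from NumPy-backed counts. It uses a plain int64 Series as a stand-in for the proposed return type, which is an assumption made for this illustration.

```python
import pandas as pd

# Same data as above, but with a NumPy int64 dtype, so value_counts
# returns NumPy int64 counts.
s = pd.Series([1, 2, 1, 2, 4], dtype="int64")

# Reindexing introduces missing labels (0 and 3); NumPy integers cannot
# hold NaN, so the result is upcast to float64.
reindexed = s.value_counts().reindex(list(range(5)))
print(reindexed.dtype)

# Filling the gaps does not restore the integer dtype.
filled = reindexed.fillna(0)
print(filled.dtype)
```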

IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.

jreback · 2021-12-23T20:35:22Z

> IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.

sure but this point is not relevant

what is relevant is that we should just pick a return type

it's pretty crazy that the output type is different here

so we need to either pick int64 or Int64 for the return value always

mroeschke · 2022-03-21T16:53:13Z

I could see there being a consistency argument to return Int64 if there's a strong push to make the nullable numeric types the return type for all pandas operations eventually.

If not (or not in the near future), I think the simplicity of returning np.int64 makes sense, and the option of casting with astype("Int64") after value_counts could be left to the user.
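A sketch of what that user-side cast might look like. Here the astype("int64") call only simulates the proposed NumPy return type; it is an assumption for illustration, not current behavior.

```python
import pandas as pd

s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")

# Stand-in for the proposal: pretend value_counts returned NumPy int64.
numpy_counts = s.value_counts().astype("int64")

# The user opts back into the nullable dtype explicitly...
counts = numpy_counts.astype("Int64")

# ...so that later operations that introduce missing values keep integers.
result = counts.reindex(list(range(5))).fillna(0)
print(result.dtype)
```

The cast has to happen before the operation that introduces missing values, otherwise the counts have already been upcast to float.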

rhshadrach · 2022-03-28T22:32:52Z

Once a user has opted into nullable dtypes, it feels expected to me for pandas to continue to use nullable dtypes even if it doesn't have to (e.g. fillna). I think this is the way most ops work although admittedly, many of them can have a result with null values and so maybe "shouldn't count". By always returning int64, it seems to me we'd be creating special-cased behavior because of the peculiarities of the op itself, something which users may find surprising.

I do agree that we can have a user cast back to nullable dtypes if necessary, but I don't think of this as a preferred solution. It makes trying to use nullable dtypes more of a hassle.

mroeschke · 2022-03-29T01:03:33Z

This looks to be the PR & discussion where the Int64 return type was introduced: #30824

It appears the motivation was to promote the new nullable dtype back in 1.0

jbrockmendel · 2023-02-04T21:28:56Z

closing as never-gonna-happen

jbrockmendel added the Bug and Needs Triage labels on Nov 30, 2021

jbrockmendel mentioned this issue on Nov 30, 2021: API: value_counts consistently return int64 dtype #44692 (Closed)

jbrockmendel mentioned this issue on Mar 21, 2022: REF: Move value_counts, take, factorize to ArrowExtensionArray #46453 (Merged)

jbrockmendel closed this as completed on Feb 4, 2023
