Skip to content

API: value_counts with nullable dtype should return np.int64 like everything else #44679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Nov 30, 2021 · 8 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API - Consistency Internal Consistency of API/Behavior NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Nov 30, 2021

ATM these are special-cased to return Int64 (i.e. nullable) instead of np.int64. But the result of value_counts will never have any NAs, so there is no benefit. It complicates the code, complicates the API, and prevents us from sharing tests.

These should return np.int64 dtype like everything else.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 30, 2021
@lithomas1 lithomas1 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API - Consistency Internal Consistency of API/Behavior NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 30, 2021
@jorisvandenbossche
Copy link
Member

The return value of value_counts indeed never contains NAs, but further operations with this result still can introduce missing values. And then the return type of value_counts does matter.

The nullable dtypes are optional, but once opting in, we should IMO keep using them as much as possible for results (eg also fillna keeps the nullable dtype, although the result has no NAs)

@jreback
Copy link
Contributor

jreback commented Dec 22, 2021

i agree with @jbrockmendel here. if we know that a type is int64 by-definition / always. I don't think we should just return it for this opearation. The simplication argument is persuasive.

@jorisvandenbossche
Copy link
Member

It is by definition an integer dtype with 64 bitwidth, but whether we choose Int64 or int64 is just an API choice (Int64 is also an int64 dtype).
There are quite some other operations where we preserve nullable dtypes, even if there are no missing values.

As mentioned above, while the return value of value_counts itself doesn't have missing values, follow-up operations in your pipeline could introduce them in which case the actual dtype makes a difference. A small example:

>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> s
0    1
1    2
2    1
3    2
4    4
dtype: Int64
>>> s.value_counts().reindex(list(range(5)))
0    <NA>
1       2
2       2
3    <NA>
4       1
dtype: Int64
>>> s.value_counts().reindex(list(range(5))).fillna(0)
0    0
1    2
2    2
3    0
4    1
dtype: Int64

If we return numpy.int64, the last two examples would be float data.

IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.

sure but this point is not relevant

what is relevant is that we should just pick a return type

it's pretty crazy that the output type is different here

so we need to either pick int64 or Int64 for the return value always

@mroeschke
Copy link
Member

I could see there being a consistency argument to return Int64 if there's a strong push to make the nullable numeric types the return type for all pandas operations eventually.

If not (or not in the near future), I think the simplicity of return np.int64 makes sense and the option of casting with astype("Int64") post value_counts could be left to the user.

@rhshadrach
Copy link
Member

Once a user has opted into nullable dtypes, it feels expected to me for pandas to continue to use nullable dtypes even if it doesn't have to (e.g. fillna). I think this is the way most ops work although admittedly, many of them can have a result with null values and so maybe "shouldn't count". By always returning int64, it seems to me we'd be creating special-cased behavior because of the peculiarities of the op itself, something which users may find surprising.

I do agree that we can have a user cast back to nullable dtypes if necessary, but I don't think of this as a preferred solution. It makes trying to use nullable dtypes more of a hassel.

@mroeschke
Copy link
Member

This looks to be the PR & discussion where Int64 return type was made: #30824

It appears the motivation was to promote the new nullable dtype back in 1.0

@jbrockmendel
Copy link
Member Author

closing as never-gonna-happen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API - Consistency Internal Consistency of API/Behavior NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants