API: value_counts with nullable dtype should return np.int64 like everything else #44679
Comments
The return value of The nullable dtypes are optional, but once opting in, we should IMO keep using them as much as possible for results (eg also fillna keeps the nullable dtype, although the result has no NAs) |
i agree with @jbrockmendel here. if we know that a type is int64 by-definition / always. I don't think we should just return it for this operation. The simplification argument is persuasive. |
It is by definition an integer dtype with 64 bitwidth, but whether we choose As mentioned above, while the return value of
>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> s
0 1
1 2
2 1
3 2
4 4
dtype: Int64
>>> s.value_counts().reindex(list(range(5)))
0 <NA>
1 2
2 2
3 <NA>
4 1
dtype: Int64
>>> s.value_counts().reindex(list(range(5))).fillna(0)
0 0
1 2
2 2
3 0
4 1
dtype: Int64
If we return numpy.int64, the last two examples would be float data. IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type. |
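A minimal sketch of the contrast described in this comment. The `counts` Series is built by hand as a hypothetical stand-in for a numpy-backed `value_counts` result, so the snippet does not depend on what a particular pandas version returns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a value_counts result stored as plain numpy int64.
counts = pd.Series([2, 2, 1], index=[1, 2, 4], dtype=np.int64)

# Reindexing introduces missing labels; numpy int64 cannot hold them,
# so pandas upcasts the result to float64 and uses NaN.
reindexed = counts.reindex(range(5))
filled = reindexed.fillna(0)
print(reindexed.dtype, filled.dtype)

# The same steps on the nullable dtype keep integer data throughout.
nullable_filled = counts.astype("Int64").reindex(range(5)).fillna(0)
print(nullable_filled.dtype)
```

The float result persists after `fillna(0)` because filling does not cast the values back to integers.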
Sure, but this point is not relevant. What is relevant is that we should just pick a return type; it's pretty crazy that the output type is different here. So we need to pick either int64 or Int64 for the return value, always. |
I could see there being a consistency argument to return Int64 if there's a strong push to make the nullable numeric types the return type for all pandas operations eventually. If not (or not in the near future), I think the simplicity of return |
Once a user has opted into nullable dtypes, it feels expected to me for pandas to continue to use nullable dtypes even if it doesn't have to (e.g. fillna). I think this is the way most ops work although admittedly, many of them can have a result with null values and so maybe "shouldn't count". By always returning int64, it seems to me we'd be creating special-cased behavior because of the peculiarities of the op itself, something which users may find surprising. I do agree that we can have a user cast back to nullable dtypes if necessary, but I don't think of this as a preferred solution. It makes trying to use nullable dtypes more of a hassle. |
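For reference, the "cast back" workaround mentioned here might look like the sketch below. The explicit `astype("Int64")` is a no-op when the counts are already nullable, so the snippet behaves the same under either return type:

```python
import pandas as pd

s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")

# Explicitly request the nullable dtype, whatever value_counts returned.
counts = s.value_counts().astype("Int64")

# Later steps that introduce missing labels then stay in integer data.
padded = counts.reindex(range(5)).fillna(0)
print(padded.dtype)
```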
This looks to be the PR & discussion where the Int64 return type was introduced: #30824. It appears the motivation was to promote the new nullable dtype back in 1.0 |
closing as never-gonna-happen |
ATM these are special-cased to return Int64 (i.e. nullable) instead of np.int64. But the result of value_counts will never have any NAs, so there is no benefit. It complicates the code, complicates the API, and prevents us from sharing tests.
These should return np.int64 dtype like everything else.
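As a rough illustration of the claim that the counts themselves never contain NAs: even when the input has missing values and `dropna=False` is passed, the NA shows up as an index label, not as a count. The input data below is made up for the example:

```python
import numpy as np
import pandas as pd

# Made-up input with one missing value.
s = pd.Series([1, 2, 1, None], dtype="Int64")

# With dropna=False the NA becomes a label in the index; every count is a real integer.
counts = s.value_counts(dropna=False)
print(counts.isna().any())

# Because no count is missing, an explicit cast to numpy int64 cannot fail on NA.
as_numpy = counts.astype("int64")
print(as_numpy.dtype)
```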