-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/API (string dtype): return float dtype for series[str].rank() #59768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/API (string dtype): return float dtype for series[str].rank() #59768
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth an entry in the 2.3.0 whatsnew?
@@ -507,7 +490,9 @@ def test_rank_string_dtype(self, string_dtype_no_object): | |||
# GH#55362 | |||
obj = Series(["foo", "foo", None, "foo"], dtype=string_dtype_no_object) | |||
result = obj.rank(method="first") | |||
exp_dtype = "Int64" if string_dtype_no_object.na_value is pd.NA else "float64" | |||
exp_dtype = ( | |||
"Float64" if string_dtype_no_object == "string[pyarrow]" else "float64" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this change the string[python]
case to float64
from Float64
? I guess yet another discussion for PDEP-13, but shouldn't the if string_dtype_no_object.na_value
invariant still remain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear this PR doesn't "change" anything regarding the flavor of dtypes (just changing int to float) and is only testing what we have right now, but you are certainly right this is inconsistent.
The technical reason is (AFAIK) that we just never implemented (yet) rank
for our nullable dtypes in general (so also the nullable int returns numpy floats instead of nullable float). This can be considered as a missing part of the implementation.
And at some point we added a custom rank
implementation on ArrowExtensionArray using pyarrow compute, and because of the subclass structure, the ArrowStringArray (i.e. "string[pyarrow]"
) inherits that but converts the pyarrow result to a nullable dtype (and "string[python]"
array class does not inherit from this, so uses the base class rank
implementation with always returns numpy types).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK...very confusing. Thanks for clarifying
Yes, good idea, added one. |
Thanks @jorisvandenbossche |
…ndas-dev#59768) * BUG/API (string dtype): return float dtype for series[str].rank() * update frame tests * add whatsnew * correct whatsnew note
…9768) * BUG/API (string dtype): return float dtype for series[str].rank() * update frame tests * add whatsnew * correct whatsnew note
Noticed in #59758 (comment)
This is partially fixing a bug, because currently for cases where we actually need a float result but try to convert it back to an int, we get an error:
But in general we should also decide what to do return here. For our default dtypes, it seems we decided in the past to simply always return float64, even for rank
method
s that could return ints.For the ArrowDtype, then, it was decided to keep the dtype returned by pyarrow: #50264
For now, this PR updates
StringDtype
to simply always return float64, to be consistent between the pyarrow vs python storage. But we could also consider, for those newer dtypes, to actually keep the distinction between float/int results (just not always int like it is done now).