BUG/API (string dtype): return float dtype for series[str].rank() #59768

jorisvandenbossche · 2024-09-09T20:33:18Z

This partially fixes a bug: in cases where the result actually needs to be a float but we try to convert it back to an int, we currently get an error:

In [58]: pd.Series([2, 1, 1]).rank()
Out[58]: 
0    3.0
1    1.5
2    1.5
dtype: float64

In [59]: pd.Series(["2", "1", "1"], dtype="string[pyarrow]").rank()
...
File ~/scipy/repos/pandas/pandas/core/arrays/string_arrow.py:445, in _convert_int_result(self, result)

File ~/scipy/repos/pandas/pandas/core/arrays/numeric.py:93, in NumericDtype.__from_arrow__(self, array)
---> 93     array = array.cast(pyarrow_type)
 ...
ArrowInvalid: Float value 1.5 was truncated converting to int64
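
To illustrate why an integer result cannot always represent the ranks, here is a minimal pure-Python sketch of average-method ranking. The helper name `average_rank` is hypothetical and this is not the pandas or pyarrow implementation; it only shows that tied values share the mean of their positions, which is fractional whenever an even number of values tie.

```python
def average_rank(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    # Indices of the values in ascending order (sorted() is stable).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions are 1-based, so this group occupies start+1 .. end+1.
        mean_position = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


ranks = average_rank(["2", "1", "1"])
print(ranks)  # [3.0, 1.5, 1.5]
# Not every rank is a whole number, so a lossless cast to int is impossible.
print(all(r.is_integer() for r in ranks))  # False
```

The two tied `"1"` values occupy positions 1 and 2, so each gets rank 1.5, the same value that triggers the truncation error in the traceback above.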

But in general we should also decide what to return here. For our default dtypes, it seems we decided in the past to simply always return float64, even for rank methods that could return ints.
For the ArrowDtype, then, it was decided to keep the dtype returned by pyarrow: #50264
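
As a small illustration of the default-dtype behaviour described here, the sketch below ranks a NumPy-backed integer Series with `method="first"`, which never produces fractional ranks. It assumes only that pandas is installed and uses no pyarrow-backed dtypes.

```python
import pandas as pd

# method="first" breaks ties by order of appearance, so every rank is a
# whole number, yet the result for a default int64 Series is still float64.
ints = pd.Series([3, 1, 2])
ranked = ints.rank(method="first")

print(ranked.tolist())  # [3.0, 1.0, 2.0]
print(ranked.dtype)     # float64
```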

For now, this PR updates StringDtype to simply always return float64, to be consistent between the pyarrow vs python storage. But we could also consider, for those newer dtypes, to actually keep the distinction between float/int results (just not always int like it is done now).

mroeschke

Is it worth an entry in the 2.3.0 whatsnew?

WillAyd · 2024-09-11T13:23:28Z

pandas/tests/frame/methods/test_rank.py

@@ -507,7 +490,9 @@ def test_rank_string_dtype(self, string_dtype_no_object):
        # GH#55362
        obj = Series(["foo", "foo", None, "foo"], dtype=string_dtype_no_object)
        result = obj.rank(method="first")
-        exp_dtype = "Int64" if string_dtype_no_object.na_value is pd.NA else "float64"
+        exp_dtype = (
+            "Float64" if string_dtype_no_object == "string[pyarrow]" else "float64"


Why does this change the string[python] case to float64 from Float64? I guess yet another discussion for PDEP-13, but shouldn't the if string_dtype_no_object.na_value invariant still remain?

To be clear this PR doesn't "change" anything regarding the flavor of dtypes (just changing int to float) and is only testing what we have right now, but you are certainly right this is inconsistent.

The technical reason is (AFAIK) that we just never implemented (yet) rank for our nullable dtypes in general (so also the nullable int returns numpy floats instead of nullable float). This can be considered as a missing part of the implementation.
And at some point we added a custom rank implementation on ArrowExtensionArray using pyarrow compute, and because of the subclass structure, the ArrowStringArray (i.e. "string[pyarrow]") inherits that but converts the pyarrow result to a nullable dtype (the "string[python]" array class does not inherit from this, so it uses the base class rank implementation, which always returns numpy types).
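
The inheritance effect described above can be sketched with a toy class hierarchy. All class and method names below are hypothetical stand-ins, not the real pandas classes or their internals, and the returned strings are only labels for the kind of dtype each path produces.

```python
class ToyBaseArray:
    """Stand-in for the generic base array class."""

    def rank(self):
        # Generic path: always reports a NumPy float result.
        return "float64 (numpy)"


class ToyArrowArray(ToyBaseArray):
    """Stand-in for an Arrow-backed array with its own rank override."""

    def rank(self):
        # Arrow path: compute with pyarrow, then let subclasses post-process.
        return self._convert_result("pyarrow-native dtype")

    def _convert_result(self, result):
        return result


class ToyArrowStringArray(ToyArrowArray):
    """Stand-in for the "string[pyarrow]" array: inherits the Arrow rank."""

    def _convert_result(self, result):
        # Converts the pyarrow result to a nullable dtype.
        return "Float64 (nullable)"


class ToyPythonStringArray(ToyBaseArray):
    """Stand-in for the "string[python]" array: no Arrow parent."""


print(ToyArrowStringArray().rank())   # Float64 (nullable)
print(ToyPythonStringArray().rank())  # float64 (numpy)
```

The two string classes never override `rank` themselves; the different result flavors come entirely from which parent class each one happens to inherit from.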

Ah OK...very confusing. Thanks for clarifying

jorisvandenbossche · 2024-09-11T19:47:18Z

Is it worth an entry in the 2.3.0 whatsnew?

Yes, good idea, added one.

mroeschke · 2024-09-12T21:08:40Z

Thanks @jorisvandenbossche

…ndas-dev#59768) * BUG/API (string dtype): return float dtype for series[str].rank() * update frame tests * add whatsnew * correct whatsnew note

BUG/API (string dtype): return float dtype for series[str].rank()

5432f2a

jorisvandenbossche added the Strings (String extension data type and string data), Arrow (pyarrow functionality), and Transformations (e.g. cumsum, diff, rank) labels Sep 9, 2024

jorisvandenbossche requested review from WillAyd and lukemanley September 9, 2024 20:33

jorisvandenbossche added this to the 2.3 milestone Sep 9, 2024

jorisvandenbossche mentioned this pull request Sep 9, 2024

TST (string dtype): remove usage of 'string[pyarrow_numpy]' alias #59758

Merged

jorisvandenbossche added 2 commits September 9, 2024 22:34

Merge remote-tracking branch 'upstream/main' into string-dtype-rank

5aae560

update frame tests

b9c4454

mroeschke reviewed Sep 10, 2024

WillAyd requested changes Sep 11, 2024

jorisvandenbossche added 2 commits September 11, 2024 21:42

Merge remote-tracking branch 'upstream/main' into string-dtype-rank

af216fb

add whatsnew

7cd34df

correct whatsnew note

b78acd2

WillAyd approved these changes Sep 12, 2024

mroeschke approved these changes Sep 12, 2024

mroeschke merged commit 2c49f55 into pandas-dev:main Sep 12, 2024
41 of 47 checks passed

jorisvandenbossche deleted the string-dtype-rank branch September 13, 2024 20:54

jorisvandenbossche added the backported label Oct 10, 2024
