Skip to content

BUG/API (string dtype): return float dtype for series[str].rank() #59768

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 12, 2024

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Sep 9, 2024

Noticed in #59758 (comment)

This is partially fixing a bug, because currently for cases where we actually need a float result but try to convert it back to an int, we get an error:

In [58]: pd.Series([2, 1, 1]).rank()
Out[58]: 
0    3.0
1    1.5
2    1.5
dtype: float64

In [59]: pd.Series(["2", "1", "1"], dtype="string[pyarrow]").rank()
...
File ~/scipy/repos/pandas/pandas/core/arrays/string_arrow.py:445, in _convert_int_result(self, result)

File ~/scipy/repos/pandas/pandas/core/arrays/numeric.py:93, in NumericDtype.__from_arrow__(self, array)
---> 93     array = array.cast(pyarrow_type)
 ...
ArrowInvalid: Float value 1.5 was truncated converting to int64

But in general we should also decide what to do return here. For our default dtypes, it seems we decided in the past to simply always return float64, even for rank methods that could return ints.
For the ArrowDtype, then, it was decided to keep the dtype returned by pyarrow: #50264

For now, this PR updates StringDtype to simply always return float64, to be consistent between the pyarrow vs python storage. But we could also consider, for those newer dtypes, to actually keep the distinction between float/int results (just not always int like it is done now).

@jorisvandenbossche jorisvandenbossche added Strings String extension data type and string data Arrow pyarrow functionality Transformations e.g. cumsum, diff, rank labels Sep 9, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Sep 9, 2024
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth an entry in the 2.3.0 whatsnew?

@@ -507,7 +490,9 @@ def test_rank_string_dtype(self, string_dtype_no_object):
# GH#55362
obj = Series(["foo", "foo", None, "foo"], dtype=string_dtype_no_object)
result = obj.rank(method="first")
exp_dtype = "Int64" if string_dtype_no_object.na_value is pd.NA else "float64"
exp_dtype = (
"Float64" if string_dtype_no_object == "string[pyarrow]" else "float64"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this change the string[python] case to float64 from Float64? I guess yet another discussion for PDEP-13, but shouldn't the if string_dtype_no_object.na_value invariant still remain?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear this PR doesn't "change" anything regarding the flavor of dtypes (just changing int to float) and is only testing what we have right now, but you are certainly right this is inconsistent.

The technical reason is (AFAIK) that we just never implemented (yet) rank for our nullable dtypes in general (so also the nullable int returns numpy floats instead of nullable float). This can be considered as a missing part of the implementation.
And at some point we added a custom rank implementation on ArrowExtensionArray using pyarrow compute, and because of the subclass structure, the ArrowStringArray (i.e. "string[pyarrow]") inherits that but converts the pyarrow result to a nullable dtype (and "string[python]" array class does not inherit from this, so uses the base class rank implementation with always returns numpy types).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK...very confusing. Thanks for clarifying

@jorisvandenbossche
Copy link
Member Author

Is it worth an entry in the 2.3.0 whatsnew?

Yes, good idea, added one.

@mroeschke mroeschke merged commit 2c49f55 into pandas-dev:main Sep 12, 2024
41 of 47 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-rank branch September 13, 2024 20:54
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
…ndas-dev#59768)

* BUG/API (string dtype): return float dtype for series[str].rank()

* update frame tests

* add whatsnew

* correct whatsnew note
jorisvandenbossche added a commit that referenced this pull request Oct 10, 2024
…9768)

* BUG/API (string dtype): return float dtype for series[str].rank()

* update frame tests

* add whatsnew

* correct whatsnew note
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality backported Strings String extension data type and string data Transformations e.g. cumsum, diff, rank
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants