Skip to content

Fixes grouped rank bug with nullable types. #54460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Sep 22, 2023
Merged

Fixes grouped rank bug with nullable types. #54460

merged 29 commits into from
Sep 22, 2023

Conversation

Sai-Suraj-27
Copy link
Contributor

@Sai-Suraj-27 Sai-Suraj-27 commented Aug 8, 2023

@mroeschke mroeschke requested a review from rhshadrach August 8, 2023 20:31
@mroeschke mroeschke added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Aug 8, 2023
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Can you also put a note in the whatsnew for 2.1 under the groupby subsection of bugfixes.

@rhshadrach rhshadrach added Bug Transformations e.g. cumsum, diff, rank labels Aug 14, 2023
@Sai-Suraj-27
Copy link
Contributor Author

@rhshadrach sir, I have made all the suggested changes, can you once quickly review.

@Sai-Suraj-27
Copy link
Contributor Author

@rhshadrach sir, I have made all the suggested changes, and all the tests have passed. Please check now, Thank you.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Sorry - a few things I missed.

@@ -783,6 +783,7 @@ Plotting
Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
- Bug in :meth:`.DataFrameGroupBy.idxmin`, :meth:`.SeriesGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.SeriesGroupBy.idxmax` returns wrong dtype when used on an empty DataFrameGroupBy or SeriesGroupBy (:issue:`51423`)
- Bug in :meth:`DataFrame.groupby` when using the rank function on nullable datatypes when passing ``na_option="bottom"`` or ``na_option="top"`` (:issue:`54206`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do :meth:`DataFrameGroupBy.rank` here instead. Then no need to say "when using the rank function". This gives the reader a direct link to the method.

@@ -1492,6 +1492,9 @@ def _groupby_op(
else:
result_mask = np.zeros(ngroups, dtype=bool)

if op.how == "rank" and kwargs.get("na_option") in ["top", "bottom"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can just use how here instead of op.how. It avoids an attribute access, and relies less on the state of WrappedCythonOp.

Comment on lines 719 to 723
result_ext = df_ext.groupby("x", dropna=False)["x"].rank(
method="min", na_option=na_option
)
expected_result_ext = Series([1.0], dtype="float64", name=result_ext.name)
tm.assert_series_equal(result_ext, expected_result_ext, check_dtype=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the suffix _ext here? We typically just call these df, result, and expected. I think it's best to stick to that (unless there is a reason not to).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - why check_dtype=False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach sir, I made all the suggested changes. The reason for adding check_dtype=False was that initially the test was failing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for adding check_dtype=False was that initially the test was failing.

If the issue with the dtype is result (and not caused by this PR), then we should report a bug. If the issue is with expected, then we should change the dtype of expected to the correct one and remove check_dtype=False.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sai-Suraj-27: Friendly ping here; I think this is the last bit and then we're all set to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach sir, Sorry i missed this notification somehow, just saw it again today! I have removed the check_dtype=False condition. let's see if all the tests pass, Please give your review, Thank you.

@Sai-Suraj-27
Copy link
Contributor Author

@rhshadrach sir, I made the required changes and all the tests are passing, Can you please check now, Thank you.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit b43d79d into pandas-dev:main Sep 22, 2023
@rhshadrach
Copy link
Member

Thanks @Sai-Suraj-27

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays Transformations e.g. cumsum, diff, rank
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Grouped rank incorrect behaviour with nullable types
3 participants