DOC: Improve groupby().ngroup() explanation for missing groups #50049


Merged

merged 4 commits into pandas-dev:main from docs-nan-ngroup on Jan 3, 2023

Conversation

NickCrews (Contributor)

The exact prose and examples are open to improvements if you have ideas.

I figure this didn't require a separate issue, please let me know if I'm missing some step in the process.

@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True):

Examples
--------
>>> df = pd.DataFrame({"A": list("aaabba")})
>>> df = pd.DataFrame()
Member

Nit: Could you combine the DataFrame construction in 1 line (construct the DataFrame from dictionary?)

@datapythonista (Member) left a comment

Thanks @NickCrews, nice improvement. Added a couple of comments. Sorry for the late review; I hope you still have time to address the comments, it would be great to see this merged.

Comment on lines 3215 to 3216
If a group would be excluded (due to null keys) then that
group is labeled as np.nan. See examples below.
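As a concrete sketch of the behavior those lines describe (illustrative only; this DataFrame is not part of the PR diff), with the default dropna=True a row whose key is null gets a NaN label and the result is upcast to float:

        >>> import pandas as pd
        >>> df = pd.DataFrame({"color": ["red", None, "green"]})
        >>> df.groupby("color").ngroup()
        0    1.0
        1    NaN
        2    0.0
        dtype: float64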
Member

I think it could be simpler to understand if we just say something like: "Groups where the key is missing are skipped from the count, and their value will be NaN."

Member

I'd prefer NA as opposed to "missing". There are various reasons a value can be NA; missing is but one of them. I'd also recommend using NA over NaN or null, as it matches dropna and pd.NA.

Member

Fine by me. But in this case, if we say NA it may sound like pandas.NA, and if the value is None or NaN it will also be skipped. So "missing" seemed more appropriate to me.

Member

Maybe "not available" instead of "NA"? It is more generic than "missing" and doesn't sound like we are talking about pandas.NA.

Member

Sorry - that was poorly written. My main reason is to connect it with the dropna argument. I threw in pd.NA as an indicator that we use NA with somewhat good consistency. E.g. dropna, pd.NA, isna, use_inf_as_na, fillna, na_filter. I think we should maintain this.

Contributor Author

I adjusted it to "Groups with missing keys (where pd.isna() is True) will be labeled with NaN and will be skipped from the count." This is definitely the right direction; I'm still just not sure about the "missing" language.

I agree with

But in this case if we say NA it may sound like pandas.NA, and if the value is None or NaN it will also be skipped.

so I avoided saying simply "groups with NA keys...". I think my phrasing is maybe a little longer, but it is extremely precise and fairly easy to understand. Another option I would also be satisfied with is "Groups with NA keys (where pd.isna() is True)".
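For context, "where pd.isna() is True" covers both None and np.nan keys. A minimal sketch (not taken from the PR) showing that both kinds of keys are skipped from the count by default:

        >>> import numpy as np
        >>> import pandas as pd
        >>> df = pd.DataFrame({"key": ["a", None, np.nan, "b"]})
        >>> df.groupby("key").ngroup()
        0    0.0
        1    NaN
        2    NaN
        3    1.0
        dtype: float64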

Member

Thanks - this looks great to me.

@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True):

Examples
--------
>>> df = pd.DataFrame({"A": list("aaabba")})
>>> df = pd.DataFrame()
Member

Agree with Matt. Also, I think the examples can be simplified: having just a single column with the None value is enough; there's no need to show the example without the None first and with it later.

Also, unrelated to your changes, but I'd remove `df.groupby(["A", [1,1,2,3,2,1]]).ngroup()`. This is a very specific case, not directly related to ngroup, and grouping with provided values is not even shown in the examples of groupby, so it makes less sense to show it here. I think it mostly adds confusion to this docstring.

And also, I think it's better for users if the examples are a bit more meaningful than random letters. Using something like `df = pd.DataFrame({'color': ['red', 'red', 'green', None, 'red', 'green']})` (or with animals, jobs, whatever) would make the example a bit easier to understand. Just as an idea; if you want to keep this PR only for the example you're trying to illustrate, that's surely fine.
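For reference, the example suggested above would behave roughly like this (a sketch using the proposed DataFrame; the docstring that was eventually merged may differ). With dropna=False the NA key forms its own group, labeled after the other keys:

        >>> import pandas as pd
        >>> df = pd.DataFrame({'color': ['red', 'red', 'green', None, 'red', 'green']})
        >>> df.groupby("color", dropna=False).ngroup()
        0    1
        1    1
        2    0
        3    2
        4    1
        5    0
        dtype: int64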

Contributor Author

I agree with all of these suggestions. Fixup PR incoming.

4 1.0
5 0.0
dtype: float64
>>> df.groupby("color", dropna=False).ngroup()
Contributor Author

I dropped the ascending example; I figured it was obvious enough. Do you think I should keep it in?

@rhshadrach (Member) commented Jan 3, 2023

I think we typically demonstrate all the features of a method (ignoring methods with a large amount of arguments), regardless of how obvious they might seem. I think it should be retained.

Contributor Author

I restored

        >>> df.groupby("color", dropna=False).ngroup(ascending=False)
        0    1
        1    0
        2    1
        3    2
        4    2
        5    1
        dtype: int64

I chose to use `dropna=False` because I wanted to show that NA keys are placed BEFORE other keys. I figured the `dropna=True` example was obvious enough from this and I didn't need that one as well.

Now I guess that, since the groups are lexicographically sorted and we are using "red" and "blue" instead of "a" and "b", the ngroup labels have swapped order. I therefore think this should be deterministic and not flaky.
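For comparison, a sketch of the default ascending=True labels for the same setup. This assumes a DataFrame like the one below, which is consistent with the output shown above but is not quoted in this thread:

        >>> import pandas as pd
        >>> df = pd.DataFrame({"color": ["red", None, "red", "blue", "blue", "red"]})
        >>> df.groupby("color", dropna=False).ngroup()
        0    1
        1    2
        2    1
        3    0
        4    0
        5    1
        dtype: int64

With ascending=False these labels are reversed, which is how the NA group ends up with label 0 in the restored example above.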
@NickCrews (Contributor Author) commented Jan 2, 2023

@datapythonista @mroeschke @rhshadrach this is ready for re-review! Sorry for the tag if this is already in your queue :)

That failing test looks unrelated to this change.

@datapythonista (Member) left a comment

lgtm, no strong opinion about the ascending example; from my side it's fine both with and without it. Thanks @NickCrews

I chose to use `dropna=False` because I wanted to show that NA keys are placed BEFORE other keys. I figured the `dropna=True` example was obvious enough from this and I didn't need that one as well; otherwise I thought things got very verbose.
@mroeschke (Member) left a comment

LGTM merge when ready @rhshadrach

@mroeschke added this to the 2.0 milestone on Jan 3, 2023
@rhshadrach (Member) left a comment

lgtm

@rhshadrach merged commit 58a0b6f into pandas-dev:main on Jan 3, 2023
@rhshadrach (Member)

Thanks @NickCrews

@NickCrews deleted the docs-nan-ngroup branch on January 4, 2023 at 02:45