DOC: Improve groupby().ngroup() explanation for missing groups #50049


Merged

merged 4 commits into pandas-dev:main from docs-nan-ngroup on Jan 3, 2023

Conversation

NickCrews (Contributor)

The exact prose and examples are open to improvements if you have ideas.

I figure this didn't require a separate issue, please let me know if I'm missing some step in the process.

@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True):

Examples
--------
>>> df = pd.DataFrame({"A": list("aaabba")})
>>> df = pd.DataFrame()
Member

Nit: Could you combine the DataFrame construction in 1 line (construct the DataFrame from dictionary?)

@datapythonista (Member) left a comment

Thanks @NickCrews, nice improvement. Added a couple of comments. Sorry for the late review; I hope you still have time to address the comments, it would be great to see this merged.

Comment on lines 3215 to 3216
If a group would be excluded (due to null keys) then that
group is labeled as np.nan. See examples below.
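As a concrete sketch of the behavior those lines describe (illustrative only; this DataFrame is not part of the PR diff), with the default dropna=True a row whose key is null gets a NaN label and the result is upcast to float:

        >>> import pandas as pd
        >>> df = pd.DataFrame({"color": ["red", None, "green"]})
        >>> df.groupby("color").ngroup()
        0    1.0
        1    NaN
        2    0.0
        dtype: float64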
Member

I think it could be simpler to understand if we just say something like: "Groups where the key is missing are skipped from the count, and their value will be NaN."

Member

I'd prefer NA as opposed to "missing". There are various reasons a value can be NA; missing is but one of them. I'd also recommend using NA over NaN or null, as it matches dropna and pd.NA.

Member

Fine by me. But in this case, if we say NA it may sound like pandas.NA, and if the value is None or NaN it will also be skipped. So "missing" seemed more appropriate to me.

Member

Maybe "not available" instead of "NA"? It is more generic than "missing" and doesn't sound like we are talking about pandas.NA.

Member

Sorry - that was poorly written. My main reason is to connect it with the dropna argument. I threw in pd.NA as an indicator that we use NA with somewhat good consistency. E.g. dropna, pd.NA, isna, use_inf_as_na, fillna, na_filter. I think we should maintain this.

Contributor Author

I adjusted it to "Groups with missing keys (where pd.isna() is True) will be labeled with NaN and will be skipped from the count." This is definitely the right direction; I'm still just not sure about the "missing" language.

I agree with

But in this case if we say NA it may sound like pandas.NA, and if the value is None or NaN it will also be skipped.

so I avoided saying simply "groups with NA keys...". I think my phrasing is maybe a little longer, but it is extremely precise and fairly easy to understand. Another option I would also be satisfied with is "Groups with NA keys (where pd.isna() is True)".
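For context, "where pd.isna() is True" covers both None and np.nan keys. A minimal sketch (not taken from the PR) showing that both kinds of keys are skipped from the count by default:

        >>> import numpy as np
        >>> import pandas as pd
        >>> df = pd.DataFrame({"key": ["a", None, np.nan, "b"]})
        >>> df.groupby("key").ngroup()
        0    0.0
        1    NaN
        2    NaN
        3    1.0
        dtype: float64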

Member

Thanks - this looks great to me.

@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True):

Examples
--------
>>> df = pd.DataFrame({"A": list("aaabba")})
>>> df = pd.DataFrame()
Member

Agree with Matt. Also, I think the examples can be simplified: having just a single column with the None value is enough; there's no need to show the example without the None first and with it later.

Also, unrelated to your changes, but I'd remove `df.groupby(["A", [1,1,2,3,2,1]]).ngroup()`. This is a very specific case, not directly related to ngroup, and grouping with provided values is not even shown in the examples of groupby, so it makes less sense to show it here. I think it mostly adds confusion to this docstring.

And also, I think it's better for users if the examples are a bit more meaningful than random letters. Using something like `df = pd.DataFrame({'color': ['red', 'red', 'green', None, 'red', 'green']})` (or with animals, jobs, whatever) would make the example a bit easier to understand. Just as an idea; if you want to keep this PR only for the example you're trying to illustrate, that's surely fine.
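For reference, the example suggested above would behave roughly like this (a sketch using the proposed DataFrame; the docstring that was eventually merged may differ). With dropna=False the NA key forms its own group, labeled after the other keys:

        >>> import pandas as pd
        >>> df = pd.DataFrame({'color': ['red', 'red', 'green', None, 'red', 'green']})
        >>> df.groupby("color", dropna=False).ngroup()
        0    1
        1    1
        2    0
        3    2
        4    1
        5    0
        dtype: int64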

Contributor Author

I agree with all of these suggestions. Fixup PR incoming.

4 1.0
5 0.0
dtype: float64
>>> df.groupby("color", dropna=False).ngroup()
Contributor Author

I dropped the ascending example; I figured it was obvious enough. Do you think I should keep it in?

@rhshadrach (Member) commented Jan 3, 2023

I think we typically demonstrate all the features of a method (ignoring methods with a large amount of arguments), regardless of how obvious they might seem. I think it should be retained.

Contributor Author

I restored

        >>> df.groupby("color", dropna=False).ngroup(ascending=False)
        0    1
        1    0
        2    1
        3    2
        4    2
        5    1
        dtype: int64

I chose to use `dropna=False` because I wanted to show that NA keys are placed BEFORE other keys. I figured the `dropna=True` example was obvious enough from this and I didn't need that one as well.

Now I guess that, since the groups are lexicographically sorted and we are using "red" and "blue" instead of "a" and "b", the ngroup labels have swapped order. I therefore think this should be deterministic and not flaky.
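For comparison, a sketch of the default ascending=True labels for the same setup. This assumes a DataFrame like the one below, which is consistent with the output shown above but is not quoted in this thread:

        >>> import pandas as pd
        >>> df = pd.DataFrame({"color": ["red", None, "red", "blue", "blue", "red"]})
        >>> df.groupby("color", dropna=False).ngroup()
        0    1
        1    2
        2    1
        3    0
        4    0
        5    1
        dtype: int64

With ascending=False these labels are reversed, which is how the NA group ends up with label 0 in the restored example above.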
@NickCrews (Contributor Author) commented Jan 2, 2023

@datapythonista @mroeschke @rhshadrach this is ready for re-review! Sorry for the tag if this is already in your queue :)

That failing test looks unrelated to this change.

@datapythonista (Member) left a comment

lgtm, no strong opinion about the ascending example; from my side it's fine both with and without it. Thanks @NickCrews

I chose to use `dropna=False` because I wanted to show that NA keys are placed BEFORE other keys. I figured the `dropna=True` example was obvious enough from this and I didn't need that one as well; otherwise I thought things got very verbose.
@mroeschke (Member) left a comment

LGTM merge when ready @rhshadrach

@mroeschke added this to the 2.0 milestone on Jan 3, 2023
@rhshadrach (Member) left a comment

lgtm

@rhshadrach merged commit 58a0b6f into pandas-dev:main on Jan 3, 2023
@rhshadrach (Member)

Thanks @NickCrews

@NickCrews deleted the docs-nan-ngroup branch on January 4, 2023 at 02:45