-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: Improve groupby().ngroup() explanation for missing groups #50049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/groupby.py
Outdated
@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({"A": list("aaabba")}) | |||
>>> df = pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could you combine the DataFrame construction in 1 line (construct the DataFrame from dictionary?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NickCrews, nice improvement. Added couple of comments. Sorry for the late review, I hope you still have time to address comments, would be great to see this merged.
pandas/core/groupby/groupby.py
Outdated
If a group would be excluded (due to null keys) then that | ||
group is labeled as np.nan. See examples below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be simpler to understand if we just simply say something like Groups where the key is missing are skipped from the count, and their value will be 'NaN'
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer NA as opposed to "missing". There are various reasons a value can be NA, missing is but one of them. Also recommend using NA over NaN or null as it matches dropna
and pd.NA
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me. But in this case if we say NA
it may sound like pandas.NA
, and if the value is None
orNaN
it will also be skipped. So, missing seemed more appropriate to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not available
instead of NA
is more generic than missing, and doesn't sound like we are talking about pandas.NA
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - that was poorly written. My main reason is to connect it with the dropna
argument. I threw in pd.NA
as an indicator that we use NA with somewhat good consistency. E.g. dropna, pd.NA, isna, use_inf_as_na, fillna, na_filter. I think we should maintain this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted to "Groups with missing keys (where pd.isna()
is True) will be labeled with NaN
and will be skipped from the count." Definitely this is the right direction, I'm still just not sure about the "missing" language
I agree with
But in this case if we say NA it may sound like pandas.NA, and if the value is None orNaN it will also be skipped.
so I avoided saying simply "groups with NA keys...". I think my phrasing is maybe a little longer, but extremely precise and fairly easy to understand. Another option I would also be satisfied with is "Groups with NA keys (where pd.isna()
is True)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this looks great to me.
pandas/core/groupby/groupby.py
Outdated
@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({"A": list("aaabba")}) | |||
>>> df = pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Matt, also I think the examples can be simplified. I think having just a single column with the None
value is enough, no need to show without the None
first, and with it later.
Also, unrelated to your changes, but I'd remove df.groupby(["A", [1,1,2,3,2,1]]).ngroup()
. This is a very specific case, not directly related to ngroup
, and grouping with provided values is not even shown in the examples of groupby
, so it makes less sense to show it here. I think it mostly adds confusion to this docstring.
And also, I think it's better for users if the examples are a bit more meaningful than random letters. I think using something like df = pd.DataFrame({'color': ['red', 'red', 'green', None, 'red', 'green']})
(or with animals, jobs, whatever... would make the example a bit easier to understand. Just as an idea, if you want to keep this PR only for the example you're trying to illustrate, that's surely fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with all of these suggestions. fixup PR incoming.
Per comments at pandas-dev#50049
4 1.0 | ||
5 0.0 | ||
dtype: float64 | ||
>>> df.groupby("color", dropna=False).ngroup() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped the ascending
example, I figured that was obvious enough. Think I should keep it in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we typically demonstrate all the features of a method (ignoring methods with a large amount of arguments), regardless of how obvious they might seem. I think it should be retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I restored
>>> df.groupby("color", dropna=False).ngroup(ascending=False)
0 1
1 0
2 1
3 2
4 2
5 1
dtype: int64
I chose to use dropna=False
because I wanted to show
that NA keys are placed BEFORE other keys.
I figured the dropna=True
example was obvious enough from this and
I didn't need that one as well.
Now I guess since the groups are lexicographically sorted, and we are using "red" and "blue" instead of "a" and "b", the ngroup labels have swapped order. I think therefore that this should be deterministic and not flaky.
@datapythonista @mroeschke @rhshadrach this is ready for re-review! Sorry for the tag if this is already in your queue :) That failing test looks unrelated to this change. |
4 1.0 | ||
5 0.0 | ||
dtype: float64 | ||
>>> df.groupby("color", dropna=False).ngroup() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we typically demonstrate all the features of a method (ignoring methods with a large amount of arguments), regardless of how obvious they might seem. I think it should be retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, not strong opinion about the ascending example, from my side is fine both with and without it. Thanks @NickCrews
I chose to use `dropna=False` because I wanted to show that NA keys are placed BEFORE other keys. I figured the `dropna=True` example was obvious enough from this and I didn't need that one as well, otherwise I thought things got very verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM merge when ready @rhshadrach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @NickCrews |
The exact prose and examples are open to improvements if you have ideas.
I figure this didn't require a separate issue, please let me know if I'm missing some step in the process.