-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix Bug with NA value in Grouping for Groupby.nth #26152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26152 +/- ##
==========================================
- Coverage 91.97% 91.96% -0.01%
==========================================
Files 175 175
Lines 52368 52371 +3
==========================================
- Hits 48164 48163 -1
- Misses 4204 4208 +4
Continue to review full report at Codecov.
|
pandas/core/groupby/groupby.py
Outdated
def nth(self, n, dropna=None): | ||
def nth(self, | ||
n: Union[int, List[int]], | ||
dropna: Optional[Union[bool, str]] = None) -> DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options[str] here
pandas/core/groupby/groupby.py
Outdated
@@ -1546,15 +1547,16 @@ def backfill(self, limit=None): | |||
|
|||
@Substitution(name='groupby') | |||
@Substitution(see_also=_common_see_also) | |||
def nth(self, n, dropna=None): | |||
def nth(self, | |||
n: Union[int, Collection[int]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could arguably go list-like here if we wanted to try and add to _typing but didn't want to stray too far.
The documentation mentions only supporting lists, though the actual implementation makes accommodations for lists, sets or tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so generally we would use Sequence then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking that but the isinstance check in the method body also includes set, which wouldn't qualify as a Sequence.
I reverted it to List to keep it simple since Collection isn't available from typing until 3.6 and that is what the documentation shows anyway
@WillAyd has a failure in the checks (not the numpy_dev issue) |
pandas/core/groupby/groupby.py
Outdated
@@ -1636,11 +1637,13 @@ def nth(self, n, dropna=None): | |||
-nth_values) | |||
mask = mask_left | mask_right | |||
|
|||
ids, _, _ = self.grouper.group_info | |||
mask = mask & (ids != -1) # Drop NA values in grouping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put the comment on the prior line
pandas/core/groupby/groupby.py
Outdated
@@ -1665,6 +1668,7 @@ def nth(self, n, dropna=None): | |||
|
|||
# old behaviour, but with all and any support for DataFrames. | |||
# modified in GH 7559 to have better perf | |||
n = cast(int, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really really avoid the need to do this, you can instead assign to a new variable that is the correct type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. So in your mind would cast only be suitable for expressions without assignment or should we avoid altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really avoid having to cast at all; this is just plain confusing as its not 'code' but for annotation purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored the branches to appease the type checker without cast but also arguably make the code more clear. Makes the diff a little larger than originally but I think for the better
mask_right = np.in1d(self._cumcount_array(ascending=False) + 1, | ||
-nth_values) | ||
-nth_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note I had to change this because nth_values
was inferred as List[int]
upon initial assignment. This prevents reassignment from potentially obfuscating the type checker and code intent
nth_values = [n] | ||
elif isinstance(n, (set, list, tuple)): | ||
nth_values = list(set(n)) | ||
if dropna is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of explicitly checking if dropna is not None
this condition was refactored to sit in a branch that follows the implicit condition of if dropna
.
There is however a slight behavior difference between this PR and master, where now these are equivalent:
>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar'])
>>> df.groupby('foo').nth([0], dropna=None)
bar
foo
0 1
>>> df.groupby('foo').nth([0], dropna=False)
bar
foo
0 1
Whereas on master these would not yield the same thing:
>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar'])
>>> df.groupby('foo').nth([0], dropna=None)
bar
foo
0 1
>>> df.groupby('foo').nth([0], dropna=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1626, in nth
"dropna option with a list of nth values is not supported")
ValueError: dropna option with a list of nth values is not supported
I think the new behavior is more consistent and preferred. Might be worth a whatsnew
looks reasonable but i need to look again |
pandas/core/groupby/groupby.py
Outdated
# a column that is not in the current object) | ||
axis = self.grouper.axis | ||
grouper = axis[axis.isin(dropped.index)] | ||
|
||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you de-dent this here and remove the else which is unecessary; add a comment though of what this branch is
thanks @WillAyd |
git diff upstream/master -u -- "*.py" | flake8 --diff
Would ideally like to combine first, nth and last implementations. Consider this a precursor