Fix Bug with NA value in Grouping for Groupby.nth #26152

WillAyd · 2019-04-19T16:54:38Z

closes Groupby('colname').nth(0) results in index with duplicate entries if 'colname' has a np.NaN value #26011
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Would ideally like to combine first, nth and last implementations. Consider this a precursor

codecov · 2019-04-19T18:49:24Z

Codecov Report

Merging #26152 into master will decrease coverage by <.01%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master   #26152      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.01%     
==========================================
  Files         175      175              
  Lines       52368    52371       +3     
==========================================
- Hits        48164    48163       -1     
- Misses       4204     4208       +4

Flag	Coverage Δ
#multiple	`90.52% <93.75%> (ø)`	⬆️
#single	`40.69% <12.5%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`97.24% <93.75%> (+0.01%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9feb3ad...c2a0b8e.

jreback · 2019-04-19T19:51:13Z

pandas/core/groupby/groupby.py

-    def nth(self, n, dropna=None):
+    def nth(self,
+            n: Union[int, List[int]],
+            dropna: Optional[Union[bool, str]] = None) -> DataFrame:


Optional[str] here

WillAyd · 2019-04-19T20:21:50Z

pandas/core/groupby/groupby.py

@@ -1546,15 +1547,16 @@ def backfill(self, limit=None):

    @Substitution(name='groupby')
    @Substitution(see_also=_common_see_also)
-    def nth(self, n, dropna=None):
+    def nth(self,
+            n: Union[int, Collection[int]],


Could arguably go list-like here if we wanted to try and add to _typing but didn't want to stray too far.

The documentation mentions only supporting lists, though the actual implementation makes accommodations for lists, sets or tuples

so generally we would use Sequence then

Was thinking that but the isinstance check in the method body also includes set, which wouldn't qualify as a Sequence.

I reverted it to List to keep it simple since Collection isn't available from typing until 3.6 and that is what the documentation shows anyway
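
The point about set not qualifying as a Sequence can be checked against the abstract base classes in collections.abc. The following is a minimal sketch, not pandas code: the helper accepts_n and the constant VALID_CONTAINERS are hypothetical names that mimic the isinstance check described above.

```python
from collections.abc import Collection, Sequence

# Container types the thread says the method body accepts.
VALID_CONTAINERS = (set, list, tuple)


def accepts_n(n) -> bool:
    """Hypothetical stand-in for the runtime check on ``n``."""
    return isinstance(n, (int,) + VALID_CONTAINERS)


for value in ([0, 1], (0, 1), {0, 1}):
    # Columns: type name, is a Sequence, is a Collection, passes the check
    print(type(value).__name__,
          isinstance(value, Sequence),
          isinstance(value, Collection),
          accepts_n(value))
```

A set passes the runtime check and is a Collection, but it is not a Sequence, so a Sequence[int] annotation would be narrower than what the method accepts.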

jreback · 2019-04-20T14:40:24Z

@WillAyd has a failure in the checks (not the numpy_dev issue)

jreback · 2019-04-20T14:41:00Z

pandas/core/groupby/groupby.py

@@ -1636,11 +1637,13 @@ def nth(self, n, dropna=None):
                                 -nth_values)
            mask = mask_left | mask_right

+            ids, _, _ = self.grouper.group_info
+            mask = mask & (ids != -1)  # Drop NA values in grouping


can you put the comment on the prior line
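
To illustrate what the added mask line does, here is a small NumPy sketch with made-up data. It assumes, as the diff implies, that rows whose grouping key is NA carry a group id of -1; the arrays below are illustrative and do not come from the pandas internals.

```python
import numpy as np

# Assumed example data: one group id per row, -1 marks an NA grouping key.
ids = np.array([0, -1, 0, 1, -1, 1])

# Assumed positional mask that selects the first row of each run of ids,
# including the first of the rows whose key is NA.
mask = np.array([True, True, False, True, False, False])

# Drop NA values in grouping
mask = mask & (ids != -1)

print(mask.tolist())  # [True, False, False, True, False, False]
```

Only rows that belong to a real group survive, so the NA rows can no longer contribute entries to the result.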

jreback · 2019-04-20T14:42:08Z

pandas/core/groupby/groupby.py

@@ -1665,6 +1668,7 @@ def nth(self, n, dropna=None):

        # old behaviour, but with all and any support for DataFrames.
        # modified in GH 7559 to have better perf
+        n = cast(int, n)


I would really really avoid the need to do this; you can instead assign to a new variable that is the correct type

No problem. So in your mind would cast only be suitable for expressions without assignment or should we avoid altogether?

I would really avoid having to cast at all; this is just plain confusing as it's not 'code' but for annotation purposes.

I've refactored the branches to appease the type checker without cast but also arguably make the code more clear. Makes the diff a little larger than originally but I think for the better
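
One way to satisfy a type checker without typing.cast, which returns its argument unchanged at runtime, is to narrow the union with isinstance and assign the result to a new name. The sketch below is an assumption about the general technique, not the refactor that landed in the PR; normalize_nth is a hypothetical helper.

```python
from typing import List, Union


def normalize_nth(n: Union[int, List[int]]) -> List[int]:
    """Hypothetical helper: return the requested positions as a list."""
    if isinstance(n, int):
        # Inside this branch a type checker treats ``n`` as ``int``.
        nth_values = [n]
    else:
        # Here ``n`` is narrowed to ``List[int]``; de-duplicate it.
        nth_values = sorted(set(n))
    return nth_values


print(normalize_nth(2))          # [2]
print(normalize_nth([1, 0, 1]))  # [0, 1]
```

Each branch assigns a List[int] to nth_values, so no cast is needed and the annotation-only call disappears from the method body.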

WillAyd · 2019-04-22T21:33:49Z

pandas/core/groupby/groupby.py

            mask_right = np.in1d(self._cumcount_array(ascending=False) + 1,
-                                 -nth_values)
+                                 -nth_array)


Note I had to change this because nth_values was inferred as List[int] upon initial assignment. This prevents reassignment from potentially obfuscating the type checker and code intent
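
A minimal sketch of the idea, using hand-written arrays in place of the real _cumcount_array results (the data and group sizes are assumptions). It uses np.isin, the NumPy function that supersedes the np.in1d call in the diff. Keeping the list and the array under separate names means each variable has one type throughout.

```python
from typing import List

import numpy as np

# Assumed example: five rows, already ordered as two groups of sizes 3 and 2.
# Position of each row counted from the start of its group ...
cumcount = np.array([0, 1, 2, 0, 1])
# ... and counted from the end of its group, starting at 1.
cumcount_from_end = np.array([3, 2, 1, 2, 1])

nth_values: List[int] = [0, -1]  # first and last row of each group
# A separate name for the array keeps ``nth_values`` a ``List[int]``.
nth_array = np.array(nth_values, dtype=np.intp)

mask_left = np.isin(cumcount, nth_array)             # matches n >= 0
mask_right = np.isin(cumcount_from_end, -nth_array)  # matches n < 0
mask = mask_left | mask_right

print(mask.tolist())  # [True, False, True, True, True]
```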

WillAyd · 2019-04-22T21:38:26Z

pandas/core/groupby/groupby.py

-            nth_values = [n]
-        elif isinstance(n, (set, list, tuple)):
-            nth_values = list(set(n))
-            if dropna is not None:


Instead of explicitly checking if dropna is not None this condition was refactored to sit in a branch that follows the implicit condition of if dropna.

There is however a slight behavior difference between this PR and master, where now these are equivalent:

>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar'])
>>> df.groupby('foo').nth([0], dropna=None)
     bar
foo
0      1
>>> df.groupby('foo').nth([0], dropna=False)
     bar
foo
0      1

Whereas on master these would not yield the same thing:

>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar'])
>>> df.groupby('foo').nth([0], dropna=None)
     bar
foo
0      1
>>> df.groupby('foo').nth([0], dropna=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1626, in nth
    "dropna option with a list of nth values is not supported")
ValueError: dropna option with a list of nth values is not supported

I think the new behavior is more consistent and preferred. Might be worth a whatsnew
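
The difference comes down to an identity check against None versus a truthiness check. The sketch below is a simplified illustration under that assumption, not the pandas implementation; both function names are hypothetical and the return strings only label which path was taken.

```python
from typing import List, Optional, Union

Dropna = Optional[Union[bool, str]]
MESSAGE = "dropna option with a list of nth values is not supported"


def path_on_master(n: Union[int, List[int]], dropna: Dropna) -> str:
    """Sketch of the old check: anything other than None is rejected."""
    if isinstance(n, list) and dropna is not None:
        raise ValueError(MESSAGE)
    return "positional"


def path_in_pr(n: Union[int, List[int]], dropna: Dropna) -> str:
    """Sketch of the refactored check: None and False share one branch."""
    if not dropna:
        return "positional"
    if isinstance(n, list):
        raise ValueError(MESSAGE)
    return "dropna"


print(path_in_pr([0], None), path_in_pr([0], False))
```

Under the truthiness check, dropna=False falls through to the same branch as dropna=None, which matches the equivalence shown in the first session above.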

jreback · 2019-04-26T01:28:34Z

looks reasonable but i need to look again

jreback · 2019-04-28T17:14:30Z

pandas/core/groupby/groupby.py

-            # a column that is not in the current object)
-            axis = self.grouper.axis
-            grouper = axis[axis.isin(dropped.index)]
-
        else:


can you de-dent this here and remove the else which is unnecessary; add a comment though of what this branch is
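
The requested shape is the usual early-return pattern: once the first branch returns, the remaining code can sit at function level with a comment saying which case it handles. A generic sketch, assuming the preceding branch does return; select_path is a hypothetical function, not the pandas method.

```python
from typing import Optional


def select_path(dropna: Optional[str]) -> str:
    # dropna is unset or falsy: handle the simple case and return early
    if not dropna:
        return "positional selection"

    # dropna is truthy: no ``else`` is needed because the branch above returned
    return "selection after dropping NA rows (how={!r})".format(dropna)


print(select_path(None))
print(select_path("any"))
```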

jreback · 2019-05-05T22:20:05Z

thanks @WillAyd

WillAyd added 5 commits April 17, 2019 14:06

Added failing test

e62a14f

Merge remote-tracking branch 'upstream/master' into groupby-nth-nan

e113b55

Added fix in code

823225b

Expanded test coverage scope

068936f

Whatsnew note

8716764

WillAyd added the Groupby label Apr 19, 2019

WillAyd added this to the 0.25.0 milestone Apr 19, 2019

Fixed implementation

3000d2b

jreback requested changes Apr 19, 2019

WillAyd added 2 commits April 19, 2019 13:02

Typing and doc fixups

84ddd6d

Fixed dropna annotation

84d343a

WillAyd commented Apr 19, 2019

WillAyd added 2 commits April 19, 2019 13:23

Replaced Collection reference with List in annotation

6f40fbe

Fixed cast expression

e2d006d

jreback requested changes Apr 20, 2019

WillAyd added 4 commits April 22, 2019 13:48

Merge remote-tracking branch 'upstream/master' into groupby-nth-nan

69014a7

Moved comment position

0b7eb6c

Removed re-assignment of nth_values for typing

48d90ee

typing fixup

7eea5e3

WillAyd commented Apr 22, 2019

Merge remote-tracking branch 'upstream/master' into groupby-nth-nan

4dec450

jreback requested changes Apr 28, 2019

WillAyd added 2 commits April 28, 2019 18:27

Merge remote-tracking branch 'upstream/master' into groupby-nth-nan

ce0abcd

dedented and added comment

c2a0b8e

jreback approved these changes May 5, 2019

jreback merged commit 7120725 into pandas-dev:master May 5, 2019

WillAyd deleted the groupby-nth-nan branch January 16, 2020 00:34
