Skip to content

Fix Bug with NA value in Grouping for Groupby.nth #26152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 5, 2019

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Apr 19, 2019

Would ideally like to combine first, nth and last implementations. Consider this a precursor

@WillAyd WillAyd added this to the 0.25.0 milestone Apr 19, 2019
@codecov
Copy link

codecov bot commented Apr 19, 2019

Codecov Report

Merging #26152 into master will decrease coverage by <.01%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26152      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.01%     
==========================================
  Files         175      175              
  Lines       52368    52371       +3     
==========================================
- Hits        48164    48163       -1     
- Misses       4204     4208       +4
Flag Coverage Δ
#multiple 90.52% <93.75%> (ø) ⬆️
#single 40.69% <12.5%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 97.24% <93.75%> (+0.01%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9feb3ad...c2a0b8e. Read the comment docs.

def nth(self, n, dropna=None):
def nth(self,
n: Union[int, List[int]],
dropna: Optional[Union[bool, str]] = None) -> DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options[str] here

@@ -1546,15 +1547,16 @@ def backfill(self, limit=None):

@Substitution(name='groupby')
@Substitution(see_also=_common_see_also)
def nth(self, n, dropna=None):
def nth(self,
n: Union[int, Collection[int]],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could arguably go list-like here if we wanted to try and add to _typing but didn't want to stray too far.

The documentation mentions only supporting lists, though the actual implementation makes accommodations for lists, sets or tuples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so generally we would use Sequence then

Copy link
Member Author

@WillAyd WillAyd Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking that but the isinstance check in the method body also includes set, which wouldn't qualify as a Sequence.

I reverted it to List to keep it simple since Collection isn't available from typing until 3.6 and that is what the documentation shows anyway

@jreback
Copy link
Contributor

jreback commented Apr 20, 2019

@WillAyd has a failure in the checks (not the numpy_dev issue)

@@ -1636,11 +1637,13 @@ def nth(self, n, dropna=None):
-nth_values)
mask = mask_left | mask_right

ids, _, _ = self.grouper.group_info
mask = mask & (ids != -1) # Drop NA values in grouping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put the comment on the prior line

@@ -1665,6 +1668,7 @@ def nth(self, n, dropna=None):

# old behaviour, but with all and any support for DataFrames.
# modified in GH 7559 to have better perf
n = cast(int, n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really really avoid the need to do this, you can instead assign to a new variable that is the correct type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. So in your mind would cast only be suitable for expressions without assignment or should we avoid altogether?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really avoid having to cast at all; this is just plain confusing as its not 'code' but for annotation purposes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the branches to appease the type checker without cast but also arguably make the code more clear. Makes the diff a little larger than originally but I think for the better

mask_right = np.in1d(self._cumcount_array(ascending=False) + 1,
-nth_values)
-nth_array)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note I had to change this because nth_values was inferred as List[int] upon initial assignment. This prevents reassignment from potentially obfuscating the type checker and code intent

nth_values = [n]
elif isinstance(n, (set, list, tuple)):
nth_values = list(set(n))
if dropna is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of explicitly checking if dropna is not None this condition was refactored to sit in a branch that follows the implicit condition of if dropna.

There is however a slight behavior difference between this PR and master, where now these are equivalent:

>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar'])
>>> df.groupby('foo').nth([0], dropna=None)
     bar
foo
0      1
>>> df.groupby('foo').nth([0], dropna=False)
     bar
foo
0      1

Whereas on master these would not yield the same thing:

>>> df = pd.DataFrame([[0, 1], [0, 2]], columns=['foo', 'bar'])
>>> df.groupby('foo').nth([0], dropna=None)
     bar
foo
0      1
>>> df.groupby('foo').nth([0], dropna=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1626, in nth
    "dropna option with a list of nth values is not supported")
ValueError: dropna option with a list of nth values is not supported

I think the new behavior is more consistent and preferred. Might be worth a whatsnew

@jreback
Copy link
Contributor

jreback commented Apr 26, 2019

looks reasonable but i need to look again

# a column that is not in the current object)
axis = self.grouper.axis
grouper = axis[axis.isin(dropped.index)]

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you de-dent this here and remove the else which is unecessary; add a comment though of what this branch is

@jreback jreback merged commit 7120725 into pandas-dev:master May 5, 2019
@jreback
Copy link
Contributor

jreback commented May 5, 2019

thanks @WillAyd

@WillAyd WillAyd deleted the groupby-nth-nan branch January 16, 2020 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby('colname').nth(0) results in index with duplicate entries if 'colname' has a np.NaN value
2 participants