API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14336

Closed
xflr6 opened this issue Oct 3, 2016 · 5 comments
@xflr6
Contributor

xflr6 commented Oct 3, 2016

When exploring a data set, I often need to call df.apply(pd.Series.nunique) or df.apply(lambda x: x.nunique()). How about adding this as a nunique() method, parallel to DataFrame.count()? (count and unique are also the two most basic pieces of information displayed by DataFrame.describe().)

I think there are also use cases for this as a groupby method, for example when checking whether a candidate primary key maps to differing rows (values):

>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam'], 'value': [1, 5, 5, 2]})
>>> df.groupby('id').filter(lambda g: (g.apply(pd.Series.nunique) > 1).any())
     id  value
0  spam      1
3  spam      2
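
For comparison, a minimal sketch of the workaround next to the method this issue proposes. It assumes a pandas release that already ships DataFrame.nunique(); the variable names via_apply and via_method are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam'], 'value': [1, 5, 5, 2]})

# Workaround from the comment above: apply Series.nunique to every column.
via_apply = df.apply(pd.Series.nunique)

# Proposed method (assumes a pandas version that includes DataFrame.nunique).
via_method = df.nunique()

# Both give one distinct-value count per column.
print(via_method)
```

Like DataFrame.count(), the result is a Series indexed by column name, so it can be compared against df.count() directly when looking for candidate keys.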
@shoyer
Member

shoyer commented Oct 3, 2016

Agreed, I think this would be welcome functionality.

@jreback
Contributor

jreback commented Oct 3, 2016

Note that these are already defined for Series.

In [9]: df.groupby('id').value.nunique()
Out[9]: 
id
eggs    1
spam    2
Name: value, dtype: int64

In [10]: df.groupby('id').value.unique()
Out[10]: 
id
eggs       [5]
spam    [1, 2]
Name: value, dtype: object
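
To illustrate the gap between the Series-level method and what the issue asks for, here is a rough sketch that repeats the per-column call by hand. The names grouped, value_columns and by_column are made up for this example, and the concat-based loop is just one possible workaround, not necessarily how pandas implements it.

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam'], 'value': [1, 5, 5, 2]})
grouped = df.groupby('id')

# Series-level method, as shown in the comment above: one column at a time.
per_group = grouped['value'].nunique()

# Doing the same for every non-key column means looping manually;
# this is the step a DataFrameGroupBy.nunique() would replace.
value_columns = [c for c in df.columns if c != 'id']
by_column = pd.concat({c: grouped[c].nunique() for c in value_columns}, axis=1)

print(by_column)
```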

jreback added this to the Next Major Release milestone Oct 3, 2016
@xflr6
Contributor Author

xflr6 commented Oct 4, 2016

Of course, extending the groupby-example:

>>> df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam', 'ham', 'ham'],
...                    'value1': [1, 5, 5, 2, 5, 5], 'value2': list('abbaxy')})
>>> df
     id  value1 value2
0  spam       1      a
1  eggs       5      b
2  eggs       5      b
3  spam       2      a
4   ham       5      x
5   ham       5      y
>>> df.groupby('id').filter(lambda g: (g.apply(pd.Series.nunique) > 1).any())
     id  value1 value2
0  spam       1      a
3  spam       2      a
4   ham       5      x
5   ham       5      y

jreback modified the milestones: 0.20.0, Next Major Release Jan 2, 2017
@mahnunchik

Any news?

@jreback
Contributor

jreback commented Jan 23, 2017

just merged.
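
With the methods from the issue title available, the filter from the extended example above could be sketched roughly as follows. This assumes a pandas version that includes DataFrame.nunique() and DataFrameGroupBy.nunique(). The value columns are selected explicitly as a precaution, so the result does not depend on how a given version treats the grouping column. The names counts, ambiguous, result and filtered are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam', 'ham', 'ham'],
                   'value1': [1, 5, 5, 2, 5, 5], 'value2': list('abbaxy')})

# Number of distinct values per id and per value column.
counts = df.groupby('id')[['value1', 'value2']].nunique()

# ids whose rows disagree in at least one column.
ambiguous = counts.index[(counts > 1).any(axis=1)]
result = df[df['id'].isin(ambiguous)]

# The original filter, with DataFrame.nunique() replacing the apply() call.
filtered = df.groupby('id').filter(lambda g: (g.nunique() > 1).any())

print(result)
```

Both variants should select the same spam and ham rows as the apply-based filter; the counts table has the side benefit of showing which column causes the conflict.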

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#14336

Author: Sebastian Bank

Closes pandas-dev#14376 from xflr6/nunique and squashes the following commits:

a0558e7 [Sebastian Bank] use apply()-kwargs instead of partial, more tests, better examples
c8d3ac4 [Sebastian Bank] extend docs and tests
fd0f22d [Sebastian Bank] add simple benchmarks
5c4b325 [Sebastian Bank] API: add DataFrame.nunique() and DataFrameGroupBy.nunique()