ENH: DataFrameGroupby.value_counts #43564

Closed
2 of 3 tasks
corriebar opened this issue Sep 14, 2021 · 4 comments · Fixed by #44267
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Enhancement, Groupby

@corriebar
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({'gender': ['female', 'male', 'female', 'male', 'female', 'male'], 
                  'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
                  'country': ['US', 'FR', 'US', 'FR', 'FR', 'US']})
# These all work fine:
# value_counts() on a DataFrame
df[['gender', 'education']].value_counts(normalize=True)
df['gender'].value_counts(normalize=True) # on a Series
df.groupby('country')['gender'].value_counts(normalize=True) # on a SeriesGroupby

# but fails on DataFrameGroupBy
df.groupby('country')[['gender', 'education']].value_counts(normalize=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/xz/_nt7tl1s0jx2f5_8rwc8kc0r0000gp/T/ipykernel_96325/3153685003.py in <module>
----> 1 df.groupby('country')[['gender', 'education']].value_counts(normalize=True)

~/opt/anaconda3/envs/pandas_dev/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in __getattr__(self, attr)
    909             return self[attr]
    910 
--> 911         raise AttributeError(
    912             f"'{type(self).__name__}' object has no attribute '{attr}'"
    913         )

AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'

Issue Description

Since value_counts() is defined both for a Series and a DataFrame, I also expect it to work on both a SeriesGroupBy and a DataFrameGroupBy.

There are some related issues, #39938 and #6540. These have been dismissed so far with the argument that size() already does that, but size() alone is not enough to get the proportions per group.
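To illustrate the point about size(), here is a minimal sketch using the same data as the reproducible example. It shows that size() returns raw counts per combination, and that a second step (dividing by each country's row count) is needed to get proportions. The variable names `counts`, `group_totals` and `proportions` are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "education": ["low", "medium", "high", "low", "high", "low"],
    "country": ["US", "FR", "US", "FR", "FR", "US"],
})

# size() gives the raw count of each (country, gender, education) combination.
counts = df.groupby(["country", "gender", "education"]).size()

# Proportions need a second step: divide by the number of rows per country,
# broadcasting the per-country totals across the "country" index level.
group_totals = df.groupby("country").size()
proportions = counts.div(group_totals, level="country")

print(counts)
print(proportions)
```

The extra division is the part that a built-in `normalize=True` would hide from the caller.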

Expected Behavior

I managed to get the desired behaviour by applying value_counts() to each group:

def compute_proportions(df, var):
    return (df[var]
        .value_counts(normalize=True)
    )
(df.groupby('country')
  .apply(compute_proportions, var=['gender', 'education'])
)
country  gender  education
FR       female  high         0.333333
         male    low          0.333333
                 medium       0.333333
US       male    low          0.666667
         female  high         0.333333
dtype: float64
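For readers on a later release: this issue was closed by #44267, so the direct call should work there. The sketch below assumes pandas 1.4 or newer and compares the built-in call against a per-group computation stitched together with `pd.concat`. The names `builtin`, `per_group` and `results_match` are illustrative, and the comparison rounds values to sidestep floating-point noise.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "education": ["low", "medium", "high", "low", "high", "low"],
    "country": ["US", "FR", "US", "FR", "FR", "US"],
})
columns = ["gender", "education"]

# Assumption: pandas >= 1.4, where DataFrameGroupBy.value_counts is available.
builtin = df.groupby("country")[columns].value_counts(normalize=True)

# Per-group equivalent: DataFrame.value_counts on each group, keyed by country.
per_group = pd.concat(
    {
        country: group[columns].value_counts(normalize=True)
        for country, group in df.groupby("country")
    },
    names=["country"],
)

# Compare as {(country, gender, education): proportion} mappings so that
# row order and Series names do not affect the result.
results_match = (
    {key: round(value, 6) for key, value in builtin.items()}
    == {key: round(value, 6) for key, value in per_group.items()}
)
print(results_match)
```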

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.27.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@corriebar added the Bug and Needs Triage labels Sep 14, 2021
@MarcoGorelli changed the title BUG: ENH: DataFrameGroupby.value_counts Sep 14, 2021
@MarcoGorelli added the Enhancement label and removed the Bug and Needs Triage labels Sep 14, 2021
@MarcoGorelli
Member

Hi @corriebar

I think this is a good request, thanks for including clear input + output + workaround

@rhshadrach any thoughts on this?

@rhshadrach
Member

Thanks for reporting this, and agree with @MarcoGorelli - well written.

There are some related issues, #39938 and #6540. These have been dismissed so far with the argument that size() already does that, but size() alone is not enough to get the proportions per group.

#6540 was asking for counts of the grouping columns, rather than the rest of the columns in the DataFrame. This is very much different from the request here. For #39938, while size was mentioned in the comments, this issue was closed because it lacked a reproducible example demonstrating the request.

In general I agree with having a consistent API between Series / DataFrame / SeriesGroupBy / DataFrameGroupBy when the operation makes sense, and that seems to me to be the case here. The oddity is that value_counts doesn't fit into one of the agg / transform / filter buckets that most groupby ops do.

The index should be the groups for agg, the original index for transform, and a subset of the original index for a filter - but because value_counts doesn't fit into one of these, I think it's okay that the resulting index be the grouping columns combined with the other columns in the DataFrame. This then agrees with Series.value_counts, DataFrame.value_counts and SeriesGroupBy.value_counts, so the proposed result looks good to me.
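The index shapes described above can be seen in a small sketch. It uses the existing `SeriesGroupBy` methods on a reduced version of the example data; the filter condition (keep countries with at least two male rows) is made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "country": ["US", "FR", "US", "FR", "FR", "US"],
})
grouped = df.groupby("country")["gender"]

# agg: one row per group, indexed by the group labels.
agg_result = grouped.count()

# transform: same length and index as the original data.
transform_result = grouped.transform("count")

# filter: a subset of the original rows, keeping their original index.
filter_result = df.groupby("country").filter(
    lambda group: (group["gender"] == "male").sum() >= 2
)

# value_counts: grouping column combined with the counted column.
value_counts_result = grouped.value_counts()

print(agg_result.index)
print(transform_result.index)
print(filter_result.index)
print(value_counts_result.index.names)
```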

Looking at the code for SeriesGroupBy.value_counts, it seems like an implementation for DataFrameGroupBy would be non-trivial. Here is a naive attempt to use size that seems to perform well when compared to the SeriesGroupBy variant, but I'm guessing it will fail on various edge cases.

def gb_value_counts(df, keys, subset=None, normalize=False, sort=True, ascending=False, dropna=True):
    # Count every non-key column unless the caller restricts them via `subset`.
    remaining_columns = list(df.columns.difference(keys))
    if subset is not None:
        # Keep only the non-key columns that were requested.
        remaining_columns = [col for col in remaining_columns if col in subset]
    if dropna:
        df = df.dropna(subset=remaining_columns, axis='index', how='any')
    result = df.groupby(keys + remaining_columns).size()
    if normalize:
        # Divide each count by the size of its group to get proportions.
        result /= df.groupby(keys).size()
    if sort:
        result = result.sort_values(ascending=ascending)
    return result

Timing:

import numpy as np
import pandas as pd

size = 100000
df = pd.DataFrame(
    {
        'a': np.random.randint(0, 10, size),
        'b': np.random.randint(0, 10, size),
    }
).set_index(['a'])

%timeit gb_value_counts(df, ['a'], normalize=False, dropna=True)  # 5.83 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['b'].groupby('a').value_counts()  # 9.44 ms ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
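The timing only compares speed, so a quick consistency check may be useful alongside it. The sketch below, on a smaller random frame with a fixed seed (both choices are assumptions for illustration), checks that counting via size() over the keys plus the remaining column yields the same counts as `SeriesGroupBy.value_counts`. It covers only the unnormalized, no-missing-values case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = 1_000
df = pd.DataFrame(
    {
        "a": rng.integers(0, 10, size),
        "b": rng.integers(0, 10, size),
    }
).set_index("a")

# size() over the key level plus the counted column.
via_size = df.groupby(["a", "b"]).size()

# The existing SeriesGroupBy implementation.
via_value_counts = df["b"].groupby("a").value_counts()

# Compare as {(a, b): count} mappings so ordering and names are ignored.
results_match = via_size.to_dict() == via_value_counts.to_dict()
print(results_match)
```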

@rhshadrach added this to the Contributions Welcome milestone Sep 15, 2021
@rhshadrach added the Algos and Groupby labels Sep 15, 2021
@johnzangwill
Contributor

take

@johnzangwill
Contributor

@rhshadrach I have implemented #44267 loosely based on your suggestion.
Please could you take a look at the implementation and tests to see if there are any cases that I have missed.
If possible I would like this in 1.4, so I need to start getting it reviewed...
Thanks!

5 participants