ENH: DataFrameGroupby.value_counts #43564

corriebar · 2021-09-14T10:09:23Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
                  'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
                  'country': ['US', 'FR', 'US', 'FR', 'FR', 'US']})
# These all work fine:
# value_counts() on a DataFrame
df[['gender', 'education']].value_counts(normalize=True)
df['gender'].value_counts(normalize=True) # on a Series
df.groupby('country')['gender'].value_counts(normalize=True) # on a SeriesGroupby

# but fails on DataFrameGroupBy
df.groupby('country')[['gender', 'education']].value_counts(normalize=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/xz/_nt7tl1s0jx2f5_8rwc8kc0r0000gp/T/ipykernel_96325/3153685003.py in <module>
----> 1 df.groupby('country')[['gender', 'education']].value_counts(normalize=True)

~/opt/anaconda3/envs/pandas_dev/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in __getattr__(self, attr)
    909             return self[attr]
    910 
--> 911         raise AttributeError(
    912             f"'{type(self).__name__}' object has no attribute '{attr}'"
    913         )

AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'

Issue Description

Since value_counts() is defined both for a Series and a DataFrame, I also expect it to work on both a SeriesGroupBy and a DataFrameGroupBy.

There are some related Issues: #39938 and #6540, these have been dismissed so far with the argument that size() already does that, but size() alone is not enough to get the proportions per group.
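
To make the gap concrete, here is a minimal sketch of what size() returns and the extra step needed to get per-group proportions. It assumes the example frame with values consistent with the output shown under Expected Behavior; the names counts, totals and proportions are illustrative only.

```python
import pandas as pd

# Example frame; values chosen to agree with the expected output in this report.
df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'US'],
})

# size() yields raw counts per (country, gender, education) combination.
counts = df.groupby(['country', 'gender', 'education']).size()

# Proportions need a second step: divide each count by its country's total.
totals = counts.groupby(level='country').transform('sum')
proportions = counts / totals

print(proportions)
```

So size() can be turned into proportions, but only with a manual normalization that value_counts(normalize=True) would do in one call.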

Expected Behavior

I managed to get the desired behaviour by applying value_counts() to each group:

def compute_proportions(df, var):
    return (df[var]
        .value_counts(normalize=True)
    )
(df.groupby('country')
  .apply(compute_proportions, var=['gender', 'education'])
)
country  gender  education
FR       female  high         0.333333
         male    low          0.333333
                 medium       0.333333
US       male    low          0.666667
         female  high         0.333333
dtype: float64

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.27.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


MarcoGorelli · 2021-09-14T12:10:27Z

Hi @corriebar

I think this is a good request, thanks for including clear input + output + workaround

@rhshadrach any thoughts on this?

rhshadrach · 2021-09-15T02:58:32Z

Thanks for reporting this, and agree with @MarcoGorelli - well written.

> There are some related Issues: #39938 and #6540, these have been dismissed so far with the argument that size() already does that, but size() alone is not enough to get the proportions per group.

#6540 was asking for counts of the grouping columns, rather than the rest of the columns in the DataFrame. This is very much different from the request here. For #39938, while size was mentioned in the comments, this issue was closed because it lacked a reproducible example demonstrating the request.

In general I agree with having a consistent API between Series / DataFrame / SeriesGroupBy / DataFrameGroupBy when the operation makes sense, and that seems to me to be the case here. The oddity is that value_counts doesn't fit into one of the agg / transform / filter buckets that most groupby ops do.

The index should be the groups for agg, the original index for transform, and a subset of the original index for a filter - but because value_counts doesn't fit into one of these, I think it's okay that the resulting index be the grouping columns combined with the other columns in the DataFrame. This then agrees with Series.value_counts, DataFrame.value_counts and SeriesGroupBy.value_counts, so the proposed result looks good to me.
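
The index shapes described above can be illustrated with a small sketch. The frame, the score column and the variable names are made up for illustration and are not taken from the issue.

```python
import pandas as pd

# Hypothetical toy frame: FR scores sum to 11, US scores sum to 10.
df = pd.DataFrame({
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'US'],
    'score': [1, 2, 3, 4, 5, 6],
})
gb = df.groupby('country')

# agg: one row per group, indexed by the group keys.
agg_index = gb['score'].sum().index

# transform: same index as the original frame.
transform_index = gb['score'].transform('sum').index

# filter: a subset of the original index (here only the FR rows survive).
filter_index = gb.filter(lambda g: g['score'].sum() > 10).index

# value_counts: group keys combined with the counted values.
vc_index = gb['score'].value_counts().index

print(list(agg_index), list(transform_index), list(filter_index), vc_index.nlevels)
```

The last case is the one that does not match any of the three buckets: its index has one level for the grouping key and one for the counted column.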

Looking at the code for SeriesGroupBy.value_counts, it seems like an implementation for DataFrameGroupBy would be non-trivial. Here is a naive attempt to use size that seems to perform well when compared to the SeriesGroupBy variant, but I'm guessing it will fail on various edge cases.

def gb_value_counts(df, keys, subset=None, normalize=False, sort=True, ascending=False, dropna=True):
    # Count every column that is not a grouping key.
    remaining_columns = list(df.columns.difference(keys))
    if subset is not None:
        # Restrict the counted columns to those named in subset.
        remaining_columns = [col for col in remaining_columns if col in subset]
    if dropna:
        # Drop rows with a missing value in any counted column.
        df = df.dropna(subset=remaining_columns, axis='index', how='any')
    result = df.groupby(keys + remaining_columns).size()
    if normalize:
        # Divide by the per-group totals to turn counts into proportions.
        result /= df.groupby(keys).size()
    if sort:
        result = result.sort_values(ascending=ascending)
    return result

Timing:

import numpy as np
import pandas as pd

size = 100000
df = pd.DataFrame(
    {
        'a': np.random.randint(0, 10, size),
        'b': np.random.randint(0, 10, size),
    }
).set_index(['a'])

%timeit gb_value_counts(df, ['a'], normalize=False, dropna=True)  # 5.83 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['b'].groupby('a').value_counts()  # 9.44 ms ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

johnzangwill · 2021-10-12T10:52:08Z

take

johnzangwill · 2021-11-11T09:00:47Z

@rhshadrach I have implemented #44267 loosely based on your suggestion.
Please could you take a look at the implementation and tests to see if there are any cases that I have missed.
If possible I would like this in 1.4, so I need to start getting it reviewed...
Thanks!

corriebar added the Bug and Needs Triage labels Sep 14, 2021

MarcoGorelli changed the title ~~BUG:~~ ENH: DataFrameGroupby.value_counts Sep 14, 2021

MarcoGorelli added the Enhancement label and removed the Bug and Needs Triage labels Sep 14, 2021

rhshadrach added this to the Contributions Welcome milestone Sep 15, 2021

rhshadrach added the Algos and Groupby labels Sep 15, 2021

github-actions bot assigned johnzangwill Oct 12, 2021

This was referenced Oct 31, 2021

DataFrameGroupby value_counts #44259

Closed

ENH: Add DataFrameGroupBy.value_counts #44267

Merged

jreback modified the milestones: Contributions Welcome, 1.4 Dec 17, 2021

jreback closed this as completed in #44267 Dec 19, 2021

simonjayhawkins mentioned this issue Jun 10, 2022

BUG: subset parameter of DataFrameGroupBy.value_counts has no effect #46383

Closed

3 tasks
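
For readers arriving after the fact: assuming pandas 1.4 or later, where #44267 landed according to the milestone recorded in this thread, the call from the original report can be sketched as follows. The frame uses values consistent with the expected output in the report. The name of the resulting Series differs between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

# Example frame; values chosen to agree with the expected output in the report.
df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'US'],
})

# Assumes pandas >= 1.4; earlier versions raise the AttributeError from the report.
result = df.groupby('country')[['gender', 'education']].value_counts(normalize=True)

print(result)
```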
