DOC: `value_counts` description doesn't match code logic #48635

maciejskorski · 2022-09-19T08:51:04Z

Pandas version checks

I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

dev docs

last release docs

Documentation problem

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

In reality, it only does groupby followed by size, no nunique:

pandas/pandas/core/frame.py

Lines 6468 to 6592 in ca60aab

    
               def value_counts( 
        
                   self, 
        
                   subset: Sequence[Hashable] | None = None, 
        
                   normalize: bool = False, 
        
                   sort: bool = True, 
        
                   ascending: bool = False, 
        
                   dropna: bool = True, 
        
               ): 
        
                   """ 
        
                   Return a Series containing counts of unique rows in the DataFrame. 
        
                   .. versionadded:: 1.1.0 
        
                   Parameters 
        
                   ---------- 
        
                   subset : list-like, optional 
        
                       Columns to use when counting unique combinations. 
        
                   normalize : bool, default False 
        
                       Return proportions rather than frequencies. 
        
                   sort : bool, default True 
        
                       Sort by frequencies. 
        
                   ascending : bool, default False 
        
                       Sort in ascending order. 
        
                   dropna : bool, default True 
        
                       Don’t include counts of rows that contain NA values. 
        
                       .. versionadded:: 1.3.0 
        
                   Returns 
        
                   ------- 
        
                   Series 
        
                   See Also 
        
                   -------- 
        
                   Series.value_counts: Equivalent method on Series. 
        
                   Notes 
        
                   ----- 
        
                   The returned Series will have a MultiIndex with one level per input 
        
                   column. By default, rows that contain any NA values are omitted from 
        
                   the result. By default, the resulting Series will be in descending 
        
                   order so that the first element is the most frequently-occurring row. 
        
                   Examples 
        
                   -------- 
        
                   >>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6], 
        
                   ...                    'num_wings': [2, 0, 0, 0]}, 
        
                   ...                   index=['falcon', 'dog', 'cat', 'ant']) 
        
                   >>> df 
        
                           num_legs  num_wings 
        
                   falcon         2          2 
        
                   dog            4          0 
        
                   cat            4          0 
        
                   ant            6          0 
        
                   >>> df.value_counts() 
        
                   num_legs  num_wings 
        
                   4         0            2 
        
                   2         2            1 
        
                   6         0            1 
        
                   dtype: int64 
        
                   >>> df.value_counts(sort=False) 
        
                   num_legs  num_wings 
        
                   2         2            1 
        
                   4         0            2 
        
                   6         0            1 
        
                   dtype: int64 
        
                   >>> df.value_counts(ascending=True) 
        
                   num_legs  num_wings 
        
                   2         2            1 
        
                   6         0            1 
        
                   4         0            2 
        
                   dtype: int64 
        
                   >>> df.value_counts(normalize=True) 
        
                   num_legs  num_wings 
        
                   4         0            0.50 
        
                   2         2            0.25 
        
                   6         0            0.25 
        
                   dtype: float64 
        
                   With `dropna` set to `False` we can also count rows with NA values. 
        
                   >>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'], 
        
                   ...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']}) 
        
                   >>> df 
        
                     first_name middle_name 
        
                   0       John       Smith 
        
                   1       Anne        <NA> 
        
                   2       John        <NA> 
        
                   3       Beth      Louise 
        
                   >>> df.value_counts() 
        
                   first_name  middle_name 
        
                   Beth        Louise         1 
        
                   John        Smith          1 
        
                   dtype: int64 
        
                   >>> df.value_counts(dropna=False) 
        
                   first_name  middle_name 
        
                   Anne        NaN            1 
        
                   Beth        Louise         1 
        
                   John        Smith          1 
        
                               NaN            1 
        
                   dtype: int64 
        
                   """ 
        
                   if subset is None: 
        
                       subset = self.columns.tolist() 
        
                   counts = self.groupby(subset, dropna=dropna).grouper.size() 
        
                   if sort: 
        
                       counts = counts.sort_values(ascending=ascending) 
        
                   if normalize: 
        
                       counts /= counts.sum() 
        
                   # Force MultiIndex for single column 
        
                   if len(subset) == 1: 
        
                       counts.index = MultiIndex.from_arrays( 
        
                           [counts.index], names=[counts.index.name] 
        
                       ) 
        
                   return counts

Suggested fix for documentation

Align the doc string to match the code logic.

I suggest not to mention "unique rows" or "unique combinations" and maybe mention the equivalence to groupby+size with the optional normalization.

Will be happy to submit a PR for that, or share my two cents in the discussion.

The text was updated successfully, but these errors were encountered:

WooilKim · 2022-09-19T08:52:10Z

Let me work on this issue.

maciejskorski · 2022-09-19T08:59:22Z

Cool. I am reporting this as I have seen quite a few people confused by the behaviour. Let me know if I can be of any help (review, discuss).

WooilKim · 2022-09-19T09:02:20Z

@maciejskorski
Thanks.
I'll share the progress soon.

phofl · 2022-09-19T20:38:43Z

I don't understand the issue here. Why do you think that value_counts does not count unique rows? That this is done via groupby is an implementation detail that should not be visible in the docs

rhshadrach · 2022-09-19T20:40:27Z

Thanks for the report! I think we should stick to a description that doesn't rely on other parts of pandas (e.g. groupby) as much as possible. Otherwise it may be difficult for new users to understand what this method should do.

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

This is not how I read the docstring, but I can see that it can be interpreted that way! To me, Return a Series containing counts of unique rows in the DataFrame. means "For each unique row in the DataFrame, report the number of times it occurs".

What about something like:

Return a Series containing the number of times each unique row occurs in the DataFrame.

maciejskorski · 2022-09-19T21:52:58Z

Thanks for the report! I think we should stick to a description that doesn't rely on other parts of pandas (e.g. groupby) as much as possible. Otherwise it may be difficult for new users to understand what this method should do.

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

This is not how I read the docstring, but I can see that it can be interpreted that way! To me, Return a Series containing counts of unique rows in the DataFrame. means "For each unique row in the DataFrame, report the number of times it occurs".

What about something like:

Return a Series containing the number of times each unique row occurs in the DataFrame.

Better to my taste! This wording better separates “unique row” from “occurrences”.
Otherwise, it was too close to “unique occurrences” aka “count distinct”.

Maybe even "unique row -> unique row value". To emphasize counting of row values (tuples), rather than rows themselves (row can be seen unique merely by indexing). Also, row value is a well-known term (from SQL standard).

rhshadrach · 2022-09-20T03:31:40Z

Also, row value is a well-known term (from SQL standard).

But I don't believe it is used much in pandas - the two occurrences I see in the docs both explain the term. It seems better to me to indicate the index is ignored rather than using this terminology.

maciejskorski · 2022-09-20T08:24:51Z

I don't insist. I acknowledge that pandas is used by people with different backgrounds.

So please re-read my request as asking for decoupling “unique” and “count”, like proposed above, that would do the job.

Sorry for that verbose communication, but I come with research background where the choice of words means a lot :-)

adejumoridwan · 2023-06-19T13:49:40Z

I would like to take on this issue

dbalagula · 2023-07-18T15:03:17Z

Take, if still available

adejumoridwan · 2023-07-18T15:09:35Z

Take

maciejskorski added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 19, 2022

github-actions bot assigned WooilKim Sep 19, 2022

rhshadrach added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Sep 19, 2022

topper-123 added good first issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2023

github-actions bot assigned adejumoridwan Jul 18, 2023

ArchPh03nix mentioned this issue Jul 18, 2023

DOC: changed description of value_counts function #54180

Merged

5 tasks

mroeschke closed this as completed in #54180 Jul 20, 2023

adejumoridwan removed their assignment Jul 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: `value_counts` description doesn't match code logic #48635

DOC: `value_counts` description doesn't match code logic #48635

maciejskorski commented Sep 19, 2022 •

edited

Loading

WooilKim commented Sep 19, 2022 •

edited

Loading

maciejskorski commented Sep 19, 2022

WooilKim commented Sep 19, 2022

phofl commented Sep 19, 2022

rhshadrach commented Sep 19, 2022

maciejskorski commented Sep 19, 2022 •

edited

Loading

rhshadrach commented Sep 20, 2022

maciejskorski commented Sep 20, 2022 •

edited

Loading

adejumoridwan commented Jun 19, 2023

dbalagula commented Jul 18, 2023

adejumoridwan commented Jul 18, 2023

DOC: value_counts description doesn't match code logic #48635

DOC: value_counts description doesn't match code logic #48635

Comments

maciejskorski commented Sep 19, 2022 • edited Loading

Pandas version checks

Location of the documentation

Documentation problem

Suggested fix for documentation

WooilKim commented Sep 19, 2022 • edited Loading

maciejskorski commented Sep 19, 2022

WooilKim commented Sep 19, 2022

phofl commented Sep 19, 2022

rhshadrach commented Sep 19, 2022

maciejskorski commented Sep 19, 2022 • edited Loading

rhshadrach commented Sep 20, 2022

maciejskorski commented Sep 20, 2022 • edited Loading

adejumoridwan commented Jun 19, 2023

dbalagula commented Jul 18, 2023

adejumoridwan commented Jul 18, 2023

DOC: `value_counts` description doesn't match code logic #48635

DOC: `value_counts` description doesn't match code logic #48635

maciejskorski commented Sep 19, 2022 •

edited

Loading

WooilKim commented Sep 19, 2022 •

edited

Loading

maciejskorski commented Sep 19, 2022 •

edited

Loading

maciejskorski commented Sep 20, 2022 •

edited

Loading