
BUG: DataFrameGroupBy.value_counts() fails if as_index=False and there are duplicate column labels #45160


Merged

Conversation

johnzangwill
Contributor

DataFrameGroupBy.value_counts() now succeeds if there are duplicate column labels. In addition, it raises if as_index=False and a column label clashes with the result column, either "count" or "proportion".
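As a quick illustration of the fixed behavior (a hedged sketch; assumes a pandas build that includes this change):

```python
import pandas as pd

# A frame with duplicate column labels "a"; before the fix,
# DataFrameGroupBy.value_counts() failed on input like this.
df = pd.DataFrame([[1, 2, 3], [1, 2, 3]])
df.columns = ["a", "a", "b"]

# With the fix, value_counts succeeds and keeps both "a" columns.
out = df.groupby("b", as_index=False).value_counts()

# By contrast, a column labelled "count" would raise, since it clashes
# with the result column that as_index=False must add.
```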

f"Column label '{name}' is duplicate of result column"
)
columns = com.fill_missing_names(columns)
values = result.values
Member

result is a Series at this point? ._values is generally preferable to .values, as the latter will cast dt64tz to ndarray[object]
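An illustrative sketch of the .values vs ._values distinction being discussed (not part of the patch):

```python
import pandas as pd

# For tz-aware data, Series.values materializes a plain NumPy ndarray,
# losing the tz-aware extension type, while ._values keeps the pandas
# DatetimeArray (and its timezone) intact.
s = pd.Series(pd.date_range("2022-01-01", periods=2, tz="UTC"))

dense = s.values   # plain NumPy ndarray; the tz-aware dtype cannot survive
arr = s._values    # pandas DatetimeArray with dtype datetime64[ns, UTC]
```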

Contributor Author

Thanks!
I decided to do this differently following a suggestion from a while ago (#44755 (comment))
I really wanted these methods to be public, but it is not happening...
Anyway, this way is more DRY and I shouldn't really be writing my own version of reset_index().

@johnzangwill johnzangwill marked this pull request as draft January 2, 2022 14:37
@johnzangwill johnzangwill marked this pull request as ready for review January 2, 2022 21:47
Contributor

@jreback left a comment

pls don't reinvent the wheel

@johnzangwill
Contributor Author

johnzangwill commented Jan 3, 2022

pls don't reinvent the wheel

I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API.
Otherwise, please tell me how you would suggest that this is done.

@johnzangwill johnzangwill requested a review from jreback January 3, 2022 08:58
@jreback
Contributor

jreback commented Jan 3, 2022

pls don't reinvent the wheel

I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API. Otherwise, please tell me how you would suggest that this is done.

adding private methods just explodes things. you can simply use concat internally (e.g. concat the index to the frame itself).

@jreback jreback added the Groupby label Jan 3, 2022
@johnzangwill
Contributor Author

pls don't reinvent the wheel

I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API. Otherwise, please tell me how you would suggest that this is done.

adding private methods just explodes things. you can simply use concat internally (e.g. concat the index to the frame itself).

I did not want to use private methods, nor to reinvent the wheel!
As you know, I wanted to add allow_duplicates to reset_index(). That would make this a one-liner.

So now this is basically a reinvented reset_index().
BTW I could not use Series.to_frame() because:

  • It throws away duplicate level names
  • MultiIndex has thrown away some dtypes, so I need to set each column

I could have used concat, but then I would have had to add the name and the index and then remove the index at the end. insert was simpler and neater.

Perhaps this case illuminates the discussion regarding column label duplicates. Here, the caller wants fine control over the duplicate creation policy, irrespective of any overall setting in the dataframe flags.
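A rough sketch of the insert-based approach described here (hypothetical helper name; not the PR's exact code):

```python
import pandas as pd

def flatten_counts(result: pd.Series, name: str) -> pd.DataFrame:
    """Turn an indexed counts Series into a frame, allowing duplicate
    column labels via DataFrame.insert(..., allow_duplicates=True)."""
    frame = pd.DataFrame({name: result.to_numpy()})
    for i in range(result.index.nlevels):
        # Insert each index level as a column, preserving repeated labels.
        frame.insert(i, result.index.names[i],
                     result.index.get_level_values(i),
                     allow_duplicates=True)
    return frame

# Demo: an index with duplicate level names, as produced by grouping
# a frame with duplicate column labels.
s = pd.Series([5], index=pd.MultiIndex.from_tuples([(1, 2)], names=["a", "a"]))
flat = flatten_counts(s, "count")   # columns: a, a, count
```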

@johnzangwill johnzangwill requested a review from jreback January 4, 2022 14:28
@johnzangwill
Contributor Author

johnzangwill commented Jan 6, 2022

@jreback

concat the index to the frame itself

can you not just concat?

  • concat does not take an Index argument.
  • Index.to_frame does not handle repeated labels.
  • MultiIndex does not handle bool dtype.
  • ...

So I think that my way is optimal, but I am always open to suggestions. Perhaps fixing MultiIndex.to_frame as a separate PR would be an option. But I would still need to go through the columns fixing the dtypes (as reset_index does)

@jbrockmendel
Member

MultiIndex does not handle bool dtype.

#45061 introduces Index[bool]. Would that get us within sight of a solution?

@johnzangwill
Contributor Author

#45061 introduces Index[bool]. Would that get us within sight of a solution?

That would help by reducing many of the calls to lib.maybe_convert_objects scattered around the system. It would remove a few lines from this PR.

IMO the "correct" solution to this PR is my #44755, which currently seems irretrievably blocked due to the allow_duplicates parameter. Then it would be just one line: result.reset_index(allow_duplicates=True)

@jreback demands using concat instead of my code here. But I explained above why that does not work. Fixing MultiIndex.to_frame (#45245) would help, and your #45061 would simplify that slightly.

I am inclined to launch a PR for #45245 but omitting the bool fix, which would go away once #45061 was merged. But I will need to add the dreaded allow_duplicates parameter to MultiIndex.to_frame...
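For illustration, the one-liner would look like this, assuming a pandas version whose reset_index accepts the proposed allow_duplicates flag:

```python
import pandas as pd

# Hypothetical usage of the proposed flag: an index name that collides
# with an existing column label is kept as a duplicate instead of raising.
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index([3, 4], name="a"))
flat = df.reset_index(allow_duplicates=True)  # two columns labelled "a"
```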

level_values = lib.maybe_convert_objects(
cast(np.ndarray, level_values)
)
result_frame.insert(i, column, level_values, allow_duplicates=True)
Contributor

this is almost O(n^2), inefficient.
so what you need to do is

  • if we don't have duplicates in the columns names, concat and are done
  • if we do, then a) should this really be the answer? why are we not raising?
  • if we actually do allow this then you simply change the column names to a range index, concat, then set them again.

pls don't conflate this issue with reset_index or allow_duplicates, they are completely orthogonal.

this is not going to move forward this keeps re-inventing the wheel.
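The rename-then-restore recipe in the bullets above might be sketched like this (hypothetical helper name and shapes):

```python
import pandas as pd

def reset_with_duplicates(s: pd.Series, columns, name) -> pd.DataFrame:
    # 1) rename the index levels to unique placeholder names,
    idx = s.index.set_names(range(s.index.nlevels))
    frame = idx.to_frame(index=False)
    # 2) concat the flattened index with the values,
    out = pd.concat([frame, s.reset_index(drop=True)], axis=1)
    # 3) set the (possibly duplicated) labels back.
    out.columns = list(columns) + [name]
    return out

s = pd.Series([2], index=pd.MultiIndex.from_tuples([(1, 2)], names=["a", "b"]))
flat = reset_with_duplicates(s, ["x", "x"], "count")  # columns: x, x, count
```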

Contributor Author

@johnzangwill johnzangwill Jan 10, 2022

Not O(n^2)! It is O(n), and of course I would rather not be iterating through an axis. That is the whole point of Pandas!

concat does not work, as I explained many times. The best I could do was:

result_frame = index.set_names(range(len(columns))).to_frame()
result_frame = (
    concat([result_frame, result], axis=1)
    .reset_index(drop=True)
    .set_axis(columns + [name], axis=1)
)

which fails 6 of my tests due to bool/object problems in MultiIndex. These will probably be fixed by #45061, but I see that has been deferred until 1.5.

Meanwhile, I employed your column renaming suggestion and there is no more looping. It is now all green (apart from the usual CI problems).

@johnzangwill johnzangwill requested a review from jreback January 10, 2022 10:51
@jbrockmendel
Member

Looking at this, I don't see any immediately obvious way to make this simpler than what @johnzangwill has done.

IIUC this can be simplified if/when the PRs mentioned here are merged. If we are optimistic about those going in soon-ish, then this can go on the backburner until then (and still go in for 1.5). Otherwise, we should merge with this approach and trust @johnzangwill to follow up with simplifications as they become feasible.

@johnzangwill
Contributor Author

If we are optimistic about those going in soon-ish

@jbrockmendel, I am not optimistic about my #45160 getting merged soon! @jreback strongly disagrees with my approach, which follows suggestions from other reviewers. What I would like would be for the reviewers to sort this out between them!
It could also be the case that #45061 together with my new #45318 would enable this to be completely vectorized using concat. But the correct place would be within reset_index, so it would still need #45160.

@@ -1754,10 +1754,22 @@ def value_counts(
level=index_level, sort_remaining=False
)

if not self.as_index:
if self.as_index:
return result.__finalize__(self.obj, method="value_counts")
Contributor

do result_frame = result and then do the __finalize__ after the if/then (e.g. share it between these)

Contributor Author

Done

f"Column label '{name}' is duplicate of result column"
)
result.name = name
result.index = index.set_names(range(len(columns)))
Contributor

again this seems way more complicated here. can you not simply
result_frame.columns = ....

Contributor Author

Done

@johnzangwill johnzangwill requested a review from jreback January 16, 2022 23:43
@jreback jreback added this to the 1.5 milestone Jan 17, 2022
@jreback jreback merged commit 439906e into pandas-dev:main Jan 17, 2022
@jreback
Contributor

jreback commented Jan 17, 2022

thanks @johnzangwill

@johnzangwill johnzangwill deleted the value_counts-with-duplicate-labels branch January 17, 2022 14:26