-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrameGroupBy.value_counts() fails if as_index=False and there are duplicate column labels #45160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: DataFrameGroupBy.value_counts() fails if as_index=False and there are duplicate column labels #45160
Conversation
pandas/core/groupby/generic.py
Outdated
f"Column label '{name}' is duplicate of result column" | ||
) | ||
columns = com.fill_missing_names(columns) | ||
values = result.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result is a Series at this point? ._values is generally preferable to .values, as the latter will cast dt64tz to ndarray[object]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I decided to do this differently following a suggestion from a while ago (#44755 (comment))
I really wanted these methods to be public, but it is not happening...
Anyway, this way is more DRY and I shouldn't really be writing my own version of reset_index().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls don't reinvent the wheel
I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API. |
adding private methods just explodes things. you can simply use concat internally (e.g. concat the index to the frame itself). |
This reverts commit 9093374.
…om/johnzangwill/pandas into value_counts-with-duplicate-labels
…om/johnzangwill/pandas into value_counts-with-duplicate-labels
I did not want to use private methods, nor to reinvent the wheel! So now this is basically a reinvented
I could have used Perhaps this case illuminates the discussion regarding column label duplicates. Here, the caller wants fine control over the duplicate creation policy, irrespective of any overall setting in the dataframe flags. |
So I think that my way is optimal, but I am always open to suggestions. Perhaps fixing MultiIndex.to_frame as a separate PR would be an option. But I would still need to go through the columns fixing the dtypes (as reset_index does) |
#45061 introduces Index[bool]. Would that get us within sight of a solution? |
That would help by reducing many of the calls to IMO the "correct" solution to this PR is my #44755, whch currently seems irretrievably blocked due to the @jreback demands using concat instead of my code here. But I explained above why that is not working. Fixing MultiIndex.to_frame (#45245) would help, and your #45061 would simplify that slightly. I am inclined to launch a PR for #45245 but omitting the bool fix, which would go away once #45061 was merged. But I will need to add the dreaded |
pandas/core/groupby/generic.py
Outdated
level_values = lib.maybe_convert_objects( | ||
cast(np.ndarray, level_values) | ||
) | ||
result_frame.insert(i, column, level_values, allow_duplicates=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this almost O(n^2) inefficient.
so what you need to do is
- if we don't have duplicates in the columns names, concat and are done
- if we do, then a) should this really be the answer? why are we not raising?
- if we actually do allow this then you simply change the column names to a range index, concat, then set them again.
pls don't conflate this issue with reset_index or allow_duplicates, they are completely orthogonal.
this is not going to move forward this keeps re-inventing the wheel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not O(n^2)! But O(n), and of course I would rather not be itererating through an axis. That is the whole point of Pandas!
concat does not work, as I explained many times. The best I could do was:
result_frame = index.set_names(range(len(columns))).to_frame()
result_frame = concat([result_frame, result], axis=1).reset_index(drop=True).set_axis(columns + [name], axis=1)
which fails 6 of my tests due to bool/object problems in MultiIndex. These will probably be fixed by #45061, but I see that has been deferred until 1.5.
Meanwhile, I employed your column renaming suggestion and there is no more looping. It is now all green (apart from the usual ci problems)
Looking at this, I don't see any immediately obvious way to make this simpler than what @johnzangwill has done. IIUC this can be simplified if/when the PRs mentioned here are merged. If we are optimistic about those going in soon-ish, then this can go on the backburner until then (and still go in for 1.5). Otherwise, we should merge with this approach and trust @johnzangwill to follow-up with simplifications as they become feasible. |
@jbrockmendel, I am not optimistic about my #45160 getting merged soon! @jreback strongly disagrees with my approach, which follows suggestions from other reviewers. What I would like would be for the reviewers to sort this out between them! |
pandas/core/groupby/generic.py
Outdated
@@ -1754,10 +1754,22 @@ def value_counts( | |||
level=index_level, sort_remaining=False | |||
) | |||
|
|||
if not self.as_index: | |||
if self.as_index: | |||
return result.__finalize__(self.obj, method="value_counts") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do result_frame = result
and then do the __finalize__
after the if/then (e.g. share it between these)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/groupby/generic.py
Outdated
f"Column label '{name}' is duplicate of result column" | ||
) | ||
result.name = name | ||
result.index = index.set_names(range(len(columns))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this seems way more complicated here. can you not simply
result_frame.columns = ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…om/johnzangwill/pandas into value_counts-with-duplicate-labels
thanks @johnzangwill |
DataFrameGroupBy.value_counts()
fails ifas_index=False
and there are duplicate column labels #44992DataFrameGroupBy.value_counts() now succedes if there are duplicate column labels. In addition, it fails if
as_index=False
and column labels clash with the results column, either "count" or "proportion".