
BUG: DataFrameGroupBy.value_counts() fails if as_index=False and there are duplicate column labels #45160


Merged

Conversation

johnzangwill
Contributor

DataFrameGroupBy.value_counts() now succeeds if there are duplicate column labels. In addition, it raises if as_index=False and a column label clashes with the result column, either "count" or "proportion".
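As a quick illustration of the fixed behavior (a hedged sketch; assumes a pandas build that includes this change):

```python
import pandas as pd

# A frame with duplicate column labels "a"; before the fix,
# DataFrameGroupBy.value_counts() failed on input like this.
df = pd.DataFrame([[1, 2, 3], [1, 2, 3]])
df.columns = ["a", "a", "b"]

# With the fix, value_counts succeeds and keeps both "a" columns.
out = df.groupby("b", as_index=False).value_counts()

# By contrast, a column labelled "count" would raise, since it clashes
# with the result column that as_index=False must add.
```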

f"Column label '{name}' is duplicate of result column"
)
columns = com.fill_missing_names(columns)
values = result.values
Member

result is a Series at this point? ._values is generally preferable to .values, as the latter will cast dt64tz to ndarray[object]
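An illustrative sketch of the .values vs ._values distinction being discussed (not part of the patch):

```python
import pandas as pd

# For tz-aware data, Series.values materializes a plain NumPy ndarray,
# losing the tz-aware extension type, while ._values keeps the pandas
# DatetimeArray (and its timezone) intact.
s = pd.Series(pd.date_range("2022-01-01", periods=2, tz="UTC"))

dense = s.values   # plain NumPy ndarray; the tz-aware dtype cannot survive
arr = s._values    # pandas DatetimeArray with dtype datetime64[ns, UTC]
```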

Contributor Author

Thanks!
I decided to do this differently following a suggestion from a while ago (#44755 (comment))
I really wanted these methods to be public, but it is not happening...
Anyway, this way is more DRY and I shouldn't really be writing my own version of reset_index().

@johnzangwill johnzangwill marked this pull request as draft January 2, 2022 14:37
@johnzangwill johnzangwill marked this pull request as ready for review January 2, 2022 21:47
Contributor

@jreback left a comment

pls don't reinvent the wheel

@johnzangwill
Contributor Author

johnzangwill commented Jan 3, 2022

pls don't reinvent the wheel

I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API.
Otherwise, please tell me how you would suggest that this is done.

@johnzangwill johnzangwill requested a review from jreback January 3, 2022 08:58
@jreback
Contributor

jreback commented Jan 3, 2022

pls don't reinvent the wheel

I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API. Otherwise, please tell me how you would suggest that this is done.

adding private methods just explodes things. you can simply use concat internally (e.g. concat the index to the frame itself).

@jreback jreback added the Groupby label Jan 3, 2022
@johnzangwill
Contributor Author

pls don't reinvent the wheel

I don't know what you mean. I haven't reinvented anything. I just added private methods with an additional parameter as was suggested in #44755 (comment). No new code at all and no change to the public API. Otherwise, please tell me how you would suggest that this is done.

adding private methods just explodes things. you can simply use concat internally (e.g. concat the index to the frame itself).

I did not want to use private methods, nor to reinvent the wheel!
As you know, I wanted to add allow_duplicates to reset_index(). That would make this a one-liner.

So now this is basically a reinvented reset_index().
BTW I could not use Series.to_frame() because:

  • It throws away duplicate level names
  • MultiIndex has thrown away some dtypes, so I need to set each column

I could have used concat, but then I would have had to add the name and the index and then remove the index at the end. insert was simpler and neater.

Perhaps this case illuminates the discussion regarding column label duplicates. Here, the caller wants fine control over the duplicate creation policy, irrespective of any overall setting in the dataframe flags.
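A rough sketch of the insert-based approach described here (hypothetical helper name; not the PR's exact code):

```python
import pandas as pd

def flatten_counts(result: pd.Series, name: str) -> pd.DataFrame:
    """Turn an indexed counts Series into a frame, allowing duplicate
    column labels via DataFrame.insert(..., allow_duplicates=True)."""
    frame = pd.DataFrame({name: result.to_numpy()})
    for i in range(result.index.nlevels):
        # Insert each index level as a column, preserving repeated labels.
        frame.insert(i, result.index.names[i],
                     result.index.get_level_values(i),
                     allow_duplicates=True)
    return frame

# Demo: an index with duplicate level names, as produced by grouping
# a frame with duplicate column labels.
s = pd.Series([5], index=pd.MultiIndex.from_tuples([(1, 2)], names=["a", "a"]))
flat = flatten_counts(s, "count")   # columns: a, a, count
```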

@johnzangwill johnzangwill requested a review from jreback January 4, 2022 14:28
@johnzangwill
Contributor Author

johnzangwill commented Jan 6, 2022

@jreback

concat the index to the frame itself

can you not just concat?

  • concat does not take an Index argument.
  • Index.to_frame does not handle repeated labels.
  • MultiIndex does not handle bool dtype.
  • ...

So I think that my way is optimal, but I am always open to suggestions. Perhaps fixing MultiIndex.to_frame as a separate PR would be an option. But I would still need to go through the columns fixing the dtypes (as reset_index does)

@jbrockmendel
Member

MultiIndex does not handle bool dtype.

#45061 introduces Index[bool]. Would that get us within sight of a solution?

@johnzangwill
Contributor Author

#45061 introduces Index[bool]. Would that get us within sight of a solution?

That would help by reducing many of the calls to lib.maybe_convert_objects scattered around the system. It would remove a few lines from this PR.

IMO the "correct" solution to this PR is my #44755, which currently seems irretrievably blocked due to the allow_duplicates parameter. Then it would be just one line: result.reset_index(allow_duplicates=True)

@jreback demands using concat instead of my code here. But I explained above why that does not work. Fixing MultiIndex.to_frame (#45245) would help, and your #45061 would simplify that slightly.

I am inclined to launch a PR for #45245 but omitting the bool fix, which would go away once #45061 was merged. But I will need to add the dreaded allow_duplicates parameter to MultiIndex.to_frame...
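For illustration, the one-liner would look like this, assuming a pandas version whose reset_index accepts the proposed allow_duplicates flag:

```python
import pandas as pd

# Hypothetical usage of the proposed flag: an index name that collides
# with an existing column label is kept as a duplicate instead of raising.
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index([3, 4], name="a"))
flat = df.reset_index(allow_duplicates=True)  # two columns labelled "a"
```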

level_values = lib.maybe_convert_objects(
cast(np.ndarray, level_values)
)
result_frame.insert(i, column, level_values, allow_duplicates=True)
Contributor

this is almost O(n^2), inefficient.
so what you need to do is

  • if we don't have duplicates in the columns names, concat and are done
  • if we do, then a) should this really be the answer? why are we not raising?
  • if we actually do allow this then you simply change the column names to a range index, concat, then set them again.

pls don't conflate this issue with reset_index or allow_duplicates, they are completely orthogonal.

this is not going to move forward this keeps re-inventing the wheel.
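The rename-then-restore recipe in the bullets above might be sketched like this (hypothetical helper name and shapes):

```python
import pandas as pd

def reset_with_duplicates(s: pd.Series, columns, name) -> pd.DataFrame:
    # 1) rename the index levels to unique placeholder names,
    idx = s.index.set_names(range(s.index.nlevels))
    frame = idx.to_frame(index=False)
    # 2) concat the flattened index with the values,
    out = pd.concat([frame, s.reset_index(drop=True)], axis=1)
    # 3) set the (possibly duplicated) labels back.
    out.columns = list(columns) + [name]
    return out

s = pd.Series([2], index=pd.MultiIndex.from_tuples([(1, 2)], names=["a", "b"]))
flat = reset_with_duplicates(s, ["x", "x"], "count")  # columns: x, x, count
```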

Contributor Author

@johnzangwill johnzangwill Jan 10, 2022

Not O(n^2)! It is O(n), and of course I would rather not be iterating through an axis. That is the whole point of Pandas!

concat does not work, as I explained many times. The best I could do was:

result_frame = index.set_names(range(len(columns))).to_frame()
result_frame = (
    concat([result_frame, result], axis=1)
    .reset_index(drop=True)
    .set_axis(columns + [name], axis=1)
)

which fails 6 of my tests due to bool/object problems in MultiIndex. These will probably be fixed by #45061, but I see that has been deferred until 1.5.

Meanwhile, I employed your column renaming suggestion and there is no more looping. It is now all green (apart from the usual CI problems).

@johnzangwill johnzangwill requested a review from jreback January 10, 2022 10:51
@jbrockmendel
Member

Looking at this, I don't see any immediately obvious way to make this simpler than what @johnzangwill has done.

IIUC this can be simplified if/when the PRs mentioned here are merged. If we are optimistic about those going in soon-ish, then this can go on the backburner until then (and still go in for 1.5). Otherwise, we should merge with this approach and trust @johnzangwill to follow up with simplifications as they become feasible.

@johnzangwill
Contributor Author

If we are optimistic about those going in soon-ish

@jbrockmendel, I am not optimistic about my #45160 getting merged soon! @jreback strongly disagrees with my approach, which follows suggestions from other reviewers. What I would like would be for the reviewers to sort this out between them!
It could also be the case that #45061 together with my new #45318 would enable this to be completely vectorized using concat. But the correct place would be within reset_index, so it would still need #45160.

@@ -1754,10 +1754,22 @@ def value_counts(
level=index_level, sort_remaining=False
)

if not self.as_index:
if self.as_index:
return result.__finalize__(self.obj, method="value_counts")
Contributor

do result_frame = result and then do the __finalize__ after the if/then (e.g. share it between these)

Contributor Author

Done

f"Column label '{name}' is duplicate of result column"
)
result.name = name
result.index = index.set_names(range(len(columns)))
Contributor

again this seems way more complicated here. can you not simply
result_frame.columns = ....

Contributor Author

Done

@johnzangwill johnzangwill requested a review from jreback January 16, 2022 23:43
@jreback jreback added this to the 1.5 milestone Jan 17, 2022
@jreback jreback merged commit 439906e into pandas-dev:main Jan 17, 2022
@jreback
Contributor

jreback commented Jan 17, 2022

thanks @johnzangwill

@johnzangwill johnzangwill deleted the value_counts-with-duplicate-labels branch January 17, 2022 14:26