Skip to content

BUG: groupby.hist legend should use group keys #33493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Jun 22, 2020

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Apr 12, 2020

cc @TomAugspurger

Added argument "legend" to histogram backend. This adds the ability to display a legend even when not using a groupby.

df = pd.DataFrame(np.random.randn(30, 2), columns=['A', 'B'])
df['C'] = 15 * ['a'] + 15 * ['b']
df = df.set_index('C')
df.groupby('C')['A'].hist(legend=True)

produces

image

I went off the tests that already existed in test_hist_method and test_groupby, but perhaps more stringent tests should be added. I've been manually checking with the following code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(np.random.randn(30, 2), columns=['A', 'B'])
df['C'] = 15 * ['a'] + 15 * ['b']
df = df.set_index('C')

df.groupby('C')['A'].hist(legend=False)
plt.show()
df.groupby('C')['A'].hist(legend=True)
plt.show()
df.A.hist(by='C', legend=False)
plt.show()
df.A.hist(by='C', legend=True)
plt.show()
plt.show()
df.hist(by='C', legend=False)
plt.show()
df.hist(by='C', legend=True)
plt.show()
df.hist(by='C', column='B', legend=False)
plt.show()
df.hist(by='C', column='B', legend=True)
plt.show()

df.groupby('C')['A'].hist(label='D', legend=False)
plt.show()
df.groupby('C')['A'].hist(label='D', legend=True)
plt.show()
df.A.hist(by='C', label='D', legend=False)
plt.show()
df.A.hist(by='C', label='D', legend=True)
plt.show()
df.hist(by='C', label='D', legend=False)
plt.show()
df.hist(by='C', label='D', legend=True)
plt.show()
df.hist(by='C', column='B', label='D', legend=False)
plt.show()
df.hist(by='C', column='B', label='D', legend=True)
plt.show()

@alimcmaster1
Copy link
Member

Thanks for the PR!

@rhshadrach you still interested in working on this? Mind trying to fix up the tests if so?

@rhshadrach
Copy link
Member Author

@alimcmaster1 Yep - this is still active, thanks for the response. I'd be happy to improve the tests, but I'm not sure how. Do you have any recommendations?

@rhshadrach
Copy link
Member Author

@alimcmaster1 I've improved the tests in test_hist_method. test_groupby on the other hand seems to only make sure that the groupby method works. I'm guessing the idea is to make sure groupby works there, and then leave the details to the other plotting tests. With this, I've left that test as-is (aside from minor formatting changes). But let me know if you think that test should be improved somehow.

Copy link
Member

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR.

you also need a whatsnew note in 1.1

@rhshadrach
Copy link
Member Author

The introduced test fails on an older version of matplotlib. The exact line it fails on is

labels = [six.text_type(lab) for lab in label]

This issue is fixed in matplotlib master by converting the label to a string

 labels = [] if label is None else np.atleast_1d(np.asarray(label, str))

with this, I feel pretty comfortable changing the test labels to string to get them to pass.

@rhshadrach
Copy link
Member Author

@charlesdong1991 Thanks for the review, changes made and checks pass. For the whatsnew, it seemed to me to be a toss up as whether this is a bug or enhancement; I went with enhancement. I also added one additional test for SeriesGroupby.hist.

Copy link
Member

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good! @rhshadrach

only two nitpicks, otherwise LGTM!

cc @jreback @WillAyd for reviews

kwargs : dict, keyword arguments passed to matplotlib.Axes.hist

Returns
-------
collection of Matplotlib Axes
"""

if legend and "label" not in kwargs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why someone would pass legend and label together? Should this not just raise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I think is to rename for output:

index = 3 * ['1'] + 3 * ['2']
df = DataFrame([(3, 4)]*3 + [(5, 6)]*3, index=index, columns=['a', 'b'])
df.index.names = ['c']
ret = df.hist(by='c', legend=True, label=['Nice Name 1', 'Nice Name 2'])

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should just raise instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we would then be raising on the only case where label has any effect - or am I missing another case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't understand the reason why we would raise here when not raising makes it easier for the end user have different names appear in the legend. What does raising gain?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my thought is that If a user wants particular labels they should just provide that to the grouper; adding this here is just another way of doing things so just API clout

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would completely agree if this was implemented within pandas. However it's being passed through to matplotlib via kwargs. Disabling certain kwargs from working and allowing others to go through seems hazardous to user expectations. But even if this is unconvincing, I think we should strive to be consistent with various plots:

index = 3 * ['1'] + 3 * ['2']
df = DataFrame([(3, 4)]*3 + [(5, 6)]*3, index=index, columns=['a', 'b'])
df.index.names = ['c']
ret = df.plot(y=['a', 'b'], label=['Nice Name 1', 'Nice Name 2'])

image

Perhaps an issue could be raised as to how labels should be treated across all plots?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well shoot. As a counter-argument to what I just said. DataFrame.plot.hist ignores label altogether. I suppose if there is one plotting function we should be consistent with here, it's that one. What do you think about not raising, but simply ignoring any label kwarg?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should raise instead of ignoring; ignoring leads to confusion or surprising behavior, so always better to be explicit

@rhshadrach
Copy link
Member Author

@WillAyd, @charlesdong1991 Thanks for the reviews - changes made and checks pass.

@charlesdong1991
Copy link
Member

nice job, lgtm

cc @jreback @WillAyd

kwargs : dict, keyword arguments passed to matplotlib.Axes.hist

Returns
-------
collection of Matplotlib Axes
"""

if legend and "label" not in kwargs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should just raise instead

@rhshadrach
Copy link
Member Author

@WillAyd changes made, use of label and legend now raises.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. have a merge conflict and just a few things to clean up

@WillAyd
Copy link
Member

WillAyd commented Jun 4, 2020 via email

@rhshadrach
Copy link
Member Author

@WillAyd Changes made and checks pass. In addition to the splitting out the tests, I also improve the logic involving setting the label. Instead of using isinstance, it now uses ndim. Also the logic is flat instead of nested.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments otherwise lgtm. @TomAugspurger or @jorisvandenbossche if you care to look

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice overall.

One question about the expected behavior though: why don't we use a user-provided label when legend=True?

kwargs : dict, keyword arguments passed to matplotlib.Axes.hist

Returns
-------
collection of Matplotlib Axes
"""
if legend:
assert "label" not in kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pandas the only one calling this, or could users call it with labels in kwargs?

Users shouldn't see bare asserts like this. It should be a ValueError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And just to verify the expected behavior, if the user provides a label why wouldn't we want to use that? Can we just do kwargs.setdefault("label", data.name) (or the other cases)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first question, this function is only called from hist_frame and hist_series, each of which raise a ValueError on this condition. This assert is to maintain this behavior if in the future any other code paths call it.

For the second question, see #33493 (comment)

@@ -291,6 +291,7 @@ Other enhancements
- :meth:`groupby.transform` now allows ``func`` to be ``pad``, ``backfill`` and ``cumcount`` (:issue:`31269`).
- :meth:`~pandas.io.json.read_json` now accepts `nrows` parameter. (:issue:`33916`).
- :meth `~pandas.io.gbq.read_gbq` now allows to disable progress bar (:issue:`33360`).
- :meth:`DataFrame.hist`, :meth:`Series.hist`, :meth:`DataFrameGroupby.hist`, and :meth:`SeriesGroupby.hist` have gained the ``legend`` argument. Set to True to show a legend in the histogram. (:issue:`6279`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrameGroupBy and SeriesGroupBy aren't in the top-level. It should be something like core.groupby.DataFrameGroupBy. See doc/source/referenece.groupby.rst for the right path..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - thanks!

@WillAyd
Copy link
Member

WillAyd commented Jun 10, 2020

One question about the expected behavior though: why don't we use a user-provided label when legend=True?

Talked about this starting here: #33493 (comment) so I specifically asked to have it raise. Happy to be overridden if you think better otherwise

@rhshadrach
Copy link
Member Author

@TomAugspurger: I've fixed the issues you pointed out in the whatsnew. What's still uncertain is how to handle label when legend=True. From #33493 (comment), I was for supporting and not overwriting label whereas @WillAyd was opposed to supporting because there was another way to set the labels (namely, changing the column names of the dataframe). Would like to get your thoughts here.

@rhshadrach
Copy link
Member Author

@TomAugspurger friendly ping. Just want to let you know I'll be unavailable starting the 24th until July 5th in case we want to get this into 1.1. No issue with it waiting until 1.2 on my end if the timing doesn't work out.

@TomAugspurger
Copy link
Contributor

Don't have strong thoughts. Let's just keep it as on this PR and adjust based on user feedback if it arises.

@WillAyd WillAyd added this to the 1.1 milestone Jun 22, 2020
@WillAyd WillAyd merged commit 506eb54 into pandas-dev:master Jun 22, 2020
@WillAyd
Copy link
Member

WillAyd commented Jun 22, 2020

Thanks @rhshadrach

@rhshadrach rhshadrach deleted the hist_legend branch July 11, 2020 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG/VIS: groupby.hist/plot() should pass group keys as labels
5 participants