Skip to content

BUG: Fix error for boxplot when using a pre-grouped DataFrame with more than one grouping #57985

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 31, 2024

Conversation

thetestspecimen
Copy link
Contributor

@thetestspecimen thetestspecimen commented Mar 24, 2024

The original bug shows an indexing error when using boxplot on a pre-grouped DataFrame. For example:

df.groupby(['X', 'Y']).boxplot()

This error occurs as multiple groupings are passed as a tuple into the index of a Series, which is not allowed. The fix converts the tuple to a string, which is acceptable as an index to a Series.

There are other comments on the bug thread that state "Groupby boxplot is not working even when a single 'by' is used", which is not the case.

I have included in the tests both a single by and multiple by covering three different methods of plotting a boxplot from a DataFrame:

  1. df.boxplot(by='X')
  2. df.boxplot(by=['X','Y'])
  3. df.plot.box(by='X')
  4. df.plot.box(by=['X','Y'])
  5. df.groupby('X').boxplot()
  6. df.groupby(['X','Y']).boxplot()

Of all of the above, only example 6 required the fix.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looking good.

ret.loc[key] = d
# GH 14701 'key' needs to be converted to text as 'key' is a tuple when
# there is more than one group
ret.loc[pprint_thing(key)] = d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of ret = pd.Series(dtype=object) above, can you do:

data = {}
for (key, group), ax in zip(grouped, axes):
    ...
    data[key] = d
ret = pd.Series(data)

In general one should not construct a Series row-by-row (this is very slow), but rather provide all the data upfront. I think this will also allow tuples for key.

)
df["X"] = Series(np.repeat(["A", "B"], int(rows / 2)))
df["Y"] = Series(np.tile(["C", "D"], int(rows / 2)))
pregrouped = df.groupby(group)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Just to adhere to existing conventions in tests, could you either call this grouped or gb. These are the names used in other tests here.

@@ -375,7 +375,7 @@ Period

Plotting
^^^^^^^^
-
- Bug in :meth:`DataFrameGroupBy.boxplot` passes a ``tuple`` instead of a ``string`` when input ``DataFrame`` is pre-grouped using more than one ``column`` (:issue:`14701`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link to the API docs won't resolve unless you include a . before DataFrameGroupBy; so it should be :meth:`.DataFrameGroupBy.boxplot`

@rhshadrach rhshadrach added this to the 3.0 milestone Mar 30, 2024
@thetestspecimen
Copy link
Contributor Author

Thanks for the PR! Looking good.

Thanks for the feedback.

I will make the necessary adjustments and update the PR accordingly.

# GH 14701 refactored to allow the 'key' to be passed as a tuple,
# which occurs when there is more than one group
data[key] = d
ret = pd.Series(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to be creating ret on every loop, only after the for loop is done. This line can be dedented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem.

Comment on lines 542 to 543
# GH 14701 refactored to allow the 'key' to be passed as a tuple,
# which occurs when there is more than one group
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment should be removed.

@@ -409,7 +409,7 @@ Period

Plotting
^^^^^^^^
-
- Bug in :meth:`.DataFrameGroupBy.boxplot` passes a ``tuple`` instead of a ``string`` when input ``DataFrame`` is pre-grouped using more than one ``column`` (:issue:`14701`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer accurate. I'd recommend something more generic, such as

Bug in :meth:`.DataFrameGroupBy.boxplot` failed when there were multiple groupings (:issue:14701)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I will update it.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit 73fd026 into pandas-dev:main Mar 31, 2024
46 checks passed
@rhshadrach
Copy link
Member

Thanks @thetestspecimen - very nice!

@thetestspecimen
Copy link
Contributor Author

Thanks @thetestspecimen - very nice!

No Problem @rhshadrach. Thanks for the detailed input and guidance.

@thetestspecimen thetestspecimen deleted the 14701_boxplot_groupby branch March 31, 2024 16:08
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby boxplot doesn't work if multiple by's are used
2 participants