Skip to content

BUG: Fixes #36918 boxplots with matplotlib #37107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

francibm97
Copy link

Fixes #36918

I believe that the @TomAugspurger in #35393 found out that, when sharex or sharey are enabled, matplotlib counts the number of ticks of an axis in a subplot as the total number of ticks of the shared axis in all subplots. Since matplotlib became stricter with set_xticks, the proposed workaround was replicating the labels n times where n is "total number of ticks"/"number of ticks of this subplot"="number of subplots".

However, that fix was implemented only for vertical boxplots and not for horizontal boxplots, hence why the issue.
Also, that fix causes the labels of the shared axis to be drawn n times (where n is the number of subplots) in some cases, see below (matplotlib 3.3.2, code adapted from the one posted by the author of issue 36918)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

f, axs = plt.subplots(nrows=2, ncols=1, figsize=(10,7), sharex=True)
df1.boxplot(ax=axs[0], vert=True)
df2.boxplot(ax=axs[1], vert=True)

plt.show()

before-3 3 2

In this fix I used set_major_formatter with a FixedFormatter which fixes the issue and does not overdraw the labels:

after-3 3 2

Copy link
Member

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need a test and a whatnew note

@francibm97
Copy link
Author

Sorry I've now added a test and a note in the 1.2.0 rst

Copy link
Member

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, the change looks very nice!!

btw, just a question to clarify, are you basically solving two issues in this PR: one is error raised when vert=False, and the other one is overlapping/duplicate labels when sharex/y=True

if so, maybe clarify it in the whatsnew, and better to make two tests, one to have df and a single plot with vert=False to prove it works, the other test which is similar to what you have now to show the duplicate/overlapping labels issue is gone?

@@ -437,6 +437,7 @@ Plotting
- Bug in :meth:`DataFrame.plot` was rotating xticklabels when ``subplots=True``, even if the x-axis wasn't an irregular time series (:issue:`29460`)
- Bug in :meth:`DataFrame.plot` where a marker letter in the ``style`` keyword sometimes causes a ``ValueError`` (:issue:`21003`)
- Twinned axes were losing their tick labels which should only happen to all but the last row or column of 'externally' shared axes (:issue:`33819`)
- Bug in :meth:`DataFrame.boxplot` was raising ``ValueError`` when plotting with ``vert=True`` on a subplot with shared axes (:issue:`36918`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ehh, isn't the original issue about erroring when vert=False?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry my bad I will correct it

Comment on lines +311 to +312
axis.set_major_formatter(FixedFormatter(keys))
ax.tick_params(axis=axis.axis_name, which="major", rotation=rot)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice clean-up!

def test_boxplot_shared_axis(self, vert):
# GH 37107
df1 = DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "E"])
df2 = DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "E"])
Copy link
Member

@charlesdong1991 charlesdong1991 Oct 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df2 = DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "E"])
df2 = df1.copy()

ahh, if it's to generate two dfs with different random numbers, ignore, but still, probably one df is enough

# In order for the ticklabels to be placed, the plot has to be drawn
fig.canvas.draw()

for ax in axes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i might not fully understand, why needs two dfs and two plots (in each of axes)? can we just use one?

Comment on lines 251 to 253
assert len(labels) % 5 == 0
assert len(labels) // (len(labels) // 5) == 5
assert labels[:5] == ["A", "B", "C", "D", "E"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you maybe simply it a bit? seems we want to test there are 10 labels while the first 5 is those column names, and last 5 are empty?

@francibm97
Copy link
Author

Thanks for the feedback about the tests, I looked better into the issue of the duplicated and empty ticks and checked better if it was really related to the shared axis or not.

Turns out that it was not.

The problem is that when matplotlib's boxplot is called multiple times on the same axis (or on different shared axes) with manage_ticks=True (which is the default), it will append a new list of locations to the previous one of the FixedLocator without removing duplicates. See https://github.com/matplotlib/matplotlib/blob/c8203fb8ed1a24fb999942340e55ea4a7a9fe6e8/lib/matplotlib/axes/_axes.py#L4066

This is really why there are double ticks when boxplot is called twice (on two shared axes or on the same axis). Therefore the previous workaround would overdraw the labels both in the case of using boxplot on two shared axes and in the case of using boxplot on the same axis twice, because, for what concernes the ticklabels, the two situation are the same.
I leave a sample code to check that behaviour:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "E"])

fig, ax = plt.subplots(1, 1)
df.boxplot(ax=ax, vert=True)
df.boxplot(ax=ax, vert=True)
plt.show()

So now I think that the duplicated ticks are not a feature of the shared axis as I initially thought, but rather a "feature" of matplotlib's boxplot which does not handle well multiple calls on the same axis and which also incidentally is problematic on shared axes. I think that I will rise an issue in the matplotlib repo because, for other plots such as scatter, this problem of double ticks does not arise:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
_ = ax.scatter(['a'], [1])
_ = ax.scatter(['a'], [2])
fig.canvas.draw()

print(ax.xaxis.get_ticklabels())
[Text(0, 0, 'a')]


import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
_ = ax.boxplot([1,2,3])
_ = ax.boxplot([5,6,7])
fig.canvas.draw()

print(ax.xaxis.get_ticklabels())
[Text(1, 0, '1'), Text(1, 0, '1')]

The problem can be solved by preventing this redundant ticks to show up when multiple call of boxplot are made.
This can be done by setting a new FixedLocator along with the FixedFormatter so that there are no duplicated ticks.
This in turn solves the two things you mentioned:

  • The problem with the horizontal boxplots
  • The overdrawing of the labels

However I have got some questions:

  • If the user specifies manage_ticks=False, should pandas add the labels anyways or not?
  • Is the user supposed to run boxplot multiple times on the same axis? If yes, then the proposed workaround cannot be accepted because something like the code reported below wouldn't work
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "E"])

fig, ax = plt.subplots(1, 1)
df.boxplot(ax=ax)
df.boxplot(ax=ax, positions=[10, 11, 12, 13, 14])
plt.show()
-> Only the second boxplot is labels, the first one has no labels

There could be other solutions (such as removing the duplicates in the FixedLocator)

I do not know if a more complex solution is acceptable if it can work in a wider range of cases,
or if this simple solution is preferred

Sorry for the wall of text and for changing the reviewed code

(TODO: still to update the whatsnew and test)

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 16, 2020
@jreback
Copy link
Contributor

jreback commented Nov 17, 2020

@francibm97 if you can merge master and update we can get this reviewed and in

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

closing as stale. if you want to continue, pls ping and can re-open.

@jreback jreback closed this Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Horizontal boxplots on subplots throws ValueError
3 participants