ENH: Allow column grouping in DataFrame.plot #29944

Merged
merged 62 commits into from
Jun 9, 2022
Changes from 53 commits
Commits
469eff8
WIP
NicolasHug Dec 1, 2019
b96a659
minimal test
NicolasHug Dec 1, 2019
c5b187b
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Jan 17, 2020
99cc049
formatted files
NicolasHug Jan 17, 2020
06da910
linting again
NicolasHug Jan 17, 2020
49c728a
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Mar 30, 2020
b370ba5
Check number of lines and labels
NicolasHug Mar 30, 2020
d1a2ae0
Added input checks
NicolasHug Mar 30, 2020
10ac363
some comments
NicolasHug Mar 30, 2020
9774d42
Whatsnew entry
NicolasHug Mar 30, 2020
572de08
More tests
NicolasHug Mar 30, 2020
4519f22
Skip test if no scipy for kde
NicolasHug Mar 30, 2020
c6edb9f
Merge branch 'master' into subplot_groups
NicolasHug Apr 5, 2020
73dd7eb
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Apr 14, 2020
b804e57
Address comments
NicolasHug Apr 14, 2020
d7ee47c
fix docstring
NicolasHug Apr 14, 2020
8fb298d
Added error when columns are fouund in multiple subplots
NicolasHug Apr 14, 2020
d84f356
Merge branch 'subplot_groups' of github.com:NicolasHug/pandas into su…
NicolasHug Apr 14, 2020
ba0f982
sorted imports
NicolasHug Apr 14, 2020
fe98b86
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Jul 21, 2020
f81d295
Added type annotations
NicolasHug Jul 21, 2020
8255405
some more
NicolasHug Jul 21, 2020
f54459e
typos
NicolasHug Jul 21, 2020
f962425
Addressed comments
NicolasHug Jul 21, 2020
dda80b3
Specify stricter output type
NicolasHug Jul 21, 2020
a457b38
Again
NicolasHug Jul 21, 2020
8d85ba9
hopefully fix mypy?
NicolasHug Jul 21, 2020
bbaeffa
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Jul 29, 2020
81f2a7f
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Sep 1, 2020
fdd7610
added test for coverage
NicolasHug Sep 1, 2020
fe965b0
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Sep 4, 2020
4478597
changed to whatsnew 112
NicolasHug Sep 5, 2020
ac51202
used proper whatsnew file
NicolasHug Sep 5, 2020
722c83b
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Sep 5, 2020
c601c2d
Merge branch 'master' into subplot_groups
NicolasHug Sep 6, 2020
c99abba
Merge branch 'master' into subplot_groups
NicolasHug Nov 6, 2020
ea6c73d
unwgnjgrwnjwgnjrkknrjg
NicolasHug Nov 6, 2020
9184e64
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jan 17, 2021
34515a8
Addressed comments
NicolasHug Jan 17, 2021
ae90509
remove code due to merge mess up
NicolasHug Jan 17, 2021
827c5b5
fix mypy
NicolasHug Jan 17, 2021
5aace24
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jan 17, 2021
b28d3b3
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jan 18, 2021
587cbc1
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Apr 3, 2021
1517ae3
Merge branch 'master' into subplot_groups
NicolasHug May 10, 2021
8c61b83
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug May 10, 2021
b75d86c
merge again
NicolasHug May 10, 2021
55928ff
mypy for the billionth time
NicolasHug May 10, 2021
b5f47f2
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jun 8, 2021
43070d3
add return type
NicolasHug Jun 8, 2021
e3c0146
add line
NicolasHug Jun 8, 2021
a4c2676
mypy for the billionth time + 1
NicolasHug Jun 8, 2021
71b7b39
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Oct 4, 2021
4157618
Merge branch 'main' of github.com:pandas-dev/pandas into subplot_groups
NicolasHug Jan 17, 2022
9b44546
update whatnew
NicolasHug Jan 17, 2022
2d89f73
Merge branch 'main' into subplot_groups
NicolasHug Mar 7, 2022
8cd6342
import Iterable from typing.
NicolasHug Mar 29, 2022
ffd4b18
Merge branch 'main' of github.com:pandas-dev/pandas into subplot_groups
NicolasHug Mar 29, 2022
66342bc
Merge remote-tracking branch 'upstream/main' into subplot_groups
mroeschke Jun 2, 2022
48c9f32
Process groups to indices in one pass
mroeschke Jun 3, 2022
aa128dc
Merge remote-tracking branch 'upstream/main' into subplot_groups
mroeschke Jun 3, 2022
69efd18
Fix typing check
mroeschke Jun 3, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -259,6 +259,7 @@ Other enhancements
- :meth:`DataFrame.applymap` can now accept kwargs to pass on to the user-provided ``func`` (:issue:`39987`)
- Passing a :class:`DataFrame` indexer to ``iloc`` is now disallowed for :meth:`Series.__getitem__` and :meth:`DataFrame.__getitem__` (:issue:`39004`)
- :meth:`Series.apply` can now accept list-like or dictionary-like arguments that aren't lists or dictionaries, e.g. ``ser.apply(np.array(["sum", "mean"]))``, which was already the case for :meth:`DataFrame.apply` (:issue:`39140`)
- :meth:`DataFrame.plot` will now allow the ``subplots`` parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (:issue:`29688`).
- :meth:`DataFrame.plot.scatter` can now accept a categorical column for the argument ``c`` (:issue:`12380`, :issue:`31357`)
- :meth:`Series.loc` now raises a helpful error message when the Series has a :class:`MultiIndex` and the indexer has too many dimensions (:issue:`35349`)
- :func:`read_stata` now supports reading data from compressed files (:issue:`26599`)
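
As a hedged illustration of the whatsnew entry above, the sketch below groups columns into shared subplots. It assumes a pandas release that includes this change and a working matplotlib install; the column names and data are made up for the example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import numpy as np
import pandas as pd

# Illustrative data only: five numeric columns named a..e.
df = pd.DataFrame({name: np.arange(10) + i for i, name in enumerate("abcde")})

# Two explicit groups; column "e" is not listed, so it gets a subplot of its own.
axes = df.plot(subplots=[("a", "c"), ("b", "d")])

print(len(axes))  # one Axes per group, plus one per ungrouped column
```
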
14 changes: 12 additions & 2 deletions pandas/plotting/_core.py
@@ -649,8 +649,18 @@ class PlotAccessor(PandasObject):
        - 'hexbin' : hexbin plot (DataFrame only)
    ax : matplotlib axes object, default None
        An axes of the current figure.
    subplots : bool, default False
        Make separate subplots for each column.
    subplots : bool or list of iterables, default False
        Whether to group columns into subplots:

Member

Just to clarify, are the docs rendered correctly with this empty line?

Contributor Author

We need an empty line because we're starting a list.
Is there a CI job that builds the docs? I couldn't find it

Member
ivanovmg Jun 8, 2021

Yes, CI runs to build the docs (but I am not sure that it checks for these features).
But you can build the particular page (for DataFrame.plot) yourself to ensure it is correct.
https://pandas.pydata.org/pandas-docs/dev/development/contributing_documentation.html#building-the-documentation

Contributor Author
NicolasHug Jun 8, 2021

I couldn't build the docs

python make.py clean
python make.py --single pandas.DataFrame.plot

fails with stuff like WARNING: autodoc: failed to import method 'plot.scatter' from module 'DataFrame'; the following exception was raised: No module named 'DataFrame'

(and with some theme issue)

So instead I just added a line so that the docstring for subplots is consistent with the rest. If the rest is fine, then this one should be fine too.

        - ``False`` - no subplots will be used
        - ``True`` - Make separate subplots for each column.
        - ``sequence of sequences of str`` - create a subplot for each group of columns.
          For example ``[('a', 'c'), ('b', 'd')]`` will create 2 subplots: one
          with columns 'a' and 'c', and one with columns 'b' and 'd'.
          Remaining columns that aren't specified will be plotted in
          additional subplots (one per column).

          .. versionadded:: 1.3.0

    sharex : bool, default True if ax is None else False
        In case ``subplots=True``, share x axis and set some x axis labels
        to invisible; defaults to True if ax is None otherwise False if
121 changes: 117 additions & 4 deletions pandas/plotting/_matplotlib/core.py
@@ -1,8 +1,13 @@
from __future__ import annotations

from collections import (
    Counter,
    Iterable,
)
from typing import (
    TYPE_CHECKING,
    Hashable,
    Sequence,
)
import warnings
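
One caveat about the import block above: `collections.Iterable` was a deprecated alias that was removed in Python 3.10, and the commit list includes a later commit titled "import Iterable from typing." The sketch below shows the same kind of check written against `collections.abc.Iterable`. The helper name `is_column_group` is hypothetical and is not part of the PR.

```python
from collections.abc import Iterable


def is_column_group(entry: object) -> bool:
    """Return True if entry is a non-string iterable (hypothetical helper)."""
    # A str is itself iterable, so it has to be excluded explicitly;
    # otherwise "ab" would be read as the group ("a", "b").
    return isinstance(entry, Iterable) and not isinstance(entry, str)
```

This mirrors the `not isinstance(group, Iterable) or isinstance(group, str)` condition used in the validation further down in the diff.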

@@ -103,7 +108,7 @@ def __init__(
        data,
        kind=None,
        by: IndexLabel | None = None,
        subplots=False,
        subplots: bool | Sequence[Sequence[str]] = False,
        sharex=None,
        sharey=False,
        use_index=True,
@@ -167,8 +172,7 @@ def __init__(
        self.kind = kind

        self.sort_columns = sort_columns

        self.subplots = subplots
        self.subplots = self._validate_subplots_kwarg(subplots)

        if sharex is None:

@@ -254,6 +258,98 @@ def __init__(

        self._validate_color_args()

    def _validate_subplots_kwarg(
        self, subplots: bool | Sequence[Sequence[str]]
    ) -> bool | list[tuple[int, ...]]:
        """
        Validate the subplots parameter

        - check type and content
        - check for duplicate columns
        - check for invalid column names
        - convert column names into indices
        - add missing columns in a group of their own
        See comments in code below for more details.

        Parameters
        ----------
        subplots : subplots parameters as passed to PlotAccessor

        Returns
        -------
        validated subplots : a bool or a list of tuples of column indices. Columns
            in the same tuple will be grouped together in the resulting plot.
        """

        if isinstance(subplots, bool):
            return subplots
        elif not isinstance(subplots, Iterable):
            raise ValueError("subplots should be a bool or an iterable")

        supported_kinds = (
            "line",
            "bar",
            "barh",
            "hist",
            "kde",
            "density",
            "area",
            "pie",
        )
        if self._kind not in supported_kinds:
            raise ValueError(
                "When subplots is an iterable, kind must be "
Member

What if subplots is boolean, then would it be necessary to validate self._kind?
If so, then, I guess, it deserves a separate method to be invoked at the moment of self.kind initialization.

By the way, this is quite confusing for me (probably regardless of the given PR): MPLPlot initializes self.kind, but it does not seem to be used anywhere; only self._kind is.
Am I missing something?

Contributor Author

> What if subplots is boolean, then would it be necessary to validate self._kind?

When subplots is boolean we return immediately as we fall back to the current behaviour, as in master. I haven't double checked but I would hope that self._kind is validated separately - I mean, it's _kind, not kind.

The validation happening in _validate_subplots_kwarg is only for subplots, and its interactions with other parameters (such as _kind) when subplots is an iterable. But its job is not to validate _kind.

                f"one of {', '.join(supported_kinds)}. Got {self._kind}."
            )

        if any(
            not isinstance(group, Iterable) or isinstance(group, str)
            for group in subplots
        ):
            raise ValueError(
                "When subplots is an iterable, each entry "
                "should be a list/tuple of column names."
            )

        cols_in_groups = [col for group in subplots for col in group]
        duplicates = {col for (col, cnt) in Counter(cols_in_groups).items() if cnt > 1}
        if duplicates:
            raise ValueError(
                f"Each column should be in only one subplot. Columns {duplicates} "
Member

Is it really a desirable behavior?
Why not allowing one to plot the same line against different sets?

Contributor Author
NicolasHug Jun 8, 2021

Currently (in master), each column is drawn in exactly one subplot.
Allowing a column to appear in more than one subplot should be possible, but it would make the PR more complex. This PR is already 1.5+ years old, so I would leave that improvement for a subsequent PR.

Member

yeah, agree, could be a follow-up enhancement PR!

                "were found in multiple subplots."
            )

        cols_in_groups_set = set(cols_in_groups)
        cols_remaining = set(self.data.columns) - cols_in_groups_set
        bad_columns = cols_in_groups_set - set(self.data.columns)

        if bad_columns:
            raise ValueError(
                "Subplots contains the following column(s) "
                f"which are invalid names: {bad_columns}"
            )

        # subplots is a list of tuples where each tuple is a group of
        # columns to be grouped together (one ax per group).
        # we consolidate the subplots list such that:
        # - the tuples contain indices instead of column names
        # - the columns that aren't yet in the list are added in a group
        #   of their own.
        # For example with columns from a to g, and
        # subplots = [(a, c), (b, f, e)],
        # we end up with [(ai, ci), (bi, fi, ei), (di,), (gi,)]
        # This way, we can handle self.subplots in a homogeneous manner
        # later.
        # TODO: also accept indices instead of just names?

        out = []
        index = list(self.data.columns).index
        for group in subplots:
            out.append(tuple(index(col) for col in group))
        for col in cols_remaining:
            out.append((index(col),))
        return out
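
The consolidation step described in the comment block above can be sketched outside pandas. This is a simplified, standalone illustration rather than the PR's code: `consolidate_groups` is a hypothetical name, the columns are a plain list of strings, and leftover columns are appended in column order (the diff iterates over a set, whose order is not guaranteed).

```python
from collections import Counter


def consolidate_groups(columns, groups):
    """Turn groups of column names into tuples of column positions.

    Hypothetical helper: columns that appear in no group each get a
    one-element tuple of their own, appended in column order.
    """
    cols_in_groups = [col for group in groups for col in group]

    # A column may belong to at most one group.
    duplicates = sorted(c for c, n in Counter(cols_in_groups).items() if n > 1)
    if duplicates:
        raise ValueError(f"Columns {duplicates} were found in multiple subplots.")

    # Every grouped name must be an existing column.
    bad_columns = sorted(set(cols_in_groups) - set(columns))
    if bad_columns:
        raise ValueError(f"Invalid column name(s): {bad_columns}")

    position = {col: i for i, col in enumerate(columns)}
    out = [tuple(position[col] for col in group) for group in groups]

    grouped = set(cols_in_groups)
    out.extend((position[col],) for col in columns if col not in grouped)
    return out


result = consolidate_groups(list("abcdefg"), [("a", "c"), ("b", "f", "e")])
print(result)  # [(0, 2), (1, 5, 4), (3,), (6,)]
```

The printed result corresponds to the `[(ai, ci), (bi, fi, ei), (di,), (gi,)]` example in the code comment, with positions written out as integers.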

    def _validate_color_args(self):
        if (
            "color" in self.kwds
@@ -372,8 +468,11 @@ def _maybe_right_yaxis(self, ax: Axes, axes_num):

    def _setup_subplots(self):
        if self.subplots:
            naxes = (
                self.nseries if isinstance(self.subplots, bool) else len(self.subplots)
            )
            fig, axes = create_subplots(
                naxes=self.nseries,
                naxes=naxes,
                sharex=self.sharex,
                sharey=self.sharey,
                figsize=self.figsize,
@@ -785,9 +884,23 @@ def _get_ax_layer(cls, ax, primary=True):
        else:
            return getattr(ax, "right_ax", ax)

    def _col_idx_to_axis_idx(self, col_idx: int) -> int:
        """Return the index of the axis where the column at col_idx should be plotted"""
        if isinstance(self.subplots, list):
            # Subplots is a list: some columns will be grouped together in the same ax
            return next(
                group_idx
                for (group_idx, group) in enumerate(self.subplots)
                if col_idx in group
            )
        else:
Member

Is this else clause required? I guess it could be dropped.

Contributor Author
NicolasHug Jun 8, 2021

it's a matter of style. I personally think that this way is less ambiguous when the first return statement is deeply nested with 4 indentation levels: it's easy to miss and to think that the function only returns col_idx. The else clause avoids that potential confusion. This choice was deliberate on my part, but I don't mind changing.

Member
ivanovmg Jun 8, 2021

I suppose that once you added the return type, mypy started complaining.
Inside the loop you return only if col_idx is in group, which is the reason for the mypy error.
I suggest that you modify it:

for group_idx, group in enumerate(self.subplots):
    if col_idx in group:
        return group_idx
raise KeyError(f"Failed to find {col_idx} in {self.subplots}")

Or something like this.

Contributor Author
NicolasHug Jun 8, 2021

I think mypy is wrong here. Because of the validation done earlier, there must be a valid column. Raising an exception might make mypy happy but it would never be hit (because of the validation), and we wouldn't be able to test against it so this would reduce test coverage. I changed the for loop into a next(), which seems to trick mypy into being reasonable again.

            # subplots is True: one ax per column
            return col_idx
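
The two styles discussed in the thread above can be compared in a standalone sketch. Both functions are hypothetical illustrations that operate on a plain list of index tuples, not the PR's method: one uses `next()` over a generator, as in the diff, and the other uses the explicit loop with a raise that was suggested in review.

```python
def axis_index_next(groups, col_idx):
    """Find the group containing col_idx using next() over a generator."""
    return next(
        group_idx for group_idx, group in enumerate(groups) if col_idx in group
    )


def axis_index_loop(groups, col_idx):
    """Same lookup written as an explicit loop with an explicit failure."""
    for group_idx, group in enumerate(groups):
        if col_idx in group:
            return group_idx
    raise KeyError(f"Failed to find {col_idx} in {groups}")


# Columns 1 and 4 share axis 0, columns 2 and 3 share axis 1, column 0 is alone.
groups = [(1, 4), (2, 3), (0,)]
print(axis_index_next(groups, 3), axis_index_loop(groups, 3))
```

If a column index were ever missing, the `next()` form would raise `StopIteration`, while the loop form raises the `KeyError` shown. Given the validation performed earlier, neither path should be reachable in practice.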

    def _get_ax(self, i: int):
        # get the twinx ax if appropriate
        if self.subplots:
            i = self._col_idx_to_axis_idx(i)
            ax = self.axes[i]
            ax = self._maybe_right_yaxis(ax, i)
            self.axes[i] = ax
66 changes: 66 additions & 0 deletions pandas/tests/plotting/frame/test_frame.py
@@ -2039,6 +2039,72 @@ def test_plot_no_numeric_data(self):
        with pytest.raises(TypeError, match="no numeric data to plot"):
            df.plot()

    @td.skip_if_no_scipy
    @pytest.mark.parametrize(
        "kind", ("line", "bar", "barh", "hist", "kde", "density", "area", "pie")
    )
    def test_group_subplot(self, kind):
        d = {
            "a": np.arange(10),
            "b": np.arange(10) + 1,
            "c": np.arange(10) + 1,
            "d": np.arange(10),
            "e": np.arange(10),
        }
        df = DataFrame(d)

        axes = df.plot(subplots=[("b", "e"), ("c", "d")], kind=kind)
        assert len(axes) == 3  # 2 groups + single column a

        expected_labels = (["b", "e"], ["c", "d"], ["a"])
        for ax, labels in zip(axes, expected_labels):
            if kind != "pie":
                self._check_legend_labels(ax, labels=labels)
            if kind == "line":
                assert len(ax.lines) == len(labels)

    @pytest.mark.parametrize(
        "subplots, expected_msg",
        [
            (123, "subplots should be a bool or an iterable"),
            ("a", "each entry should be a list/tuple"),  # iterable of non-iterable
            ((1,), "each entry should be a list/tuple"),  # iterable of non-iterable
            (("a",), "each entry should be a list/tuple"),  # iterable of strings
        ],
    )
    def test_group_subplot_bad_input(self, subplots, expected_msg):
Member

Good parametrization!

        # Make sure error is raised when subplots is not a properly
        # formatted iterable. Only iterables of iterables are permitted, and
        # entries should not be strings.
        d = {"a": np.arange(10), "b": np.arange(10)}
        df = DataFrame(d)

        with pytest.raises(ValueError, match=expected_msg):
            df.plot(subplots=subplots)

    def test_group_subplot_invalid_column_name(self):
        d = {"a": np.arange(10), "b": np.arange(10)}
        df = DataFrame(d)

        with pytest.raises(ValueError, match="invalid names: {'bad_name'}"):
            df.plot(subplots=[("a", "bad_name")])

    def test_group_subplot_duplicated_column(self):
        d = {"a": np.arange(10), "b": np.arange(10), "c": np.arange(10)}
        df = DataFrame(d)

        with pytest.raises(ValueError, match="should be in only one subplot"):
Member
ivanovmg Jun 8, 2021

Is it really a desirable behavior (see other comment)?

            df.plot(subplots=[("a", "b"), ("a", "c")])

    @pytest.mark.parametrize("kind", ("box", "scatter", "hexbin"))
    def test_group_subplot_invalid_kind(self, kind):
        d = {"a": np.arange(10), "b": np.arange(10)}
        df = DataFrame(d)
        with pytest.raises(
            ValueError, match="When subplots is an iterable, kind must be one of"
        ):
            df.plot(subplots=[("a", "b")], kind=kind)
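
The kind restriction exercised by the test above can be reproduced without pandas. The sketch below is a hypothetical standalone check: the tuple of kinds and the message wording are copied from the diff, while the function name `check_kind_for_grouped_subplots` is made up for illustration.

```python
# Kinds for which grouped subplots are accepted, as listed in the diff.
SUPPORTED_KINDS = ("line", "bar", "barh", "hist", "kde", "density", "area", "pie")


def check_kind_for_grouped_subplots(kind):
    """Raise ValueError if kind cannot be combined with grouped subplots."""
    if kind not in SUPPORTED_KINDS:
        raise ValueError(
            "When subplots is an iterable, kind must be "
            f"one of {', '.join(SUPPORTED_KINDS)}. Got {kind}."
        )


# Collect the kinds that are rejected, mirroring the parametrized test.
rejected = []
for kind in ("box", "scatter", "hexbin", "line"):
    try:
        check_kind_for_grouped_subplots(kind)
    except ValueError:
        rejected.append(kind)

print(rejected)  # ['box', 'scatter', 'hexbin']
```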

    @pytest.mark.parametrize(
        "index_name, old_label, new_label",
        [