ENH: Allow column grouping in DataFrame.plot #29944

Merged
merged 62 commits into from
Jun 9, 2022
469eff8
WIP
NicolasHug Dec 1, 2019
b96a659
minimal test
NicolasHug Dec 1, 2019
c5b187b
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Jan 17, 2020
99cc049
formatted files
NicolasHug Jan 17, 2020
06da910
linting again
NicolasHug Jan 17, 2020
49c728a
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Mar 30, 2020
b370ba5
Check number of lines and labels
NicolasHug Mar 30, 2020
d1a2ae0
Added input checks
NicolasHug Mar 30, 2020
10ac363
some comments
NicolasHug Mar 30, 2020
9774d42
Whatsnew entry
NicolasHug Mar 30, 2020
572de08
More tests
NicolasHug Mar 30, 2020
4519f22
Skip test if no scipy for kde
NicolasHug Mar 30, 2020
c6edb9f
Merge branch 'master' into subplot_groups
NicolasHug Apr 5, 2020
73dd7eb
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Apr 14, 2020
b804e57
Address comments
NicolasHug Apr 14, 2020
d7ee47c
fix docstring
NicolasHug Apr 14, 2020
8fb298d
Added error when columns are fouund in multiple subplots
NicolasHug Apr 14, 2020
d84f356
Merge branch 'subplot_groups' of github.com:NicolasHug/pandas into su…
NicolasHug Apr 14, 2020
ba0f982
sorted imports
NicolasHug Apr 14, 2020
fe98b86
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Jul 21, 2020
f81d295
Added type annotations
NicolasHug Jul 21, 2020
8255405
some more
NicolasHug Jul 21, 2020
f54459e
typos
NicolasHug Jul 21, 2020
f962425
Addressed comments
NicolasHug Jul 21, 2020
dda80b3
Specify stricter output type
NicolasHug Jul 21, 2020
a457b38
Again
NicolasHug Jul 21, 2020
8d85ba9
hopefully fix mypy?
NicolasHug Jul 21, 2020
bbaeffa
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Jul 29, 2020
81f2a7f
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Sep 1, 2020
fdd7610
added test for coverage
NicolasHug Sep 1, 2020
fe965b0
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Sep 4, 2020
4478597
changed to whatsnew 112
NicolasHug Sep 5, 2020
ac51202
used proper whatsnew file
NicolasHug Sep 5, 2020
722c83b
Merge branch 'master' of https://github.com/pandas-dev/pandas into su…
NicolasHug Sep 5, 2020
c601c2d
Merge branch 'master' into subplot_groups
NicolasHug Sep 6, 2020
c99abba
Merge branch 'master' into subplot_groups
NicolasHug Nov 6, 2020
ea6c73d
unwgnjgrwnjwgnjrkknrjg
NicolasHug Nov 6, 2020
9184e64
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jan 17, 2021
34515a8
Addressed comments
NicolasHug Jan 17, 2021
ae90509
remove code due to merge mess up
NicolasHug Jan 17, 2021
827c5b5
fix mypy
NicolasHug Jan 17, 2021
5aace24
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jan 17, 2021
b28d3b3
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jan 18, 2021
587cbc1
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Apr 3, 2021
1517ae3
Merge branch 'master' into subplot_groups
NicolasHug May 10, 2021
8c61b83
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug May 10, 2021
b75d86c
merge again
NicolasHug May 10, 2021
55928ff
mypy for the billionth time
NicolasHug May 10, 2021
b5f47f2
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Jun 8, 2021
43070d3
add return type
NicolasHug Jun 8, 2021
e3c0146
add line
NicolasHug Jun 8, 2021
a4c2676
mypy for the billionth time + 1
NicolasHug Jun 8, 2021
71b7b39
Merge branch 'master' of github.com:pandas-dev/pandas into subplot_gr…
NicolasHug Oct 4, 2021
4157618
Merge branch 'main' of github.com:pandas-dev/pandas into subplot_groups
NicolasHug Jan 17, 2022
9b44546
update whatnew
NicolasHug Jan 17, 2022
2d89f73
Merge branch 'main' into subplot_groups
NicolasHug Mar 7, 2022
8cd6342
import Iterable from typing.
NicolasHug Mar 29, 2022
ffd4b18
Merge branch 'main' of github.com:pandas-dev/pandas into subplot_groups
NicolasHug Mar 29, 2022
66342bc
Merge remote-tracking branch 'upstream/main' into subplot_groups
mroeschke Jun 2, 2022
48c9f32
Process groups to indices in one pass
mroeschke Jun 3, 2022
aa128dc
Merge remote-tracking branch 'upstream/main' into subplot_groups
mroeschke Jun 3, 2022
69efd18
Fix typing check
mroeschke Jun 3, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -133,6 +133,7 @@ Other enhancements
 - :meth:`MultiIndex.to_frame` now supports the argument ``allow_duplicates`` and raises on duplicate labels if it is missing or False (:issue:`45245`)
 - :class:`StringArray` now accepts array-likes containing nan-likes (``None``, ``np.nan``) for the ``values`` parameter in its constructor in addition to strings and :attr:`pandas.NA`. (:issue:`40839`)
 - Improved the rendering of ``categories`` in :class:`CategoricalIndex` (:issue:`45218`)
+- :meth:`DataFrame.plot` will now allow the ``subplots`` parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (:issue:`29688`).
 - :meth:`to_numeric` now preserves float64 arrays when downcasting would generate values not representable in float32 (:issue:`43693`)
 - :meth:`Series.reset_index` and :meth:`DataFrame.reset_index` now support the argument ``allow_duplicates`` (:issue:`44410`)
 - :meth:`.GroupBy.min` and :meth:`.GroupBy.max` now supports `Numba <https://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`45428`)
14 changes: 12 additions & 2 deletions pandas/plotting/_core.py
@@ -649,8 +649,18 @@ class PlotAccessor(PandasObject):
         - 'hexbin' : hexbin plot (DataFrame only)
     ax : matplotlib axes object, default None
         An axes of the current figure.
-    subplots : bool, default False
-        Make separate subplots for each column.
+    subplots : bool or sequence of iterables, default False
+        Whether to group columns into subplots:
+
+        - ``False`` : No subplots will be used
+        - ``True`` : Make separate subplots for each column.
+        - sequence of iterables of column labels: Create a subplot for each
+          group of columns. For example `[('a', 'c'), ('b', 'd')]` will
+          create 2 subplots: one with columns 'a' and 'c', and one
+          with columns 'b' and 'd'. Remaining columns that aren't specified
+          will be plotted in additional subplots (one per column).
+          .. versionadded:: 1.5.0
+
     sharex : bool, default True if ax is None else False
         In case ``subplots=True``, share x axis and set some x axis labels
         to invisible; defaults to True if ax is None otherwise False if
132 changes: 128 additions & 4 deletions pandas/plotting/_matplotlib/core.py
@@ -3,6 +3,8 @@
 from typing import (
     TYPE_CHECKING,
     Hashable,
+    Iterable,
+    Sequence,
 )
 import warnings

@@ -102,7 +104,7 @@ def __init__(
         data,
         kind=None,
         by: IndexLabel | None = None,
-        subplots=False,
+        subplots: bool | Sequence[Sequence[str]] = False,
         sharex=None,
         sharey=False,
         use_index=True,
@@ -166,8 +168,7 @@ def __init__(
         self.kind = kind

         self.sort_columns = sort_columns
-
-        self.subplots = subplots
+        self.subplots = self._validate_subplots_kwarg(subplots)

         if sharex is None:

@@ -253,6 +254,112 @@ def __init__(

         self._validate_color_args()

+    def _validate_subplots_kwarg(
+        self, subplots: bool | Sequence[Sequence[str]]
+    ) -> bool | list[tuple[int, ...]]:
+        """
+        Validate the subplots parameter
+
+        - check type and content
+        - check for duplicate columns
+        - check for invalid column names
+        - convert column names into indices
+        - add missing columns in a group of their own
+        See comments in code below for more details.
+
+        Parameters
+        ----------
+        subplots : subplots parameters as passed to PlotAccessor
+
+        Returns
+        -------
+        validated subplots : a bool or a list of tuples of column indices. Columns
+            in the same tuple will be grouped together in the resulting plot.
+        """
+
+        if isinstance(subplots, bool):
+            return subplots
+        elif not isinstance(subplots, Iterable):
+            raise ValueError("subplots should be a bool or an iterable")
+
+        supported_kinds = (
+            "line",
+            "bar",
+            "barh",
+            "hist",
+            "kde",
+            "density",
+            "area",
+            "pie",
+        )
+        if self._kind not in supported_kinds:
+            raise ValueError(
+                "When subplots is an iterable, kind must be "
Member

What if subplots is a boolean? Would it then be necessary to validate self._kind?
If so, I guess it deserves a separate method, invoked when self.kind is initialized.

By the way, this is quite confusing to me (probably regardless of this PR): MPLPlot initializes self.kind, but it does not seem to be used anywhere; only self._kind is.
Am I missing something?

Contributor Author

> What if subplots is a boolean? Would it then be necessary to validate self._kind?

When subplots is a boolean, we return immediately and fall back to the current behaviour, as in master. I haven't double-checked, but I would hope that self._kind is validated separately. I mean, it's _kind, not kind.

The validation in _validate_subplots_kwarg covers only subplots and its interactions with other parameters (such as _kind) when subplots is an iterable. Its job is not to validate _kind.

+                f"one of {', '.join(supported_kinds)}. Got {self._kind}."
+            )
+
+        if isinstance(self.data, ABCSeries):
+            raise NotImplementedError(
+                "An iterable subplots for a Series is not supported."
+            )
+
+        columns = self.data.columns
+        if isinstance(columns, ABCMultiIndex):
+            raise NotImplementedError(
+                "An iterable subplots for a DataFrame with a MultiIndex column "
+                "is not supported."
+            )
+
+        if columns.nunique() != len(columns):
+            raise NotImplementedError(
+                "An iterable subplots for a DataFrame with non-unique column "
+                "labels is not supported."
+            )
+
+        # subplots is a list of tuples where each tuple is a group of
+        # columns to be grouped together (one ax per group).
+        # we consolidate the subplots list such that:
+        # - the tuples contain indices instead of column names
+        # - the columns that aren't yet in the list are added in a group
+        # of their own.
+        # For example with columns from a to g, and
+        # subplots = [(a, c), (b, f, e)],
+        # we end up with [(ai, ci), (bi, fi, ei), (di,), (gi,)]
+        # This way, we can handle self.subplots in a homogeneous manner
+        # later.
+        # TODO: also accept indices instead of just names?
+
+        out = []
+        seen_columns: set[Hashable] = set()
+        for group in subplots:
+            if not is_list_like(group):
+                raise ValueError(
+                    "When subplots is an iterable, each entry "
+                    "should be a list/tuple of column names."
+                )
+            idx_locs = columns.get_indexer_for(group)
+            if (idx_locs == -1).any():
+                bad_labels = np.extract(idx_locs == -1, group)
+                raise ValueError(
+                    f"Column label(s) {list(bad_labels)} not found in the DataFrame."
+                )
+            else:
+                unique_columns = set(group)
+                duplicates = seen_columns.intersection(unique_columns)
+                if duplicates:
+                    raise ValueError(
+                        "Each column should be in only one subplot. "
+                        f"Columns {duplicates} were found in multiple subplots."
+                    )
+                seen_columns = seen_columns.union(unique_columns)
+                out.append(tuple(idx_locs))
+
+        unseen_columns = columns.difference(seen_columns)
+        for column in unseen_columns:
+            idx_loc = columns.get_loc(column)
+            out.append((idx_loc,))
+        return out
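
Editorial sketch (not part of the PR): the consolidation step described in the code comments above can be illustrated without pandas. The function name `consolidate_groups` and its arguments are hypothetical; `columns` stands in for `self.data.columns` and is assumed to hold unique labels. Unlike the PR, which collects the leftover columns with `columns.difference(...)`, this sketch keeps them in their original order, so the ordering of the trailing single-column groups may differ from what pandas produces.

```python
from __future__ import annotations

from typing import Hashable, Iterable, Sequence


def consolidate_groups(
    columns: Sequence[Hashable], groups: Iterable[Iterable[Hashable]]
) -> list[tuple[int, ...]]:
    """Turn groups of column labels into tuples of column positions.

    Hypothetical stand-in for the consolidation done in the PR; it assumes
    the labels in `columns` are unique.
    """
    position = {label: i for i, label in enumerate(columns)}
    out: list[tuple[int, ...]] = []
    seen: set[Hashable] = set()
    for group in groups:
        labels = list(group)
        # Reject labels that are not columns at all.
        missing = [label for label in labels if label not in position]
        if missing:
            raise ValueError(f"Column label(s) {missing} not found.")
        # Reject a column that already appeared in an earlier group.
        duplicates = seen.intersection(labels)
        if duplicates:
            raise ValueError(
                f"Columns {duplicates} were found in multiple subplots."
            )
        seen.update(labels)
        out.append(tuple(position[label] for label in labels))
    # Every column not named in any group gets a group of its own.
    out.extend((i,) for label, i in position.items() if label not in seen)
    return out


# Columns a..g with subplots=[(a, c), (b, f, e)], as in the code comment.
print(consolidate_groups("abcdefg", [("a", "c"), ("b", "f", "e")]))
# [(0, 2), (1, 5, 4), (3,), (6,)]
```

Returning one homogeneous list of index tuples is what lets the later plotting code treat explicit groups and leftover columns the same way.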

     def _validate_color_args(self):
         if (
             "color" in self.kwds
@@ -371,8 +478,11 @@ def _maybe_right_yaxis(self, ax: Axes, axes_num):

     def _setup_subplots(self):
         if self.subplots:
+            naxes = (
+                self.nseries if isinstance(self.subplots, bool) else len(self.subplots)
+            )
             fig, axes = create_subplots(
-                naxes=self.nseries,
+                naxes=naxes,
                 sharex=self.sharex,
                 sharey=self.sharey,
                 figsize=self.figsize,
@@ -784,9 +894,23 @@ def _get_ax_layer(cls, ax, primary=True):
         else:
             return getattr(ax, "right_ax", ax)

+    def _col_idx_to_axis_idx(self, col_idx: int) -> int:
+        """Return the index of the axis where the column at col_idx should be plotted"""
+        if isinstance(self.subplots, list):
+            # Subplots is a list: some columns will be grouped together in the same ax
+            return next(
+                group_idx
+                for (group_idx, group) in enumerate(self.subplots)
+                if col_idx in group
+            )
+        else:
Member

Is this else clause required? I guess it could be dropped.

Contributor Author

@NicolasHug NicolasHug Jun 8, 2021

It's a matter of style. I personally think this way is less ambiguous when the first return statement is deeply nested, four indentation levels in: it's easy to miss and to conclude that the function only returns col_idx. The else clause avoids that potential confusion. This choice was deliberate on my part, but I don't mind changing it.

Member

@ivanovmg ivanovmg Jun 8, 2021

I suppose that mypy started complaining once you added the return type.
Inside the loop you return only if col_idx is in group, which is the reason for the mypy error.
I suggest modifying it:

for group_idx, group in enumerate(self.subplots):
    if col_idx in group:
        return group_idx
raise KeyError(f"Failed to find {col_idx} in {self.subplots}")

Or something like this.

Contributor Author

@NicolasHug NicolasHug Jun 8, 2021

I think mypy is wrong here. Because of the validation done earlier, there must be a valid column. Raising an exception might make mypy happy, but it would never be hit (because of the validation), and we couldn't test against it, so it would reduce test coverage. I changed the for loop into a next(), which seems to trick mypy into being reasonable again.

+            # subplots is True: one ax per column
+            return col_idx
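
Editorial sketch (not part of the PR): the two lookup styles debated in the review thread above, written as standalone functions. The names `axis_index_loop` and `axis_index_next` are hypothetical, and `groups` stands in for the validated `self.subplots` list. Both return the same index for any column that belongs to a group. They differ only for a column in no group, a case the PR author argues cannot occur after validation: the loop version raises the explicit `KeyError`, while `next()` on an exhausted generator raises `StopIteration`.

```python
from __future__ import annotations


def axis_index_loop(groups: list[tuple[int, ...]], col_idx: int) -> int:
    """Reviewer's suggestion: explicit loop with a final raise."""
    for group_idx, group in enumerate(groups):
        if col_idx in group:
            return group_idx
    raise KeyError(f"Failed to find {col_idx} in {groups}")


def axis_index_next(groups: list[tuple[int, ...]], col_idx: int) -> int:
    """Style used in the PR: first match from a generator expression."""
    return next(
        group_idx for group_idx, group in enumerate(groups) if col_idx in group
    )


groups = [(0, 2), (1, 5, 4), (3,), (6,)]
# Column 5 sits in the second group, so both styles map it to axis 1.
print(axis_index_loop(groups, 5), axis_index_next(groups, 5))
```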

     def _get_ax(self, i: int):
         # get the twinx ax if appropriate
         if self.subplots:
+            i = self._col_idx_to_axis_idx(i)
             ax = self.axes[i]
             ax = self._maybe_right_yaxis(ax, i)
             self.axes[i] = ax
84 changes: 84 additions & 0 deletions pandas/tests/plotting/frame/test_frame.py
@@ -2071,6 +2071,90 @@ def test_plot_no_numeric_data(self):
         with pytest.raises(TypeError, match="no numeric data to plot"):
             df.plot()

+    @td.skip_if_no_scipy
+    @pytest.mark.parametrize(
+        "kind", ("line", "bar", "barh", "hist", "kde", "density", "area", "pie")
+    )
+    def test_group_subplot(self, kind):
+        d = {
+            "a": np.arange(10),
+            "b": np.arange(10) + 1,
+            "c": np.arange(10) + 1,
+            "d": np.arange(10),
+            "e": np.arange(10),
+        }
+        df = DataFrame(d)
+
+        axes = df.plot(subplots=[("b", "e"), ("c", "d")], kind=kind)
+        assert len(axes) == 3  # 2 groups + single column a
+
+        expected_labels = (["b", "e"], ["c", "d"], ["a"])
+        for ax, labels in zip(axes, expected_labels):
+            if kind != "pie":
+                self._check_legend_labels(ax, labels=labels)
+            if kind == "line":
+                assert len(ax.lines) == len(labels)
+
+    def test_group_subplot_series_notimplemented(self):
+        ser = Series(range(1))
+        msg = "An iterable subplots for a Series"
+        with pytest.raises(NotImplementedError, match=msg):
+            ser.plot(subplots=[("a",)])
+
+    def test_group_subplot_multiindex_notimplemented(self):
+        df = DataFrame(np.eye(2), columns=MultiIndex.from_tuples([(0, 1), (1, 2)]))
+        msg = "An iterable subplots for a DataFrame with a MultiIndex"
+        with pytest.raises(NotImplementedError, match=msg):
+            df.plot(subplots=[(0, 1)])
+
+    def test_group_subplot_nonunique_cols_notimplemented(self):
+        df = DataFrame(np.eye(2), columns=["a", "a"])
+        msg = "An iterable subplots for a DataFrame with non-unique"
+        with pytest.raises(NotImplementedError, match=msg):
+            df.plot(subplots=[("a",)])
+
+    @pytest.mark.parametrize(
+        "subplots, expected_msg",
+        [
+            (123, "subplots should be a bool or an iterable"),
+            ("a", "each entry should be a list/tuple"),  # iterable of non-iterable
+            ((1,), "each entry should be a list/tuple"),  # iterable of non-iterable
+            (("a",), "each entry should be a list/tuple"),  # iterable of strings
+        ],
+    )
+    def test_group_subplot_bad_input(self, subplots, expected_msg):
Member

Good parametrization!

+        # Make sure error is raised when subplots is not a properly
+        # formatted iterable. Only iterables of iterables are permitted, and
+        # entries should not be strings.
+        d = {"a": np.arange(10), "b": np.arange(10)}
+        df = DataFrame(d)
+
+        with pytest.raises(ValueError, match=expected_msg):
+            df.plot(subplots=subplots)
+
+    def test_group_subplot_invalid_column_name(self):
+        d = {"a": np.arange(10), "b": np.arange(10)}
+        df = DataFrame(d)
+
+        with pytest.raises(ValueError, match=r"Column label\(s\) \['bad_name'\]"):
+            df.plot(subplots=[("a", "bad_name")])
+
+    def test_group_subplot_duplicated_column(self):
+        d = {"a": np.arange(10), "b": np.arange(10), "c": np.arange(10)}
+        df = DataFrame(d)
+
+        with pytest.raises(ValueError, match="should be in only one subplot"):
Member

@ivanovmg ivanovmg Jun 8, 2021

Is it really a desirable behavior (see other comment)?

+            df.plot(subplots=[("a", "b"), ("a", "c")])
+
+    @pytest.mark.parametrize("kind", ("box", "scatter", "hexbin"))
+    def test_group_subplot_invalid_kind(self, kind):
+        d = {"a": np.arange(10), "b": np.arange(10)}
+        df = DataFrame(d)
+        with pytest.raises(
+            ValueError, match="When subplots is an iterable, kind must be one of"
+        ):
+            df.plot(subplots=[("a", "b")], kind=kind)
+
     @pytest.mark.parametrize(
         "index_name, old_label, new_label",
         [