BUG: df.pivot_table fails when margin is True and only columns is defined #31088

Merged
merged 19 commits into from
Jan 20, 2020
3 changes: 3 additions & 0 deletions asv_bench/benchmarks/reshape.py
@@ -161,6 +161,9 @@ def time_pivot_table_categorical_observed(self):
             observed=True,
         )

+    def time_pivot_table_margins_only_column(self):
+        self.df.pivot_table(columns=["key2", "key3"], margins=True)
+

 class Crosstab:
     def setup(self):
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -142,6 +142,7 @@ Reshaping

 -
 - Bug in :meth:`DataFrame.pivot_table` when only MultiIndexed columns is set (:issue:`17038`)
+- Bug in :meth:`DataFrame.pivot_table` when ``margin`` is ``True`` and only ``column`` is defined (:issue:`31016`)
 - Fix incorrect error message in :meth:`DataFrame.pivot` when ``columns`` is set to ``None``. (:issue:`30924`)
 - Bug in :func:`crosstab` when inputs are two Series and have tuple names, the output will keep dummy MultiIndex as columns. (:issue:`18321`)

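For orientation, the call pattern that this whatsnew entry refers to can be sketched as below. This is only a minimal illustration, assuming a pandas version that includes the fix. The frame reuses the `A`, `D` and `E` values from the test added in this PR and leaves out the string columns `B` and `C`, so that `"mean"` is applied to numeric data only.

```python
import pandas as pd

# Illustrative data: the "A", "D" and "E" values from the PR's test frame.
df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
        "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
        "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
    }
)

# Only ``columns`` is passed (no ``index``), together with ``margins=True``.
result = df.pivot_table(columns="A", margins=True, aggfunc="mean")
print(result)
```

Each group of `A` is followed by its own `All` column, which matches the `expected_columns` used in the parametrized test of this PR.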
36 changes: 15 additions & 21 deletions pandas/core/reshape/pivot.py
@@ -226,15 +226,7 @@ def _add_margins(

     elif values:
         marginal_result_set = _generate_marginal_results(
-            table,
-            data,
-            values,
-            rows,
-            cols,
-            aggfunc,
-            observed,
-            grand_margin,
-            margins_name,
+            table, data, values, rows, cols, aggfunc, observed, margins_name,
         )
         if not isinstance(marginal_result_set, tuple):
             return marginal_result_set
@@ -303,15 +295,7 @@ def _compute_grand_margin(data, values, aggfunc, margins_name: str = "All"):


 def _generate_marginal_results(
-    table,
-    data,
-    values,
-    rows,
-    cols,
-    aggfunc,
-    observed,
-    grand_margin,
-    margins_name: str = "All",
+    table, data, values, rows, cols, aggfunc, observed, margins_name: str = "All",
 ):
     if len(cols) > 0:
         # need to "interleave" the margins
@@ -345,12 +329,22 @@ def _all_key(key):
                 table_pieces.append(piece)
                 margin_keys.append(all_key)
         else:
-            margin = grand_margin
+            from pandas import DataFrame
+
             cat_axis = 0
             for key, piece in table.groupby(level=0, axis=cat_axis, observed=observed):
-                all_key = _all_key(key)
+                if len(cols) > 1:
+                    all_key = _all_key(key)
+                else:
+                    all_key = margins_name
                 table_pieces.append(piece)
-                table_pieces.append(Series(margin[key], index=[all_key]))
+                # GH31016 this is to calculate margin for each group, and assign
+                # corresponded key as index
+                transformed_piece = DataFrame(piece.apply(aggfunc)).T

Member

Does this have any performance impacts?

Member Author (charlesdong1991, Jan 16, 2020)

It might. I was wondering about that as well.

How should I measure it? Do you mean creating a giant mock dataset and timing it? Any suggestions? I would like to test it out!

Member

We have asv benchmarks in asv_bench/benchmarks/reshape.py; it would be good to run at least those here.

Member Author

Thanks, @WillAyd! I am not very familiar with asv, but I see some benchmarks whose names start with time_pivot. Shall I add a new one there?

Member

Run the existing ones first:

cd asv_bench
asv continuous upstream/master HEAD -b reshape.PivotTable

Member Author

[Screenshot attached: Screen Shot 2020-01-16 at 10 58 20 PM]

Just copied and pasted and tested a bit.

Member Author

> Run the existing ones first:
>
> cd asv_bench
> asv continuous upstream/master HEAD -b reshape.PivotTable

Oops, thanks for the tip! I did not see it 😅 Will run.

Member Author

Running asv on the current branch, I get: BENCHMARKS NOT SIGNIFICANTLY CHANGED.

However, I did find an issue during self-review; I will investigate and fix it tomorrow. Thanks for the help with asv, @WillAyd.

+                transformed_piece.index = Index([all_key], name=piece.index.name)

Member

@charlesdong1991 what if piece.index is a MultiIndex? .name here will be None. Do we need to use .names somewhere?

Member Author (charlesdong1991, Jul 2, 2021)

You are right, it will be None if it is a MultiIndex. I need to take a deeper look to see whether it is reasonable to use .names: that is a FrozenList, IIRC, so we could not assign it to index.name. Also, if it is a MultiIndex here, we may need to build a MultiIndex instead of an Index.
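
The `.name` versus `.names` point raised in this thread can be checked with a small snippet. This is a sketch for illustration only: the index values are made up, and the last step (building a one-row `MultiIndex` for the margin) is just one possible approach, assumed here rather than taken from this PR.

```python
import pandas as pd

single = pd.Index(["bar", "foo"], name="A")
multi = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("foo", "two")], names=["A", "B"]
)

print(single.name)        # name of the single level
print(multi.name)         # a MultiIndex has no single name
print(list(multi.names))  # the per-level names live in .names

# Hypothetical alternative: build the one-row margin index as a MultiIndex
# so that the level names are carried over.
margin_index = pd.MultiIndex.from_tuples(
    [("bar", "All")], names=list(multi.names)
)
print(margin_index)
```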

+
+                # append piece for margin into table_piece
+                table_pieces.append(transformed_piece)
                 margin_keys.append(all_key)

         result = concat(table_pieces, axis=cat_axis)
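
The per-group margin step in the hunk above can be imitated outside pandas internals on a small, already aggregated table. The following is a rough sketch under stated assumptions: `table`, `pieces` and `margin_row` are illustrative names, the margin label is hard-coded to `"All"`, the aggregation is fixed to `"mean"`, and the `axis` argument to `groupby` is omitted (rows are grouped by default).

```python
import pandas as pd

# Illustrative stand-in for the aggregated table: one row per group of "A".
table = pd.DataFrame(
    {"D": [5.5, 2.2], "E": [8.0, 4.4]},
    index=pd.Index(["bar", "foo"], name="A"),
)

pieces = []
for key, piece in table.groupby(level=0):
    pieces.append(piece)
    # Aggregate the piece column-wise, then turn the resulting Series
    # into a one-row frame labelled with the margin name.
    margin_row = pd.DataFrame(piece.apply("mean")).T
    margin_row.index = pd.Index(["All"], name=piece.index.name)
    pieces.append(margin_row)

# Interleave each group with its margin row.
result = pd.concat(pieces, axis=0)
print(result)
```

Because every group is immediately followed by its own margin row, the row labels come out in the same `bar`, `All`, `foo`, `All` order that the PR's test expects for the column labels.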
58 changes: 58 additions & 0 deletions pandas/tests/reshape/test_pivot.py
@@ -910,6 +910,64 @@ def _check_output(
             totals = table.loc[("All", ""), item]
             assert totals == self.data[item].mean()

+    @pytest.mark.parametrize(
+        "columns, aggfunc, values, expected_columns",
+        [
+            (
+                "A",
+                np.mean,
+                [[5.5, 5.5, 2.2, 2.2], [8.0, 8.0, 4.4, 4.4]],
+                Index(["bar", "All", "foo", "All"], name="A"),
+            ),
+            (
+                ["A", "B"],
+                "sum",
+                [[9, 13, 22, 5, 6, 11], [14, 18, 32, 11, 11, 22]],
+                MultiIndex.from_tuples(
+                    [
+                        ("bar", "one"),
+                        ("bar", "two"),
+                        ("bar", "All"),
+                        ("foo", "one"),
+                        ("foo", "two"),
+                        ("foo", "All"),
+                    ],
+                    names=["A", "B"],
+                ),
+            ),
+        ],
+    )
+    def test_margin_with_only_columns_defined(
+        self, columns, aggfunc, values, expected_columns
+    ):
+        # GH 31016
+        df = pd.DataFrame(
+            {
+                "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
+                "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
+                "C": [
+                    "small",
+                    "large",
+                    "large",
+                    "small",
+                    "small",
+                    "large",
+                    "small",
+                    "small",
+                    "large",
+                ],
+                "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
+                "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
+            }
+        )
+
+        result = df.pivot_table(columns=columns, margins=True, aggfunc=aggfunc)
+        expected = pd.DataFrame(
+            values, index=Index(["D", "E"]), columns=expected_columns
+        )
+
+        tm.assert_frame_equal(result, expected)
+
     def test_margins_dtype(self):
         # GH 17013
