BUG: pivot/unstack leading to too many items should raise exception #23512

Merged (15 commits) on Dec 31, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -1646,6 +1646,7 @@ Reshaping
- :meth:`DataFrame.nlargest` and :meth:`DataFrame.nsmallest` now returns the correct n values when keep != 'all' also when tied on the first columns (:issue:`22752`)
- Constructing a DataFrame with an index argument that wasn't already an instance of :class:`~pandas.core.Index` was broken (:issue:`22227`).
- Bug in :class:`DataFrame` prevented list subclasses to be used to construction (:issue:`21226`)
- Bug in :func:`DataFrame.unstack` and :func:`DataFrame.pivot_table` returning a misleading error message when the resulting DataFrame has more elements than int32 can handle. Now, the error message is improved, pointing towards the actual problem (:issue:`20601`)

.. _whatsnew_0240.bug_fixes.sparse:

2 changes: 0 additions & 2 deletions pandas/core/reshape/pivot.py
@@ -78,8 +78,6 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
pass
values = list(values)

# group by the cartesian product of the grouper
# if we have a categorical
grouped = data.groupby(keys, observed=False)
agged = grouped.agg(aggfunc)
if dropna and isinstance(agged, ABCDataFrame) and len(agged.columns):
15 changes: 15 additions & 0 deletions pandas/core/reshape/reshape.py
@@ -109,6 +109,21 @@ def __init__(self, values, index, level=-1, value_columns=None,
        self.removed_level = self.new_index_levels.pop(self.level)
        self.removed_level_full = index.levels[self.level]

        # Bug fix GH 20601
        # If the data frame is too big, the number of unique index combination
        # will cause int32 overflow on windows environments.
        # We want to check and raise an error before this happens
        num_rows = np.max([index_level.size for index_level
                           in self.new_index_levels])
        num_columns = self.removed_level.size

        # GH20601: This forces an overflow if the number of cells is too high.
        num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

        if num_rows > 0 and num_columns > 0 and num_cells <= 0:
            raise ValueError('Unstacked DataFrame is too big, '
                             'causing int32 overflow')

        self._make_sorted_values_labels()
        self._make_selectors()
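
The check in this hunk relies on signed 32-bit wraparound: when the true product of the row and column counts is too large, the int32 result can come out as zero or negative. The sketch below emulates that wraparound in plain Python rather than calling NumPy, so `wrap_int32` and `pr_style_overflow_check` are hypothetical names used only for illustration and are not pandas functions. It also suggests that a product which happens to wrap to a positive value would not be caught by a `<= 0` test.

```python
INT32_MIN = -2 ** 31
INT32_MAX = 2 ** 31 - 1


def wrap_int32(value):
    """Reduce an arbitrary Python int to its two's-complement int32 value."""
    return (value - INT32_MIN) % 2 ** 32 + INT32_MIN


def pr_style_overflow_check(num_rows, num_columns):
    """Mimic the hunk's test: flag overflow when the wrapped product is <= 0.

    Hypothetical helper; the real check lives inline in the hunk above.
    """
    num_cells = wrap_int32(num_rows * num_columns)
    return num_rows > 0 and num_columns > 0 and num_cells <= 0


# 2**16 * 2**16 == 2**32, which wraps to exactly 0 and is detected.
print(wrap_int32(2 ** 16 * 2 ** 16), pr_style_overflow_check(2 ** 16, 2 ** 16))

# 2**16 * (2**16 + 1) wraps to a positive value, so this style of check
# does not flag it.
print(wrap_int32(2 ** 16 * (2 ** 16 + 1)),
      pr_style_overflow_check(2 ** 16, 2 ** 16 + 1))
```

Under these assumptions, comparing the exact product against `INT32_MAX` would cover the positive-wrap case as well; whether that matters in practice depends on the shapes pandas actually encounters.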

11 changes: 11 additions & 0 deletions pandas/tests/reshape/test_pivot.py
@@ -1272,6 +1272,17 @@ def test_pivot_string_func_vs_func(self, f, f_numpy):
aggfunc=f_numpy)
        tm.assert_frame_equal(result, expected)

    @pytest.mark.slow
    def test_pivot_number_of_levels_larger_than_int32(self):
        # GH 20601
        df = DataFrame({'ind1': np.arange(2 ** 16),
                        'ind2': np.arange(2 ** 16),
                        'count': 0})

        with pytest.raises(ValueError, match='int32 overflow'):
            df.pivot_table(index='ind1', columns='ind2',
                           values='count', aggfunc='count')


class TestCrosstab(object):
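
The fixture in the test above is sized so that the pivoted result cannot fit in int32: `ind1` and `ind2` each hold 2 ** 16 distinct values, so the full grid would need 2 ** 32 cells, more than the int32 maximum of 2 ** 31 - 1. A minimal sketch of that arithmetic follows. `pivot_cell_count` is a hypothetical helper, not part of pandas, and it assumes one output row per distinct index key and one output column per distinct column key.

```python
INT32_MAX = 2 ** 31 - 1


def pivot_cell_count(index_keys, column_keys):
    """Estimate grid size: distinct index keys times distinct column keys."""
    return len(set(index_keys)) * len(set(column_keys))


# Same key ranges as the test fixture: 2**16 distinct values on each axis.
cells = pivot_cell_count(range(2 ** 16), range(2 ** 16))
print(cells, cells > INT32_MAX)
```

A guard of this kind could be run on the key columns before calling `pivot_table` or `unstack` on a very large frame, to fail early with a clearer message.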

9 changes: 9 additions & 0 deletions pandas/tests/test_multilevel.py
@@ -3,6 +3,7 @@
from warnings import catch_warnings, simplefilter
import datetime
import itertools

import pytest
import pytz

@@ -720,6 +721,14 @@ def test_unstack_unobserved_keys(self):
        recons = result.stack()
        tm.assert_frame_equal(recons, df)

    @pytest.mark.slow
Review comment (Contributor): same comments as above
    def test_unstack_number_of_levels_larger_than_int32(self):
        # GH 20601
        df = DataFrame(np.random.randn(2 ** 16, 2),
                       index=[np.arange(2 ** 16), np.arange(2 ** 16)])
        with pytest.raises(ValueError, match='int32 overflow'):
            df.unstack()

    def test_stack_order_with_unsorted_levels(self):
        # GH 16323
