BUG: pivot/unstack leading to too many items should raise exception #23512

Merged: 15 commits merged into pandas-dev:master on Dec 31, 2018

Conversation

sweb
Contributor

@sweb sweb commented Nov 5, 2018

@pep8speaks

Hello @sweb! Thanks for submitting the PR.

@jreback
Contributor

jreback commented Nov 5, 2018

is this like #20784 ? (you can take that over if you wish)

@sweb
Contributor Author

sweb commented Nov 5, 2018

@jreback Yes, it is quite similar; however, my changes are a bit more local. I do not attempt to compute how many rows/columns are allowed, but simply check whether the already-computed value is negative due to an integer overflow.

However, it appears that this is OS-dependent. On Unix, np.product does not produce a negative value; it either computes directly with int64 or promotes to it after reaching the int32 limit.

In my current solution I explicitly set the dtype of np.product to int32, which would change the current behavior on Unix systems.

Do you think this is a valid approach or should we set it to int64 as originally proposed?
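The platform difference described above can be made reproducible by pinning the dtype explicitly. A minimal sketch (using np.prod, the non-deprecated spelling of np.product discussed in this thread):

```python
import numpy as np

# On Windows the default integer dtype is int32, so the product of the
# level sizes can silently wrap around; on Unix it is int64. Forcing the
# dtype reproduces both behaviors on any platform.
wrapped = np.prod([2 ** 16, 2 ** 16], dtype=np.int32)  # 2**32 wraps to 0
widened = np.prod([2 ** 16, 2 ** 16], dtype=np.int64)  # 4294967296
print(wrapped, widened)
```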

@jreback
Contributor

jreback commented Nov 6, 2018

@sweb the approach in the other issue is good.

@sweb
Contributor Author

sweb commented Nov 6, 2018

@jreback okay. Since I am new here, one last quick question: can I incorporate the other PR into this one / get the other PR ready for merge, or do I have to copy it by hand and push it again? Thanks!

@jreback
Contributor

jreback commented Nov 6, 2018

You can simply cherry-pick that commit (ideally), or you can copy it over; whatever works.

@sweb sweb force-pushed the too_many_items_to_unstack branch from 7fd2601 to 59678a6 Compare November 8, 2018 11:05
@sweb
Contributor Author

sweb commented Nov 8, 2018

@jreback Thanks for the advice. I patched the commits of the other PR into this branch. Unfortunately, the proposed solution does not solve the problem, at least not in my environment.

Since this is most likely a Windows-specific error (I checked the example from #20601 on macOS and everything works fine; I have not checked it on Linux yet), the proposed solution would actually limit the capabilities of unstack().

In addition, on my Win10 machine I still get an error, since num_rows * num_columns also overflows, so the condition if num_rows * num_columns > (2 ** 31 - 1) is never met.

I modified the PR so that the numpy warning indicating the overflow is caught, instead of checking against the int32 upper bound. The tests are only run on Windows environments.

Can you replicate this? Maybe my Windows environment setup is broken.
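The warning-catching idea can be sketched as follows. This is an illustrative standalone snippet, not the pandas source: check_unstack_size is a hypothetical helper, and the error message mirrors the wording discussed in this PR.

```python
import numpy as np

def check_unstack_size(num_rows, num_columns):
    """Return the int32 cell count, or raise if it would overflow.

    NumPy scalar integer overflow reports through the errstate
    machinery, so over='raise' turns the silent wraparound into a
    catchable FloatingPointError.
    """
    try:
        with np.errstate(over='raise'):
            return np.int32(num_rows) * np.int32(num_columns)
    except FloatingPointError:
        raise ValueError('Unstacked DataFrame is too big, '
                         'causing int32 overflow')

check_unstack_size(2 ** 10, 2 ** 10)    # fine: 2**20 cells
# check_unstack_size(2 ** 16, 2 ** 16)  # would raise ValueError
```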

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 11, 2018
@@ -81,9 +81,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
pass
values = list(values)

# group by the cartesian product of the grouper
# if we have a categorical
grouped = data.groupby(keys, observed=False)
Contributor

why did you change this?

Contributor Author

Good question - this is from the original pull request and I did not know what it was for. I will change it back to the way it was.

num_rows = np.max([index_level.size for index_level
in self.new_index_levels])
num_columns = self.removed_level.size
with np.errstate(all='raise'):
Contributor

can't you just compare vs np.iinfo('int32').max?
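The suggested comparison can be sketched like this (an assumed standalone illustration, not the pandas source): widen the multiplication to int64 so it cannot wrap, then compare the cell count against the int32 maximum.

```python
import numpy as np

num_rows, num_columns = 2 ** 16, 2 ** 16

# Multiply in int64 so the product cannot wrap, then compare against
# the int32 maximum (2**31 - 1).
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
too_big = num_cells > np.iinfo(np.int32).max
print(too_big)  # True: 2**32 exceeds 2**31 - 1
```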

@pytest.mark.slow
def test_pivot_number_of_levels_larger_than_int32(self):
# GH 20601
if sys.platform == 'win32':
Contributor

use is_platform_windows (import from pandas.compat)

Contributor

why is this on windows only?

Contributor Author

I don't know 100%, but numpy appears to treat int32 differently on Windows than on Unix:
https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine/36279549

numpy/numpy#8433

On my macOS environment, the code from the issue works: it takes some time, but pandas (0.23.4) computes the expected result, while my Windows environment encounters the described error. I therefore think that this is a Windows-only issue and we probably should not limit other OSes by always checking against the int32 upper limit.

# GH 20601
if sys.platform == 'win32':
df = DataFrame({'ind1': np.arange(2 ** 16),
'ind2': np.arange(2 ** 16),
Contributor

it doesn't need to be this big; just something that exceeds int32 (~2B) is needed

Contributor Author

Can you explain this further? If I decrease the value to 2 ** 15 I do not get an overflow.
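A quick arithmetic check shows why 2 ** 16 is the smallest power-of-two level size that trips the limit when the two index levels are crossed:

```python
# (2**15)**2 = 2**30 still fits in int32, but (2**16)**2 = 2**32 does not,
# so 2**16 entries per index level are needed to trigger the overflow.
int32_max = 2 ** 31 - 1
fits = (2 ** 15) ** 2 <= int32_max      # True: 2**30 fits
overflows = (2 ** 16) ** 2 > int32_max  # True: 2**32 does not fit
print(fits, overflows)
```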

@@ -1212,6 +1214,15 @@ def test_unstack_unobserved_keys(self):
recons = result.stack()
tm.assert_frame_equal(recons, df)

@pytest.mark.slow
Contributor

same comments as above

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Nov 11, 2018
@codecov

codecov bot commented Nov 12, 2018

Codecov Report

Merging #23512 into master will decrease coverage by 60.41%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #23512       +/-   ##
===========================================
- Coverage   92.31%   31.89%   -60.42%     
===========================================
  Files         166      166               
  Lines       52412    52426       +14     
===========================================
- Hits        48382    16722    -31660     
- Misses       4030    35704    +31674
Flag        Coverage Δ
#multiple   30.29% <0%> (-60.45%) ⬇️
#single     31.89% <0%> (-11.17%) ⬇️

Impacted Files                       Coverage Δ
pandas/core/reshape/pivot.py         8.77% <ø> (-87.78%) ⬇️
pandas/core/reshape/reshape.py       7.97% <0%> (-91.6%) ⬇️
pandas/io/formats/latex.py           0% <0%> (-100%) ⬇️
pandas/core/categorical.py           0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py       0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py           0% <0%> (-100%) ⬇️
pandas/tseries/converter.py          0% <0%> (-100%) ⬇️
pandas/io/formats/html.py            0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py   0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py            0% <0%> (-91.17%) ⬇️
... and 128 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5e5bf7...a3cdbca. Read the comment docs.

@jreback
Contributor

jreback commented Dec 3, 2018

can you merge master

@sweb
Contributor Author

sweb commented Dec 3, 2018

Master was merged!

@jreback
Contributor

jreback commented Dec 23, 2018

This should raise regardless of the platform; the OP's example is simply too large at ~4B entries and is unsupportable.

if you want to merge master and update will look.

@sweb
Contributor Author

sweb commented Dec 30, 2018

The ValueError is now raised regardless of the OS.

@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
Contributor

@jreback jreback left a comment

lgtm. If you can add a whatsnew, put it in the reshaping bug fix section, saying that the error message is now improved for pivoting with > int32.max uniques (obviously make a nice sentence about this).

@sweb
Contributor Author

sweb commented Dec 31, 2018

Whatsnew added, thank you for your help!

@jreback jreback merged commit d85a5c3 into pandas-dev:master Dec 31, 2018
@jreback
Contributor

jreback commented Dec 31, 2018

thanks!

@sweb sweb deleted the too_many_items_to_unstack branch December 31, 2018 15:51
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Successfully merging this pull request may close these issues.

ERR: pivot_table when number of levels larger than int32 range
4 participants