BUG: pivot/unstack leading to too many items should raise exception #23512

Merged: 15 commits merged into pandas-dev:master on Dec 31, 2018

Conversation

sweb
Contributor

@sweb sweb commented Nov 5, 2018

@pep8speaks

Hello @sweb! Thanks for submitting the PR.

@jreback
Contributor

jreback commented Nov 5, 2018

is this like #20784 ? (you can take that over if you wish)

@sweb
Contributor Author

sweb commented Nov 5, 2018

@jreback Yes, it is quite similar; however, my changes are a bit more local. I do not attempt to compute how many rows/columns are allowed, but simply check whether the already-computed value is negative due to an integer overflow.

However, it appears that this is OS-dependent. On Unix, np.product does not produce a negative value; it either computes directly with int64 or promotes to it after reaching the int32 limit.

In my current solution I explicitly set the dtype of np.product to int32, which would change the current behavior on Unix systems.

Do you think this is a valid approach or should we set it to int64 as originally proposed?
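The platform difference described above can be made reproducible by pinning the dtype explicitly. A minimal sketch (using np.prod, the non-deprecated spelling of np.product discussed in this thread):

```python
import numpy as np

# On Windows the default integer dtype is int32, so the product of the
# level sizes can silently wrap around; on Unix it is int64. Forcing the
# dtype reproduces both behaviors on any platform.
wrapped = np.prod([2 ** 16, 2 ** 16], dtype=np.int32)  # 2**32 wraps to 0
widened = np.prod([2 ** 16, 2 ** 16], dtype=np.int64)  # 4294967296
print(wrapped, widened)
```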

@jreback
Contributor

jreback commented Nov 6, 2018

@sweb the approach in the other issue is good.

@sweb
Contributor Author

sweb commented Nov 6, 2018

@jreback okay. Since I am new here, one last quick question: can I incorporate the other PR into this one / get the other PR ready for merge, or do I have to copy it by hand and push it again? Thanks!

@jreback
Contributor

jreback commented Nov 6, 2018

You can simply cherry-pick that commit (ideally), or you can copy it over; whatever works.

@sweb sweb force-pushed the too_many_items_to_unstack branch from 7fd2601 to 59678a6 Compare November 8, 2018 11:05
@sweb
Contributor Author

sweb commented Nov 8, 2018

@jreback Thanks for the advice. I patched the commits of the other PR into this branch. Unfortunately, the proposed solution does not solve the problem, at least not in my environment.

Since this is most likely a Windows-specific error (I checked the example from #20601 on macOS and everything works fine; I have not checked it on Linux yet), the proposed solution would actually limit the capabilities of unstack().

In addition, on my Win10 machine I still get an error, since num_rows * num_columns also overflows, so the condition if num_rows * num_columns > (2 ** 31 - 1) is never met.

I modified the PR so that the numpy warning indicating the overflow is caught, instead of checking against the int32 upper bound. The tests are only run on Windows environments.

Can you replicate this? Maybe my Windows environment setup is broken.
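The warning-catching idea can be sketched as follows. This is an illustrative standalone snippet, not the pandas source: check_unstack_size is a hypothetical helper, and the error message mirrors the wording discussed in this PR.

```python
import numpy as np

def check_unstack_size(num_rows, num_columns):
    """Return the int32 cell count, or raise if it would overflow.

    NumPy scalar integer overflow reports through the errstate
    machinery, so over='raise' turns the silent wraparound into a
    catchable FloatingPointError.
    """
    try:
        with np.errstate(over='raise'):
            return np.int32(num_rows) * np.int32(num_columns)
    except FloatingPointError:
        raise ValueError('Unstacked DataFrame is too big, '
                         'causing int32 overflow')

check_unstack_size(2 ** 10, 2 ** 10)    # fine: 2**20 cells
# check_unstack_size(2 ** 16, 2 ** 16)  # would raise ValueError
```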

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 11, 2018
@@ -81,9 +81,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
pass
values = list(values)

# group by the cartesian product of the grouper
# if we have a categorical
grouped = data.groupby(keys, observed=False)
Contributor

why did you change this?

Contributor Author

Good question - this is from the original pull request and I did not know what it was for. I will change it back to the way it was.

num_rows = np.max([index_level.size for index_level
in self.new_index_levels])
num_columns = self.removed_level.size
with np.errstate(all='raise'):
Contributor

can't you just compare vs np.iinfo('int32').max?
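The suggested comparison can be sketched like this (an assumed standalone illustration, not the pandas source): widen the multiplication to int64 so it cannot wrap, then compare the cell count against the int32 maximum.

```python
import numpy as np

num_rows, num_columns = 2 ** 16, 2 ** 16

# Multiply in int64 so the product cannot wrap, then compare against
# the int32 maximum (2**31 - 1).
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
too_big = num_cells > np.iinfo(np.int32).max
print(too_big)  # True: 2**32 exceeds 2**31 - 1
```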

@pytest.mark.slow
def test_pivot_number_of_levels_larger_than_int32(self):
# GH 20601
if sys.platform == 'win32':
Contributor

use is_platform_windows (import from pandas.compat)

Contributor

why is this on windows only?

Contributor Author

I don't know 100%, but numpy appears to treat int32 differently on Windows than on Unix:
https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine/36279549

numpy/numpy#8433

On my macOS environment, the code from the issue works: it takes some time, but pandas (0.23.4) computes the expected result, while my Windows environment encounters the described error. I therefore think that this is a Windows-only issue and we probably should not limit other OSes by always checking against the int32 upper limit.

# GH 20601
if sys.platform == 'win32':
df = DataFrame({'ind1': np.arange(2 ** 16),
'ind2': np.arange(2 ** 16),
Contributor

it doesn't need to be this big; just something that exceeds int32 (~2B) is needed

Contributor Author

Can you explain this further? If I decrease the value to 2 ** 15 I do not get an overflow.
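A quick arithmetic check shows why 2 ** 16 is the smallest power-of-two level size that trips the limit when the two index levels are crossed:

```python
# (2**15)**2 = 2**30 still fits in int32, but (2**16)**2 = 2**32 does not,
# so 2**16 entries per index level are needed to trigger the overflow.
int32_max = 2 ** 31 - 1
fits = (2 ** 15) ** 2 <= int32_max      # True: 2**30 fits
overflows = (2 ** 16) ** 2 > int32_max  # True: 2**32 does not fit
print(fits, overflows)
```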

@@ -1212,6 +1214,15 @@ def test_unstack_unobserved_keys(self):
recons = result.stack()
tm.assert_frame_equal(recons, df)

@pytest.mark.slow
Contributor

same comments as above

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Nov 11, 2018
@codecov

codecov bot commented Nov 12, 2018

Codecov Report

Merging #23512 into master will decrease coverage by 60.41%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #23512       +/-   ##
===========================================
- Coverage   92.31%   31.89%   -60.42%     
===========================================
  Files         166      166               
  Lines       52412    52426       +14     
===========================================
- Hits        48382    16722    -31660     
- Misses       4030    35704    +31674
Flag        Coverage Δ
#multiple   30.29% <0%> (-60.45%) ⬇️
#single     31.89% <0%> (-11.17%) ⬇️

Impacted Files                       Coverage Δ
pandas/core/reshape/pivot.py         8.77% <ø> (-87.78%) ⬇️
pandas/core/reshape/reshape.py       7.97% <0%> (-91.6%) ⬇️
pandas/io/formats/latex.py           0% <0%> (-100%) ⬇️
pandas/core/categorical.py           0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py       0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py           0% <0%> (-100%) ⬇️
pandas/tseries/converter.py          0% <0%> (-100%) ⬇️
pandas/io/formats/html.py            0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py   0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py            0% <0%> (-91.17%) ⬇️
... and 128 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5e5bf7...a3cdbca. Read the comment docs.

@jreback
Contributor

jreback commented Dec 3, 2018

can you merge master

@sweb
Contributor Author

sweb commented Dec 3, 2018

Master was merged!

@jreback
Contributor

jreback commented Dec 23, 2018

This should raise regardless of the platform; the OP's example is simply too large at ~4B entries and is unsupportable.

if you want to merge master and update will look.

@sweb
Contributor Author

sweb commented Dec 30, 2018

The ValueError is now raised regardless of the OS.

@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
Contributor

@jreback jreback left a comment

lgtm. If you can add a whatsnew, put it in the reshaping bug fix section, saying that the error message is now improved for pivoting with > int32.max uniques (obviously make a nice sentence about this).

@sweb
Contributor Author

sweb commented Dec 31, 2018

Whatsnew added, thank you for your help!

@jreback jreback merged commit d85a5c3 into pandas-dev:master Dec 31, 2018
@jreback
Contributor

jreback commented Dec 31, 2018

thanks!

@sweb sweb deleted the too_many_items_to_unstack branch December 31, 2018 15:51
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Successfully merging this pull request may close these issues.

ERR: pivot_table when number of levels larger than int32 range
4 participants