-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: pivot/unstack leading to too many items should raise exception #23512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @sweb! Thanks for submitting the PR.
|
is this like #20784 ? (you can take that over if you wish) |
@jreback Yes, it is quite similar, however my changes are a bit more local. I do not attempt to compute how many rows/cols are allowed but just check whether the already computed value is negative due to an integer overflow. However, it appears that this is OS dependent. On Unix In my current solution I explicitly set Do you think this is a valid approach or should we set it to |
@sweb the approach in the other issue is good. |
@jreback okay. Since I am new here one last quick question: Is it possible for me to incorporate the other PR into this / get the other PR ready for merge or do I have to copy it by hand and push it again? Thanks! |
you can simply cherry-pick that commit (ideally). or you can copy it whatever works. |
… larger than int32
…: in pivot_table and unstack
* Modify tests to only cover windows platforms
7fd2601
to
59678a6
Compare
@jreback Thanks for the advice. I patched the commits of the other PR into this branch. Unfortunately, the proposed solution does not solve the problem - at least not on my environment. Since this is most likely a Windows-specific error (I checked the example from #20601 on MacOs and everything works fine - I did not check it on Linux yet) the proposed solution would actually limit the capabilities of unstack(). In addition, on my Win10 machine, I still get an error, since I modified the PR in that a numpy warning, indicating the overflow is caught instead of checking the upper bound of int32. The tests are only applied to Windows environments. Can you replicate this? Maybe my Windows environment setup is screwed up. |
pandas/core/reshape/pivot.py
Outdated
@@ -81,9 +81,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', | |||
pass | |||
values = list(values) | |||
|
|||
# group by the cartesian product of the grouper | |||
# if we have a categorical | |||
grouped = data.groupby(keys, observed=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question - this is from the original pull request and I did not know what it was for. I will change it back to the way it was.
pandas/core/reshape/reshape.py
Outdated
num_rows = np.max([index_level.size for index_level | ||
in self.new_index_levels]) | ||
num_columns = self.removed_level.size | ||
with np.errstate(all='raise'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you just compare vs np.iinfo('int32').max?
pandas/tests/reshape/test_pivot.py
Outdated
@pytest.mark.slow | ||
def test_pivot_number_of_levels_larger_than_int32(self): | ||
# GH 20601 | ||
if sys.platform == 'win32': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_platform_windows
(import from pandas.compat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this on windows only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know 100% percent, but numpy appears to treat int32 differently on Windows than on Unix:
https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine/36279549
On my MacOs environment, the code from the issue works - takes some time but pandas (0.23.4) computes the expected result while my Windows environment encounters the described error. I therefore think, that this is a Windows only issue and we probably should not limit other OS versions by always checking against the int32 upper limit.
pandas/tests/reshape/test_pivot.py
Outdated
# GH 20601 | ||
if sys.platform == 'win32': | ||
df = DataFrame({'ind1': np.arange(2 ** 16), | ||
'ind2': np.arange(2 ** 16), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't need to be this big, just something that exceeds int32 (2B) is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this further? If I decrease the value to 2 ** 15
I do not get an overflow.
@@ -1212,6 +1214,15 @@ def test_unstack_unobserved_keys(self): | |||
recons = result.stack() | |||
tm.assert_frame_equal(recons, df) | |||
|
|||
@pytest.mark.slow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as above
Codecov Report
@@ Coverage Diff @@
## master #23512 +/- ##
===========================================
- Coverage 92.31% 31.89% -60.42%
===========================================
Files 166 166
Lines 52412 52426 +14
===========================================
- Hits 48382 16722 -31660
- Misses 4030 35704 +31674
Continue to review full report at Codecov.
|
can you merge master |
Master was merged! |
This should raise regardless of the platform, the OP question is simply too large at 4B entries and is unsupportable. if you want to merge master and update will look. |
The ValueError now raises regardless of OS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. if you can add a whatsnew, put in reshaping bug fix section. saying that the error message is now improved, for pivoting with > int32.max uniques (obviously make a nice sentence about this).
Whatsnew added, thank you for your help! |
thanks! |
git diff upstream/master -u -- "*.py" | flake8 --diff