-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH GH20601 raise error when pivot table's number of levels > int32 #20784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH GH20601 raise error when pivot table's number of levels > int32 #20784
Conversation
Hello @anhqle! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 31, 2018 at 17:22 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #20784 +/- ##
==========================================
- Coverage 92.07% 92.06% -0.01%
==========================================
Files 170 170
Lines 50688 50693 +5
==========================================
+ Hits 46671 46672 +1
- Misses 4017 4021 +4
Continue to review full report at Codecov.
|
generally use the original PR and push to it. having a new one is confusing. |
pandas/core/reshape/pivot.py
Outdated
@@ -29,6 +29,11 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', | |||
index = _convert_by(index) | |||
columns = _convert_by(columns) | |||
|
|||
num_rows = data.reindex(index, axis='columns').shape[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are doing extra work here (e.g. the reindex), can we not do the calculation directly?
pandas/core/reshape/pivot.py
Outdated
@@ -29,6 +29,11 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', | |||
index = _convert_by(index) | |||
columns = _convert_by(columns) | |||
|
|||
num_rows = data.reindex(index, axis='columns').shape[0] | |||
num_columns = data.reindex(columns, axis='columns').shape[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this is error is now in 2 places
pandas/core/reshape/pivot.py
Outdated
num_rows = (data.reindex(columns=index).drop_duplicates().shape[0] | ||
if index else 1) | ||
num_cols = (data.reindex(columns=columns).drop_duplicates().shape[0] | ||
if columns else 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is to get an accurate size of the resulting pivot table, taking into account potential duplicates.
However I recognize that the performance hit may not be worth it given that this is such an edge case. It's up to you whether we should just raise the error within unstack
.
My apology about creating a new PR. I Thanks again for your guidance -- I'm learning my way through the codebase and hope to be an active contributor in the future. |
your u should rebase against upstream every time you push - it makes it so your code is not out of date wrt to master |
Do you think this PR is ready to be merged? Happy to do any additional work. |
can you rebase |
… larger than int32
…: in pivot_table and unstack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note, bug fixes in reshaping for 0.24.0
pandas/core/reshape/reshape.py
Outdated
@@ -126,6 +126,13 @@ def __init__(self, values, index, level=-1, value_columns=None, | |||
self.removed_level = self.new_index_levels.pop(self.level) | |||
self.removed_level_full = index.levels[self.level] | |||
|
|||
num_rows = np.max([index_level.size for index_level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on the checking here
can you rebase |
closing in favor of #23512 |
Because the error of number of levels exceeding int32 can happen both in
pivot_table
andunstack
, I catch the error as early as possible in both places.Arguments for and against catching the error in
pivot_table
:pivot_table
does aggregation before callingunstack
, so it takes a while before the error is raisedpivot_table
ultimately callsunstack
, so it's not worth it to check for the error in both placesI'm happy to go with either way as you prefer.
git diff upstream/master -u -- "*.py" | flake8 --diff