ENH GH20601 raise error when pivot table's number of levels > int32 #20784

anhqle · 2018-04-22T06:43:58Z

Because the int32 overflow caused by too many levels can occur in both pivot_table and unstack, I catch the error as early as possible in both places.

Arguments for and against catching the error in pivot_table:

For: pivot_table aggregates before calling unstack, so without an early check it takes a while before the error is raised
Against: pivot_table ultimately calls unstack, so checking for the error in both places is redundant

I'm happy to go either way, as you prefer.
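
As a rough illustration of the failure mode, and assuming (the thread does not show where) that the cell count ends up in 32-bit integer arithmetic somewhere in the reshape path, the sketch below shows how the product of two modest-looking counts silently wraps in NumPy int32 arrays. The counts are made up for the example.

```python
import numpy as np

# Hypothetical counts of distinct row keys and column keys.
n_rows, n_cols = 100_000, 100_000

# Exact product using Python's arbitrary-precision integers.
exact = n_rows * n_cols

# The same product computed element-wise on int32 arrays wraps around
# without raising, because NumPy integer array arithmetic is modular.
wrapped = int((np.array([n_rows], dtype=np.int32)
               * np.array([n_cols], dtype=np.int32))[0])

int32_max = int(np.iinfo(np.int32).max)
print(exact > int32_max)   # whether the true size exceeds the int32 range
print(wrapped == exact)    # whether the int32 result still matches it
```

Comparing the exact Python integer against the int32 maximum before any array is allocated is what makes an early, explicit error possible.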

[x] closes ERR: pivot_table when number of levels larger than int32 range #20601
[x] tests added / passed
[x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
[x] whatsnew entry

pep8speaks · 2018-04-22T06:44:04Z

Hello @anhqle! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 31, 2018 at 17:22 Hours UTC

codecov · 2018-04-22T07:46:54Z

Codecov Report

Merging #20784 into master will decrease coverage by <.01%.
The diff coverage is 80%.

@@            Coverage Diff             @@
##           master   #20784      +/-   ##
==========================================
- Coverage   92.07%   92.06%   -0.01%     
==========================================
  Files         170      170              
  Lines       50688    50693       +5     
==========================================
+ Hits        46671    46672       +1     
- Misses       4017     4021       +4

Flag	Coverage Δ
#multiple	`90.47% <80%> (-0.01%)`	⬇️
#single	`42.28% <0%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/pivot.py	`97.03% <100%> (ø)`	⬆️
pandas/core/reshape/reshape.py	`99.57% <75%> (-0.22%)`	⬇️
pandas/core/dtypes/common.py	`94.87% <0%> (-0.34%)`	⬇️
pandas/util/testing.py	`85.69% <0%> (-0.21%)`	⬇️
pandas/core/arrays/datetimes.py	`95.44% <0%> (-0.03%)`	⬇️
pandas/core/dtypes/dtypes.py	`96.05% <0%> (+0.02%)`	⬆️
pandas/core/indexes/period.py	`93.5% <0%> (+0.1%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c272c52...7e6246c.

jreback · 2018-04-22T14:28:40Z

generally, use the original PR and push to it; opening a new one is confusing.

jreback · 2018-04-22T14:29:48Z

pandas/core/reshape/pivot.py

@@ -29,6 +29,11 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
    index = _convert_by(index)
    columns = _convert_by(columns)

+    num_rows = data.reindex(index, axis='columns').shape[0]


you are doing extra work here (e.g. the reindex), can we not do the calculation directly?
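
One reading of this comment, offered as an assumption rather than the reviewer's stated intent: reindexing along the columns axis never changes the number of rows, so `.shape[0]` of the reindexed frame equals the row count of the original frame. A small check with a made-up frame (the column names are hypothetical):

```python
import pandas as pd

# Made-up data for illustration only.
data = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "item": ["x", "y", "x", "x"],
    "sales": [1, 2, 3, 4],
})

# Selecting columns via reindex builds a new frame first...
via_reindex = data.reindex(["store"], axis="columns").shape[0]

# ...but the row count is already available on the original frame.
direct = len(data)

print(via_reindex, direct)
```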

jreback · 2018-04-22T14:30:34Z

pandas/core/reshape/pivot.py

@@ -29,6 +29,11 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
    index = _convert_by(index)
    columns = _convert_by(columns)

+    num_rows = data.reindex(index, axis='columns').shape[0]
+    num_columns = data.reindex(columns, axis='columns').shape[0]


also, this error is now raised in 2 places

anhqle · 2018-04-22T18:03:10Z

pandas/core/reshape/pivot.py

+    num_rows = (data.reindex(columns=index).drop_duplicates().shape[0]
+                if index else 1)
+    num_cols = (data.reindex(columns=columns).drop_duplicates().shape[0]
+                if columns else 1)


The goal here is to get an accurate size of the resulting pivot table, taking into account potential duplicates.

However, I recognize that the performance hit may not be worth it given that this is such an edge case. It's up to you whether we should just raise the error within unstack.
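
A minimal sketch of the counting described above, run on a made-up frame. The helper name `estimate_pivot_shape` and the column names are hypothetical, and the comparison at the end is only one way to use the counts, not the PR's actual code.

```python
import numpy as np
import pandas as pd

INT32_MAX = int(np.iinfo(np.int32).max)


def estimate_pivot_shape(data, index, columns):
    """Count distinct key combinations for the row keys and column keys."""
    num_rows = (data.reindex(columns=index).drop_duplicates().shape[0]
                if index else 1)
    num_cols = (data.reindex(columns=columns).drop_duplicates().shape[0]
                if columns else 1)
    return num_rows, num_cols


# Made-up data for illustration only.
data = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "item": ["x", "y", "x", "x"],
    "sales": [1, 2, 3, 4],
})

num_rows, num_cols = estimate_pivot_shape(data, ["store"], ["item"])

# Plain Python ints cannot overflow, so this comparison is safe.
too_big = num_rows * num_cols > INT32_MAX
print(num_rows, num_cols, too_big)
```

Note that `drop_duplicates` has to scan every row of the selected columns, which is the performance cost weighed in the comment above.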

anhqle · 2018-04-22T18:10:32Z

My apologies for creating a new PR. I ran pull --rebase upstream master in the original PR and the diff got polluted by other commits. Am I supposed to pull --rebase upstream master myself, or are the maintainers in charge of the final merge into master?

Thanks again for your guidance -- I'm learning my way through the codebase and hope to be an active contributor in the future.

jreback · 2018-04-22T23:34:20Z

you should rebase against upstream every time you push; that keeps your code from being out of date with respect to master
but you don’t need to squash

anhqle · 2018-06-09T16:34:19Z

Do you think this PR is ready to be merged? Happy to do any additional work.

jreback · 2018-07-28T14:37:45Z

can you rebase

jreback

can you add a whatsnew note under bug fixes (reshaping) for 0.24.0

jreback · 2018-07-31T13:14:03Z

pandas/core/reshape/reshape.py

@@ -126,6 +126,13 @@ def __init__(self, values, index, level=-1, value_columns=None,
        self.removed_level = self.new_index_levels.pop(self.level)
        self.removed_level_full = index.levels[self.level]

+        num_rows = np.max([index_level.size for index_level


can you add a comment on the checking here
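
One possible shape for such a commented check, written as a standalone function so it can be run on its own. The function name, its arguments, and the error message are hypothetical. The diff line above is truncated, so taking the largest remaining level size as the row bound is an assumption, not something the thread confirms.

```python
INT32_MAX = 2**31 - 1


def check_unstack_size(remaining_level_sizes, removed_level_size):
    """Raise ValueError if the estimated unstacked result is too large.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Row estimate: the largest of the index levels that stay in the
    # index after unstacking (assumed, see the note above).
    num_rows = max(remaining_level_sizes, default=0)

    # Each value of the removed level becomes one block of columns.
    num_columns = removed_level_size

    # Python ints have arbitrary precision, so this product is exact
    # and can be compared against the int32 limit without wrapping.
    num_cells = num_rows * num_columns

    if num_cells > INT32_MAX:
        raise ValueError(
            "Unstacked result would have {} cells, which exceeds the "
            "int32 range".format(num_cells)
        )
    return num_cells


print(check_unstack_size([3, 5], 2))
```

Raising before any reshaping work starts keeps the failure cheap and gives the user a message that names the actual problem.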

jreback · 2018-09-25T14:08:43Z

can you rebase

jreback · 2018-11-11T16:29:17Z

closing in favor of #23512

anhqle mentioned this pull request Apr 22, 2018

ENH GH20601 raise error when pivot table's number of levels > int32 #20709

Closed

jreback added the Reshaping, Error Reporting, and Numeric Operations labels Apr 22, 2018

jreback requested changes Apr 22, 2018

anhqle commented Apr 22, 2018

anhqle added 7 commits July 30, 2018 11:40

1f5ed03 ENH GH20601 raise an error when the number of levels in a pivot table larger than int32
e78e82a TST add a test for pivot table large number of levels causing int32 overflow
5d773ef CLN PEP8 compliance
01a7943 ENH catch the int32 overflow error earlier and in two separate places: in pivot_table and unstack
0efaa8e CLN PEP8 compliance
8edc9a0 ENH calculate size of the resulting pivot table and raise error if it's too big
f2021f1 rebase onto upstream master

jreback requested changes Jul 31, 2018

jreback added this to the 0.24.0 milestone Jul 31, 2018

7e6246c DOC add whatsnew and comments explaining the bug fix

jreback mentioned this pull request Nov 5, 2018

BUG: pivot/unstack leading to too many items should raise exception #23512

Merged

4 tasks

jreback removed this from the 0.24.0 milestone Nov 6, 2018

jreback closed this Nov 11, 2018
