ERR: pivot_table when number of levels larger than int32 range #20601

mklwong · 2018-04-04T02:20:11Z

Code Sample, a copy-pastable example if possible

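import pandas as pd
# 2,675,200 stacked rows with 1,337,600 distinct 'ind1' values and 3,040 distinct 'ind2' values
dat = pd.DataFrame({'ind1': list(range(1337600)) * 2,
                    'ind2': list(range(3040)) * 2 * 440,
                    'count': [1] * 2 * 1337600})
dat.pivot_table(index='ind1', columns='ind2', values='count', aggfunc='count')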

Problem description

Above code raises the following error:

  File "..\pandas\core\reshape\reshape.py", line 144, in _make_selectors
    mask = np.zeros(np.prod(self.full_shape), dtype=bool)

ValueError: negative dimensions are not allowed

np.prod(self.full_shape) appears to return a negative value because the number of index/column combinations (1337600 x 3040 = 4,066,304,000) exceeds the largest int32 value (2,147,483,647), so the product overflows.

If line 144 were changed to the following, the issue could be fixed:

mask = np.zeros(np.prod(self.full_shape,dtype=np.int64), dtype=bool)
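
The overflow can be reproduced outside pandas. The following is a minimal sketch, assuming the shape from the example above. It forces int32 accumulation explicitly, to mimic a platform where NumPy's default integer is 32-bit (as appears to be the case in the reported Windows setup); the variable names are illustrative only.

```python
import numpy as np

# Shape taken from the example: distinct 'ind1' values x distinct 'ind2' values.
full_shape = (1337600, 3040)

# Force 32-bit accumulation to mimic a platform whose default NumPy integer
# is int32. Integer overflow in the reduction wraps around silently.
wrapped = np.prod(np.array(full_shape, dtype=np.int32), dtype=np.int32)

# Accumulating in int64 gives the exact cell count.
exact = np.prod(full_shape, dtype=np.int64)

print("int32 product:", wrapped)  # negative after wrap-around
print("int64 product:", exact)    # 4066304000
print("int32 max:    ", np.iinfo(np.int32).max)
```

A negative value passed to np.zeros is what produces "negative dimensions are not allowed".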

Expected Output

A 1337600 x 3040 DataFrame.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.12.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jreback · 2018-04-04T02:30:16Z

you are creating a frame with 4B entries? what are you going to do with that?

mklwong · 2018-04-04T02:53:09Z

I have a dataset with over a million samples and a few thousand features which I'm preprocessing. The raw data is currently in stacked (long) form, and that is what I'm attempting to reshape with pivot_table.

gfyoung · 2018-04-10T04:21:18Z

mask = np.zeros(np.prod(self.full_shape,dtype=np.int64), dtype=bool)

I suppose we could do this, but this also has a noticeable memory impact (1 byte vs 8 bytes). Does this patch actually fix your problem?
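
For reference on the memory question, here is a small sketch under the assumption that only the dtype passed to np.prod changes. As far as I can tell, that dtype affects the scalar holding the element count, while the mask itself stays a bool array at one byte per element. The shape below is a small stand-in so the example runs quickly.

```python
import numpy as np

# Small stand-in shape (an assumption for illustration, not the reported data).
shape = (1000, 300)

# Element count computed with two different accumulator dtypes.
n32 = np.prod(shape, dtype=np.int32)
n64 = np.prod(shape, dtype=np.int64)

# The mask dtype is bool in both cases; only the size scalar differs.
mask32 = np.zeros(n32, dtype=bool)
mask64 = np.zeros(n64, dtype=bool)

print(n32.dtype, mask32.itemsize, mask32.nbytes)
print(n64.dtype, mask64.itemsize, mask64.nbytes)
```

Under this reading, the larger cost at the reported size would come from the mask having about 4 billion elements, not from the int64 scalar.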

mklwong · 2018-04-10T05:04:28Z

It does fix the problem, but I do understand the memory impact it could cause.

If this patch has too much of a negative impact on memory usage, I think raising an error indicating that np.prod will overflow when the pivot table is too large would be enough. That would tell the user to reduce the size of the data, whereas the ValueError that was actually raised made the problem hard to track down.

Thanks!

gfyoung · 2018-04-10T05:12:31Z

I'm more inclined to go with an error message, as this use case rarely shows up. Feel free to do a PR!

jreback · 2018-04-10T12:23:49Z

@mklwong I think showing an error message is reasonable here. Note you can actually catch this in the _Unstacker constructor (or possibly earlier)
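
One way such an early check could look is sketched below. This is a hypothetical illustration, not the code that was merged into pandas: the helper name check_unstack_size and the message wording are assumptions. It computes the cell count with Python integers, which do not overflow, and compares it against the int32 limit before any mask is allocated.

```python
import math

import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647


def check_unstack_size(full_shape):
    """Hypothetical helper: fail early with a readable message.

    Returns the exact number of cells if it fits in int32, otherwise raises.
    """
    # Python ints have arbitrary precision, so this product cannot wrap around.
    num_cells = math.prod(int(n) for n in full_shape)
    if num_cells > INT32_MAX:
        raise ValueError(
            f"Result would have {num_cells} cells, which exceeds the int32 "
            f"limit of {INT32_MAX}; reduce the number of index or column levels."
        )
    return num_cells


try:
    check_unstack_size((1337600, 3040))
except ValueError as err:
    print(err)
```

Calling a check like this from the constructor would surface the size problem directly instead of the "negative dimensions" message.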

Sandy4321 · 2024-01-22T22:29:23Z

still has this error for big data

gfyoung added the Numeric Operations and Algos labels Apr 10, 2018

jreback added the Error Reporting, Effort Low, and good first issue labels Apr 10, 2018

jreback added this to the Next Major Release milestone Apr 10, 2018

jreback changed the title ~~Bug in pivot_table when number of levels larger than int32 range~~ ERR: pivot_table when number of levels larger than int32 range Apr 10, 2018

This was referenced Apr 16, 2018

ENH GH20601 raise error when pivot table's number of levels > int32 #20709

Closed

ENH GH20601 raise error when pivot table's number of levels > int32 #20784

Closed

jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 31, 2018

sweb mentioned this issue Nov 5, 2018

BUG: pivot/unstack leading to too many items should raise exception #23512

Merged

4 tasks

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

jreback modified the milestones: Contributions Welcome, 0.24.0 Dec 30, 2018

jreback closed this as completed in #23512 Dec 31, 2018

Rblivingstone mentioned this issue Jun 27, 2019

Pivot / unstack on large data frame does not work int32 overflow #26314

Closed
