ERR: pivot_table when number of levels larger than int32 range #20601
Comments
You are creating a frame with 4B entries? What are you going to do with that?
I have a dataset with over a million samples and a few thousand features which I'm preprocessing. The raw data is in stacked form at the moment, which is what I'm attempting to pivot_table.
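For readers unfamiliar with the stacked layout, here is a minimal sketch of the same kind of pivot at toy scale. The column names `sample`, `feature`, and `value` are assumptions for illustration and are not taken from the reporter's data.

```python
import pandas as pd

# Hypothetical stacked (long) data: one row per (sample, feature) pair.
stacked = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "feature": ["f1", "f2", "f1", "f2"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Pivot to wide form: one row per sample, one column per feature.
# pivot_table aggregates duplicates with the mean by default.
wide = pd.pivot_table(stacked, index="sample", columns="feature", values="value")
```

The wide result has one cell per (sample, feature) combination, so its size grows with the product of the two level counts, which is the quantity that overflows in this issue.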
mask = np.zeros(np.prod(self.full_shape, dtype=np.int64), dtype=bool)
I suppose we could do this, but this also has a noticeable memory impact (1 byte vs 8 bytes). Does this patch actually fix your problem?
It does fix the problem, but I do understand the memory impact it could cause. If this patch has too much of a negative impact on memory usage, I think throwing an error indicating that np.prod will lead to integer overflow when the pivot table is too large would be enough. That would tell the user that they need to reduce the size of the data to resolve the problem. The ValueError that actually got raised made the problem hard to track down. Thanks!
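For a sense of scale, here is a rough sizing sketch for the 1337600 x 3040 result described in this issue. It assumes a 1-byte boolean mask entry and 8-byte float64 values per cell; the actual dtypes in the reporter's data are not stated, so treat the second figure as an assumption.

```python
# Dimensions of the expected wide frame, as given in the issue.
n_rows, n_cols = 1337600, 3040
n_cells = n_rows * n_cols  # Python ints do not overflow

BOOL_ITEMSIZE = 1     # bytes per element of a NumPy bool array
FLOAT64_ITEMSIZE = 8  # bytes per element, assuming float64 values

mask_bytes = n_cells * BOOL_ITEMSIZE
values_bytes = n_cells * FLOAT64_ITEMSIZE

print(f"cells:  {n_cells:,}")
print(f"mask:   {mask_bytes / 2**30:.2f} GiB")
print(f"values: {values_bytes / 2**30:.2f} GiB")
```

Even with the overflow fixed, a dense result of this size needs several gigabytes for the mask alone, which is part of why an explicit error may be the more practical outcome.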
I'm more inclined to go with an error message, as this use case rarely shows up. Feel free to do a PR!
@mklwong I think showing an error message is reasonable here. Note you can actually catch this in the _Unstacker constructor (or possibly earlier).
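A minimal sketch of what such an early check could look like. The helper name `check_unstacked_size` and the error wording are hypothetical; this is not the code pandas uses, only an illustration of failing fast before the mask is allocated.

```python
INT32_MAX = 2**31 - 1  # largest value representable as a signed 32-bit integer


def check_unstacked_size(full_shape):
    """Return the number of cells, or raise if it would overflow int32.

    Hypothetical helper: the product is computed with Python ints,
    which have arbitrary precision and therefore cannot wrap around.
    """
    num_cells = 1
    for dim in full_shape:
        num_cells *= int(dim)
    if num_cells > INT32_MAX:
        raise ValueError(
            f"Unstacked result would have {num_cells} cells, "
            f"which exceeds the int32 maximum of {INT32_MAX}; "
            "reduce the size of the data before pivoting."
        )
    return num_cells
```

Called with the shape from this issue, `check_unstacked_size((1337600, 3040))` raises a ValueError that names the real cause, instead of the harder-to-trace error produced by a negative size.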
This error still occurs for big data.
Code Sample, a copy-pastable example if possible
Problem description
Above code raises the following error:
np.prod(self.full_shape) appears to be returning a negative value because the number of unique index combinations is larger than the largest int32 value.
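The wraparound can be reproduced in isolation. The sketch below forces int32 arithmetic explicitly so that it behaves the same on every platform; on the reporter's setup, NumPy's default integer was presumably already 32-bit, which is an assumption based on the Windows environment listed in this report.

```python
import numpy as np

# Shape of the expected result, as given in the issue.
full_shape = (1337600, 3040)

# Force 32-bit arithmetic: the true product does not fit and wraps around.
shape32 = np.array(full_shape, dtype=np.int32)
wrapped = np.prod(shape32, dtype=np.int32)

# Accumulating in 64-bit integers gives the true product.
exact = np.prod(full_shape, dtype=np.int64)

print(int(wrapped))  # negative, because the product exceeds 2**31 - 1
print(int(exact))    # 4066304000
```

A negative element count passed to np.zeros is what ultimately triggers the ValueError.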
If line 144 were changed to the following, the issue could be fixed:
mask = np.zeros(np.prod(self.full_shape, dtype=np.int64), dtype=bool)
Expected Output
1337600 x 3040 dataframe.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.12.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None