Pivot / unstack on large data frame does not work: int32 overflow #26314
Comments
Try on master; this was recently patched. |
Corporate server, so long story short, I can only work with conda releases. |
If you are trying to do this, approach your problem differently: https://github.com/pandas-dev/pandas/pull/23512/files |
The point I'm trying to raise is: why is the number of cells limited to max value of np.int32? This creates constraints when working with large data frames. |
so you have 2B columns? |
My current dataset has num_cells = 2584251 * 4539 = 11,729,915,289 cells, so I have 11.7B cells. Putting an int32 constraint on the number of cells is far too small for my datasets. |
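As a sketch (my own arithmetic, not from the thread), the poster's cell count can be used to show why the check fires: multiplying with dtype=np.int32 wraps around modulo 2**32 and comes out negative, which is exactly what the `num_cells <= 0` guard in reshape.py detects.

```python
import numpy as np

# The poster's shape: 2,584,251 rows x 4,539 columns = 11,729,915,289 cells.
num_rows, num_columns = 2_584_251, 4_539

# With dtype=np.int32 the product wraps around and comes out negative,
# which is what the "Unstacked DataFrame is too big" check detects.
cells_i32 = np.multiply(num_rows, num_columns, dtype=np.int32)

# With dtype=np.int64 the true value fits comfortably.
cells_i64 = np.multiply(num_rows, num_columns, dtype=np.int64)

print(int(cells_i32))  # negative due to int32 wraparound
print(int(cells_i64))  # 11729915289
```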
Btw, in pandas 0.21 I could execute these unstacks without problems. Size was not an issue. |
Yeah, I'm bumping into this as well. I'm trying to make 2.87 billion cells in my unstack. I saw in issue #20601 that throwing a more informative error basically made more sense than increasing from int32 to int64. I kind of agree with that assessment. It would be nice to have an option in unstack that would let you tell it that you expect a very large number of cells and switch to an int64 index. I'm not sure how difficult this would be, or if it would be worth it to satisfy people with long lists of product-user pairs like me. |
@Rblivingstone certainly would take a PR to allow int64 here; it’s not tested and that’s why we raise |
I'm getting this error at 6 GB of memory use when pivoting the MovieLens large ratings.csv. So I agree that in this day and age we need int64. |
Turns out I get the same error even after changing the reshape.py line to |
I get "ValueError: negative dimensions are not allowed" after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64). Any chance we have a different workaround? I only have 6000 columns... |
I have the same issue, with 9.9 GB of data in memory:
df.pivot_table(index='uno', columns=['programid'], values='avg_time')
ValueError: Unstacked DataFrame is too big, causing int32 overflow
pandas 0.25.0 |
@buddingPop @Code4SAFrankie @meganson are any of you interested in working on this? We also need a reproducible example, if anyone can provide that. |
You can download the ratings.csv file, then create a pivot table as below to reproduce the problem: |
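The exact command was lost from this comment; a plausible reconstruction (my own, using the `userId` / `movieId` / `rating` column names from the MovieLens ratings.csv layout) would be something like the following, shown here on a tiny made-up frame. With the full ratings file, the unique-user by unique-movie grid exceeds 2**31 - 1 cells and triggers the error on affected versions.

```python
import pandas as pd

# Tiny stand-in for the MovieLens ratings.csv; column names follow its layout.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3],
        "movieId": [10, 20, 10, 30, 20],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    }
)

# One row per user, one column per movie; missing pairs become NaN.
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating")
print(matrix.shape)  # (3, 3)
```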
We won't be able to include downloading that dataset in a unit test. I assume this can be reproduced with the right style of random data? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
Same issue here. @TomAugspurger Let me try to generate similar data with random data. |
This is a very naive test, but it seems to reproduce the error for me:
row_cnt = 4000000
c1_set = [i for i in range(c1_unique_val_cnt)]
df_test = pd.DataFrame({'c1': c1, 'c2': c2})
Produces the error: "ValueError: Unstacked DataFrame is too big, causing int32 overflow" |
Thanks @vengertsev. A similar example that's a bit faster, since it vectorizes the column generation In [62]: df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(4000000, 2)), columns=['a', 'b'])
In [63]: df.set_index(['a', 'b']).unstack()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-2ee2ef4b1279> in <module>
----> 1 df.set_index(['a', 'b']).unstack()
~/sandbox/pandas/pandas/core/frame.py in unstack(self, level, fill_value)
6311 from pandas.core.reshape.reshape import unstack
6312
-> 6313 return unstack(self, level, fill_value)
6314
6315 _shared_docs[
~/sandbox/pandas/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
408 if isinstance(obj, DataFrame):
409 if isinstance(obj.index, MultiIndex):
--> 410 return _unstack_frame(obj, level, fill_value=fill_value)
411 else:
412 return obj.T.stack(dropna=False)
~/sandbox/pandas/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
438 value_columns=obj.columns,
439 fill_value=fill_value,
--> 440 constructor=obj._constructor,
441 )
442 return unstacker.get_result()
~/sandbox/pandas/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
135
136 if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 137 raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
138
139 self._make_sorted_values_labels()
ValueError: Unstacked DataFrame is too big, causing int32 overflow |
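A back-of-the-envelope check (my arithmetic, not from the thread) of why this reproducer trips the guard: with 4M random (a, b) pairs drawn from [0, 1500000), nearly every value appears in each column, so the unstacked grid approaches 1.5M x 1.5M cells, vastly more than an int32 can hold.

```python
# The unstacked grid has (unique 'a' values) x (unique 'b' values) cells.
int32_max = 2**31 - 1
approx_cells = 1_500_000 * 1_500_000  # upper bound on the grid size

print(approx_cells > int32_max)  # True: far beyond the int32 limit
```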
@TomAugspurger, Nice, thanks! |
Anyone interested in working on this now that we have a reproducible example? |
ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
ValueError: Unstacked DataFrame is too big, causing int32 overflow |
Hello, I am facing the same issue trying to create a pivot table. Can someone help? Thank you! |
We need a PR to fix it in pandas. |
Is this the only solution, you think? Maybe there is an alternative solution? |
You could attempt to downgrade your pandas to 0.21 (see here: #26314 (comment)). I ran into a similar problem recently, downgraded, and it worked on one of my datasets. It isn't the best solution, but it might get you moving. |
thank you |
This is exactly the command I used. |
try with Python 3.6 |
Any news on this? I'm failing to roll back to 0.21. My next course of action is to rewrite the vectorised function as a loop, but I'd rather not. |
I got this error in ver. |
Still open, still looking for a volunteer to fix it. |
Any update on this? |
pandas is an all-volunteer project; PRs from the community would be accepted for this or any other issue. |
I have been stuck on this for some days and finally fixed it by changing int32 to int64. Would it be all right to send a pull request with it? |
@KaonToPion please do. |
@KaonToPion Please let us know once you check in... I would like to have this fix ASAP... Thanks a lot! |
I am sorry, I am having issues installing the pandas development environment; that's why it's taking long. |
https://pandas.pydata.org/docs/development/contributing.html, in case it helps. |
@KaonToPion also waiting on this fix. If you could give it another try, that would be great. Thank you! |
The current code that checks for int32 overflow is wrong anyway, as it will not catch all overflows (only those where the result is still negative): pandas/pandas/core/reshape/reshape.py Lines 115 to 119 in a9cacd9
Correct check:
# GH20601: Catch int32 overflow if the number of cells is too high.
if num_rows > 0 and num_columns > 0 and num_rows * num_columns > 2**31 - 1:
    raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
Examples:
In [1]: import numpy as np
In [2]: def test_overflow(num_rows, num_columns):
   ...:     num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
   ...:     if num_rows > 0 and num_columns > 0 and num_cells <= 0:
   ...:         print("Unstacked DataFrame is too big, causing int32 overflow (np.multiply)")
   ...:     if num_rows > 0 and num_columns > 0 and num_rows * num_columns > 2**31 - 1:
   ...:         print("Unstacked DataFrame is too big, causing int32 overflow (python)")
   ...:
In [3]: test_overflow(400_000, 1_000)
In [4]: test_overflow(4_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (np.multiply)
Unstacked DataFrame is too big, causing int32 overflow (python)
In [5]: test_overflow(40_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (python)
In [6]: test_overflow(400_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (python) |
Some suggestions were to downgrade to pandas==0.21, which is not really a feasible solution! I faced the same issue and needed an urgent fix for the unexpected int32 overflow. One of our recommendation models was running in production, and at some point the user base grew to more than 7 million records with around 21k items. So, to solve the issue, I chunked the dataset as mentioned by @igorkf, created the pivot table using unstack, and appended it gradually.
NB: pandas has a pivot_table function which can be used for unstacking if your data is small. In my case, pivot_table was really slow. |
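A minimal sketch of the chunking workaround described above (the helper name, chunk count, and demo index names are mine, not from the thread): split the first index level into chunks, unstack each slice so no single unstack exceeds the int32 limit, then concatenate the pieces.

```python
import numpy as np
import pandas as pd

def chunked_unstack(s, n_chunks=10):
    # Sketch: unstack a Series with a two-level MultiIndex piece by piece.
    level0 = s.index.get_level_values(0).unique()
    pieces = []
    for chunk in np.array_split(np.asarray(level0), n_chunks):
        if len(chunk) == 0:
            continue
        # Select one slice of first-level labels and unstack just that slice.
        pieces.append(s.loc[list(chunk)].unstack())
    # Stack the partial wide frames back together row-wise.
    return pd.concat(pieces)

# Small demo with made-up user/item data.
idx = pd.MultiIndex.from_product([[1, 2, 3], ["a", "b"]], names=["uno", "programid"])
s = pd.Series(np.arange(6.0), index=idx)
wide = chunked_unstack(s, n_chunks=3)
print(wide.shape)  # (3, 2)
```

Note that if the chunks produce different column sets, pd.concat fills the gaps with NaN, matching what a single unstack would have produced.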
Did this ever get patched? :) |
Nope, I don't think so. I moved to alternate ways of doing it.
|
How did you solve it? Did you use the chunking method that kamal1262 mentioned, or another package? Any advice would be great :) |
Code Sample, a copy-pastable example if possible
Problem description
I'm working on rather large data (>100GB in memory) on a beefy server (3TB ram)
When refactoring my code from pandas 0.21 to the latest version, pivot / unstack now raises an exception.
Unstacked DataFrame is too big, causing int32 overflow
I was able to eliminate the problem by changing the reshape.py:
modify line 121 from dtype np.int32 to dtype np.int64:
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
Expected Output
Not being limited by int32 dimensions when reshaping a data frame.
This feels like a restriction that should not be there.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None