BUG: groupby.count fails on windows with large categorical index #15234
Comments
show a reproducible example |
can you show a specific example where this fails? |
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = np.random.randn(1000000, 3)
In [4]: rangex = np.linspace(-2, 2, 2048)
In [5]: rangey = np.linspace(-1, 1, 1024)
In [6]: rangez = np.linspace(-3, 3, 4096)
In [7]: np.iinfo(int).max > rangex.size * rangey.size * rangez.size
Out[7]: False
In [8]: df = pd.DataFrame(data, columns=["x", "y", "z"])
In [9]: grouped = df.groupby([pd.cut(df.x, rangex), pd.cut(df.y, rangey), pd.cut(df.z, rangez)])
In [10]: grouped.count()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-220cef394518> in <module>()
----> 1 grouped.count()
C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in count(self)
3886 blk = map(make_block, map(counter, val), loc)
3887
-> 3888 return self._wrap_agged_blocks(data.items, list(blk))
3889
3890
C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in _wrap_agged_blocks(self, items, blocks)
3801 result = result.T
3802
-> 3803 return self._reindex_output(result)._convert(datetime=True)
3804
3805 def _reindex_output(self, result):
C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in _reindex_output(self, result)
3823 levels_list = [ping.group_index for ping in groupings]
3824 index, _ = MultiIndex.from_product(
-> 3825 levels_list, names=self.grouper.names).sortlevel()
3826
3827 if self.as_index:
C:\Anaconda3\lib\site-packages\pandas\indexes\multi.py in from_product(cls, iterables, sortorder, names)
1023
1024 labels, levels = _factorize_from_iterables(iterables)
-> 1025 labels = cartesian_product(labels)
1026
1027 return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,
C:\Anaconda3\lib\site-packages\pandas\tools\util.py in cartesian_product(X)
70 return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
71 np.product(a[i]))
---> 72 for i, x in enumerate(X)]
73
74
C:\Anaconda3\lib\site-packages\pandas\tools\util.py in <listcomp>(.0)
70 return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
71 np.product(a[i]))
---> 72 for i, x in enumerate(X)]
73
74
C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
394 except AttributeError:
395 return _wrapit(a, 'repeat', repeats, axis)
--> 396 return repeat(repeats, axis)
397
398
ValueError: negative dimensions are not allowed |
OK, thanks. I think this is the same as #14942, though on Windows this fails before getting to the eats-all-memory part. You are welcome to submit a PR to fix this. |
Is it possible to do a PR without forking the whole repository? |
@david-hoffman Theoretically yes (a very simple change can be made directly through GitHub), but not for this one: it will require some investigation, debugging, and tests. The contribution docs are here: http://pandas.pydata.org/pandas-docs/stable/contributing.html |
When the numbers in `X` are large it can cause an overflow error on windows machine where the native `int` is 32 bit. Switching to np.intp alleviates this problem. Other fixes would include switching to np.uint32 or np.uint64.

closes pandas-dev#15234

Author: David Hoffman <[email protected]>

Closes pandas-dev#15265 from david-hoffman/patch-1 and squashes the following commits:

c9c8d5e [David Hoffman] Update v0.19.2.txt
d54583e [David Hoffman] Remove `test_large_input` because it's too big
47a6c6c [David Hoffman] Update test so that it will actually run on "normal" machine
7aeee85 [David Hoffman] Added tests for large numbers
b196878 [David Hoffman] Fix overflow error in cartesian_product
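The idea behind the fix can be sketched roughly as follows. This is an illustrative re-implementation, not the pandas source: the function name `cartesian_product_sketch` and its internals are assumptions made for the example. The only detail taken from the commit message is that the lengths are stored as `np.intp` rather than the native `int`.

```python
import numpy as np


def cartesian_product_sketch(X):
    """Hypothetical sketch of a cartesian product over 1-D sequences.

    Not the pandas implementation; it only illustrates storing the
    lengths as np.intp so the running product uses a pointer-sized integer.
    """
    if len(X) == 0:
        return []

    # np.intp is pointer-sized: 64 bit on a 64-bit build, even where the
    # native int was 32 bit.
    lenX = np.fromiter((len(x) for x in X), dtype=np.intp)
    cumprodX = np.cumprod(lenX)
    total = cumprodX[-1]

    if total == 0:
        # At least one input is empty, so the product is empty too.
        return [np.asarray(x)[:0] for x in X]

    # repeats[i]: how often each element of X[i] appears consecutively.
    # tiles[i]: how often that repeated block is laid end to end.
    repeats = total // cumprodX
    tiles = cumprodX // lenX

    return [
        np.tile(np.repeat(np.asarray(x), int(repeats[i])), int(tiles[i]))
        for i, x in enumerate(X)
    ]


first, second = cartesian_product_sketch([[1, 2], [3, 4, 5]])
print(first.tolist())   # [1, 1, 1, 2, 2, 2]
print(second.tolist())  # [3, 4, 5, 3, 4, 5]
```

Note that a wider integer only removes the spurious negative repeat count. Materialising a product with billions of rows will still exhaust memory, which is the separate problem tracked in #14942.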
Problem description
On Windows machines the native `int` is `int32`, which causes an overflow error in `cartesian_product` in `tools.util`. The problem line is `lenX = np.fromiter((len(x) for x in X), dtype=int)`
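A small sketch of the wraparound may help. It assumes the three groupers have 2047, 1023 and 4095 levels, since `pd.cut` yields one fewer bin than the number of edges passed in. `np.int32` is forced explicitly to mimic the native `int` on the reporter's Windows setup, and `np.int64` stands in for what `np.intp` resolves to on a 64-bit build.

```python
import numpy as np

# Assumed level counts: one fewer than the 2048, 1024 and 4096 bin edges.
sizes = [2047, 1023, 4095]

# Running product in 32-bit integers: the last step exceeds 2**31 - 1
# and silently wraps around to a negative value.
narrow = np.cumprod(np.array(sizes, dtype=np.int32), dtype=np.int32)

# The same running product in 64-bit integers stays exact.
wide = np.cumprod(np.array(sizes, dtype=np.int64), dtype=np.int64)

print(narrow[-1] < 0)   # True
print(int(wide[-1]))    # 8575261695
```

A negative total of this kind, once used to derive a repeat count for `np.repeat`, is consistent with the `ValueError: negative dimensions are not allowed` in the traceback.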
Output of `pd.show_versions()`
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None