BUG: groupby.count fails on windows with large categorical index #15234

david-hoffman · 2017-01-26T05:01:56Z

Problem description

On Windows machines the native int is int32, which causes an overflow error in `cartesian_product` in `tools.util`. The problem line is `lenX = np.fromiter((len(x) for x in X), dtype=int)`:

def cartesian_product(X):
    """
    Numpy version of itertools.product or pandas.compat.product.
    Sometimes faster (for large inputs)...

    Parameters
    ----------
    X : list-like of list-likes

    Returns
    -------
    product : list of ndarrays

    Examples
    --------
    >>> cartesian_product([list('ABC'), [1, 2]])
    [array(['A', 'A', 'B', 'B', 'C', 'C'], dtype='|S1'),
    array([1, 2, 1, 2, 1, 2])]

    See also
    --------
    itertools.product : Cartesian product of input iterables.  Equivalent to
        nested for-loops.
    pandas.compat.product : An alias for itertools.product.
    """
    msg = "Input must be a list-like of list-likes"
    if not is_list_like(X):
        raise TypeError(msg)
    for x in X:
        if not is_list_like(x):
            raise TypeError(msg)

    if len(X) == 0:
        return []

    lenX = np.fromiter((len(x) for x in X), dtype=int)  # <----- HERE
    cumprodX = np.cumproduct(lenX)

    a = np.roll(cumprodX, 1)
    a[0] = 1

    if cumprodX[-1] != 0:
        b = cumprodX[-1] / cumprodX
    else:
        # if any factor is empty, the cartesian product is empty
        b = np.zeros_like(cumprodX)

    return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
                    np.product(a[i]))
            for i, x in enumerate(X)]

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None


jreback · 2017-01-26T10:06:36Z

show a reproducible example

jreback · 2017-01-30T14:24:32Z

can you show a specific example where this fails?

david-hoffman · 2017-01-30T15:21:02Z

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = np.random.randn(1000000, 3)

In [4]: rangex = np.linspace(-2, 2, 2048)

In [5]: rangey = np.linspace(-1, 1, 1024)

In [6]: rangez = np.linspace(-3, 3, 4096)

In [7]: np.iinfo(int).max > rangex.size * rangey.size * rangez.size
Out[7]: False

In [8]: df = pd.DataFrame(data, columns=["x", "y", "z"])

In [9]: grouped = df.groupby([pd.cut(df.x, rangex), pd.cut(df.y, rangey), pd.cut(df.z, rangez)])

In [10]: grouped.count()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-220cef394518> in <module>()
----> 1 grouped.count()

C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in count(self)
   3886         blk = map(make_block, map(counter, val), loc)
   3887
-> 3888         return self._wrap_agged_blocks(data.items, list(blk))
   3889
   3890

C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in _wrap_agged_blocks(self, items, blocks)
   3801             result = result.T
   3802
-> 3803         return self._reindex_output(result)._convert(datetime=True)
   3804
   3805     def _reindex_output(self, result):

C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in _reindex_output(self, result)
   3823         levels_list = [ping.group_index for ping in groupings]
   3824         index, _ = MultiIndex.from_product(
-> 3825             levels_list, names=self.grouper.names).sortlevel()
   3826
   3827         if self.as_index:

C:\Anaconda3\lib\site-packages\pandas\indexes\multi.py in from_product(cls, iterables, sortorder, names)
   1023
   1024         labels, levels = _factorize_from_iterables(iterables)
-> 1025         labels = cartesian_product(labels)
   1026
   1027         return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,

C:\Anaconda3\lib\site-packages\pandas\tools\util.py in cartesian_product(X)
     70     return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
     71                     np.product(a[i]))
---> 72             for i, x in enumerate(X)]
     73
     74

C:\Anaconda3\lib\site-packages\pandas\tools\util.py in <listcomp>(.0)
     70     return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
     71                     np.product(a[i]))
---> 72             for i, x in enumerate(X)]
     73
     74

C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
    394     except AttributeError:
    395         return _wrapit(a, 'repeat', repeats, axis)
--> 396     return repeat(repeats, axis)
    397
    398

ValueError: negative dimensions are not allowed
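
The negative dimension reported above is consistent with 32-bit wraparound in the cumulative product of the level sizes. Here is a minimal sketch of that arithmetic, not the pandas code itself. It assumes each `pd.cut` call yields one fewer interval than it has bin edges (2047, 1023 and 4095 levels), and it forces `np.int32` explicitly so the wraparound shows on any platform, not only on Windows:

```python
import numpy as np

# Assumed level sizes: pd.cut returns len(edges) - 1 intervals per axis.
sizes = [2048 - 1, 1024 - 1, 4096 - 1]

# Force 32-bit arithmetic explicitly so the wraparound shows on any platform.
cum32 = np.cumprod(np.array(sizes, dtype=np.int32), dtype=np.int32)
# Same product with 64-bit integers for comparison.
cum64 = np.cumprod(np.array(sizes, dtype=np.int64), dtype=np.int64)

print(cum32[-1])  # wrapped value, negative
print(cum64[-1])  # true product
print(cum64[-1] > np.iinfo(np.int32).max)
```

The last element of `cum32` comes out negative, so the repeat counts derived from it are negative too, which is what `np.repeat` rejects.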

jreback · 2017-01-30T15:33:07Z

ok thanks. I think this is the same as #14942, though on windows this fails before getting to the eats-all-memory part.

You're welcome to do a PR to fix this.

david-hoffman · 2017-01-30T16:10:59Z

Is it possible to do a PR without forking the whole repository?

jreback · 2017-01-30T16:13:01Z

@david-hoffman theoretically (for a very very simple thing, you can do it via github), but not for this, this will require some investigation, debugging, and tests.

contribution docs are here: http://pandas.pydata.org/pandas-docs/stable/contributing.html

When the numbers in `X` are large it can cause an overflow error on windows machine where the native `int` is 32 bit. Switching to np.intp alleviates this problem. Other fixes would include switching to np.uint32 or np.uint64.

closes pandas-dev#15234

Author: David Hoffman <[email protected]>

Closes pandas-dev#15265 from david-hoffman/patch-1 and squashes the following commits:

c9c8d5e [David Hoffman] Update v0.19.2.txt
d54583e [David Hoffman] Remove `test_large_input` because it's too big
47a6c6c [David Hoffman] Update test so that it will actually run on "normal" machine
7aeee85 [David Hoffman] Added tests for large numbers
b196878 [David Hoffman] Fix overflow error in cartesian_product
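
The change described in the commit message above can be sketched roughly as follows. This is a simplified standalone variant, not the merged pandas code: the name `cartesian_product_intp` is hypothetical, the list-like validation and `com._values_from_object` are left out, and integer division is an assumption made so the repeat counts stay integral.

```python
import numpy as np


def cartesian_product_intp(X):
    """Hypothetical standalone variant that stores lengths as np.intp."""
    if len(X) == 0:
        return []

    # np.intp is pointer-sized, so it is 64-bit on a 64-bit interpreter
    # even where the default integer is 32-bit.
    lenX = np.fromiter((len(x) for x in X), dtype=np.intp)
    cumprodX = np.cumprod(lenX)

    # a[i]: how many times the repeated block for factor i is tiled.
    a = np.roll(cumprodX, 1)
    a[0] = 1

    if cumprodX[-1] != 0:
        # b[i]: how many times each element of factor i is repeated.
        b = cumprodX[-1] // cumprodX
    else:
        # If any factor is empty, the cartesian product is empty.
        b = np.zeros_like(cumprodX)

    return [np.tile(np.repeat(np.asarray(x), int(b[i])), int(a[i]))
            for i, x in enumerate(X)]


result = cartesian_product_intp([list("ABC"), [1, 2]])
print(result)
```

Keeping the lengths in `np.intp` only removes the 32-bit wraparound. Building the full product for the inputs in this issue would still need a very large amount of memory, as noted for #14942.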

jreback added Groupby Windows Windows OS labels Jan 26, 2017

jreback added Bug Difficulty Intermediate labels Jan 30, 2017

jreback added this to the 0.20.0 milestone Jan 30, 2017

jreback changed the title ~~groupby.count fails on windows with large categorical index~~ BUG: groupby.count fails on windows with large categorical index Jan 30, 2017

jreback added the Categorical Categorical Data Type label Jan 30, 2017

david-hoffman mentioned this issue Jan 30, 2017

Fix overflow error in cartesian_product #15265

Closed

jreback closed this as completed in 48fc9d6 Feb 1, 2017
