BUG: groupby.count fails on windows with large categorical index #15234

Closed
david-hoffman opened this issue Jan 26, 2017 · 6 comments
Labels
Bug, Categorical, Groupby, Windows
Milestone
0.20.0

Comments

@david-hoffman
Contributor

Problem description

On Windows machines the native int is int32, which causes an overflow error in cartesian_product in tools.util once the product of the input lengths no longer fits in 32 bits. The problem line is lenX = np.fromiter((len(x) for x in X), dtype=int):

def cartesian_product(X):
    """
    Numpy version of itertools.product or pandas.compat.product.
    Sometimes faster (for large inputs)...

    Parameters
    ----------
    X : list-like of list-likes

    Returns
    -------
    product : list of ndarrays

    Examples
    --------
    >>> cartesian_product([list('ABC'), [1, 2]])
    [array(['A', 'A', 'B', 'B', 'C', 'C'], dtype='|S1'),
    array([1, 2, 1, 2, 1, 2])]

    See also
    --------
    itertools.product : Cartesian product of input iterables.  Equivalent to
        nested for-loops.
    pandas.compat.product : An alias for itertools.product.
    """
    msg = "Input must be a list-like of list-likes"
    if not is_list_like(X):
        raise TypeError(msg)
    for x in X:
        if not is_list_like(x):
            raise TypeError(msg)

    if len(X) == 0:
        return []

    lenX = np.fromiter((len(x) for x in X), dtype=int)  # <-- problem line: dtype=int is int32 on Windows
    cumprodX = np.cumproduct(lenX)

    a = np.roll(cumprodX, 1)
    a[0] = 1

    if cumprodX[-1] != 0:
        b = cumprodX[-1] / cumprodX
    else:
        # if any factor is empty, the cartesian product is empty
        b = np.zeros_like(cumprodX)

    return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
                    np.product(a[i]))
            for i, x in enumerate(X)]
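
The overflow described above can be reproduced on any platform by forcing a 32-bit dtype. This is only a minimal sketch: the level sizes are hypothetical, np.int32 stands in for what dtype=int resolves to on the affected Windows setup, and np.int64 stands in for a pointer-sized integer on a 64-bit build.

```python
import numpy as np

# Hypothetical level sizes, chosen so that their product exceeds 2**31 - 1.
sizes = [2047, 1023, 4095]

# Mimic the affected platform: lengths and their running product held in int32.
len32 = np.fromiter(sizes, dtype=np.int32)
cumprod32 = np.cumprod(len32, dtype=np.int32)

# Same computation with 64-bit integers.
len64 = np.fromiter(sizes, dtype=np.int64)
cumprod64 = np.cumprod(len64)

# The last int32 entry wraps around silently and ends up negative,
# while the int64 result holds the true product.
print("int32:", cumprod32)
print("int64:", cumprod64)
```

NumPy does not raise on integer overflow in array operations, so the wrapped value only surfaces later, when it is used as a repeat count.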

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 26, 2017

show a reproducible example

@jreback added the Groupby and Windows labels Jan 26, 2017
@jreback
Contributor

jreback commented Jan 30, 2017

can you show a specific example where this fails?

@david-hoffman
Contributor Author

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = np.random.randn(1000000, 3)

In [4]: rangex = np.linspace(-2, 2, 2048)

In [5]: rangey = np.linspace(-1, 1, 1024)

In [6]: rangez = np.linspace(-3, 3, 4096)

In [7]: np.iinfo(int).max > rangex.size * rangey.size * rangez.size
Out[7]: False

In [8]: df = pd.DataFrame(data, columns=["x", "y", "z"])

In [9]: grouped = df.groupby([pd.cut(df.x, rangex), pd.cut(df.y, rangey), pd.cut(df.z, rangez)])

In [10]: grouped.count()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-220cef394518> in <module>()
----> 1 grouped.count()

C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in count(self)
   3886         blk = map(make_block, map(counter, val), loc)
   3887
-> 3888         return self._wrap_agged_blocks(data.items, list(blk))
   3889
   3890

C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in _wrap_agged_blocks(self, items, blocks)
   3801             result = result.T
   3802
-> 3803         return self._reindex_output(result)._convert(datetime=True)
   3804
   3805     def _reindex_output(self, result):

C:\Anaconda3\lib\site-packages\pandas\core\groupby.py in _reindex_output(self, result)
   3823         levels_list = [ping.group_index for ping in groupings]
   3824         index, _ = MultiIndex.from_product(
-> 3825             levels_list, names=self.grouper.names).sortlevel()
   3826
   3827         if self.as_index:

C:\Anaconda3\lib\site-packages\pandas\indexes\multi.py in from_product(cls, iterables, sortorder, names)
   1023
   1024         labels, levels = _factorize_from_iterables(iterables)
-> 1025         labels = cartesian_product(labels)
   1026
   1027         return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,

C:\Anaconda3\lib\site-packages\pandas\tools\util.py in cartesian_product(X)
     70     return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
     71                     np.product(a[i]))
---> 72             for i, x in enumerate(X)]
     73
     74

C:\Anaconda3\lib\site-packages\pandas\tools\util.py in <listcomp>(.0)
     70     return [np.tile(np.repeat(np.asarray(com._values_from_object(x)), b[i]),
     71                     np.product(a[i]))
---> 72             for i, x in enumerate(X)]
     73
     74

C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
    394     except AttributeError:
    395         return _wrapit(a, 'repeat', repeats, axis)
--> 396     return repeat(repeats, axis)
    397
    398

ValueError: negative dimensions are not allowed
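
The size check from the session above depends on what int means on the machine running it. A platform-independent version might look like the following sketch; the variable names are illustrative, and the sizes are those of rangex, rangey and rangez.

```python
import numpy as np

# Sizes of rangex, rangey and rangez from the session above.
n_edges = (2048, 1024, 4096)

# Python integers have arbitrary precision, so this product cannot overflow.
total = 1
for n in n_edges:
    total *= n

# Compare against explicit dtypes instead of the platform-dependent `int`.
fits_int32 = total <= np.iinfo(np.int32).max
fits_intp = total <= np.iinfo(np.intp).max

print("total combinations:", total)
print("fits in int32:", fits_int32)
print("fits in intp: ", fits_intp)
```

The product is 2**33, which is above the int32 maximum of 2**31 - 1. That is consistent with the negative repeat count that np.repeat rejects in the traceback.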

@jreback
Contributor

jreback commented Jan 30, 2017

OK, thanks. I think this is the same as #14942, though on Windows this fails before getting to the eats-all-memory part.

You are welcome to submit a PR to fix this.

@jreback added this to the 0.20.0 milestone Jan 30, 2017
@jreback changed the title from "groupby.count fails on windows with large categorical index" to "BUG: groupby.count fails on windows with large categorical index" Jan 30, 2017
@jreback added the Categorical label Jan 30, 2017
@david-hoffman
Contributor Author

Is it possible to do a PR without forking the whole repository?

@jreback
Contributor

jreback commented Jan 30, 2017

@david-hoffman Theoretically yes: for a very simple change you can do it directly through GitHub. This one, though, will require some investigation, debugging, and tests.

contribution docs are here: http://pandas.pydata.org/pandas-docs/stable/contributing.html

@jreback closed this as completed in 48fc9d6 Feb 1, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
When the numbers in `X` are large, this can cause an overflow error on
Windows machines, where the native `int` is 32-bit. Switching to np.intp
alleviates this problem. Other fixes would include switching to
np.uint32 or np.uint64.

closes pandas-dev#15234

Author: David Hoffman

Closes pandas-dev#15265 from david-hoffman/patch-1 and squashes the following commits:

c9c8d5e [David Hoffman] Update v0.19.2.txt
d54583e [David Hoffman] Remove `test_large_input` because it's too big
47a6c6c [David Hoffman] Update test so that it will actually run on "normal" machine
7aeee85 [David Hoffman] Added tests for large numbers
b196878 [David Hoffman] Fix overflow error in cartesian_product
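
The np.intp change described in the commit message can be sketched as follows. This is a simplified stand-in, not the pandas source: the function name cartesian_product_sketch is hypothetical, the input validation is omitted, np.cumprod replaces np.cumproduct, and integer division is used so that the repeat counts stay integral.

```python
import numpy as np


def cartesian_product_sketch(X):
    """Simplified illustration of cartesian_product using np.intp lengths."""
    if len(X) == 0:
        return []

    # np.intp is pointer-sized, so it is 64-bit on a 64-bit Python build,
    # including on Windows.
    lenX = np.fromiter((len(x) for x in X), dtype=np.intp)
    cumprodX = np.cumprod(lenX)

    # a[i]: how many times the repeated block for input i is tiled.
    a = np.roll(cumprodX, 1)
    a[0] = 1

    if cumprodX[-1] != 0:
        # b[i]: how many times each element of input i is repeated.
        b = cumprodX[-1] // cumprodX
    else:
        # If any factor is empty, the cartesian product is empty.
        b = np.zeros_like(cumprodX)

    return [np.tile(np.repeat(np.asarray(x), b[i]), a[i])
            for i, x in enumerate(X)]


result = cartesian_product_sketch([list("ABC"), [1, 2]])
print(result[0].tolist())  # ['A', 'A', 'B', 'B', 'C', 'C']
print(result[1].tolist())  # [1, 2, 1, 2, 1, 2]
```

With pointer-sized lengths the cumulative product no longer wraps for the sizes in this report, although materializing the full product can still be very memory hungry, as noted in the discussion of #14942.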