group_by produces 'minlength must be positive' error when applied to empty DataFrame #11699

Closed
Sereger13 opened this issue Nov 25, 2015 · 8 comments · Fixed by #11709

@Sereger13
Contributor

This used to work fine in previous versions but appears to be broken in 0.17.1

The following code:

import pandas as pd
df = pd.DataFrame({'A': [], 'B': []})
gb = df.groupby('A').size()

Produces this error:

ValueError: minlength must be positive

In v0.16.2 the same code produced an empty DataFrame. We would really like to upgrade to 0.17.1, but we rely heavily on this behaviour and so have to hold off on the upgrade. Checking for an empty DataFrame before every call is not practical for us either, as there are too many places where the frame can actually be empty.

If you can suggest any workaround in the meantime so we could upgrade that would be appreciated.

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-238.9.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.16.2
...

@jreback
Contributor

jreback commented Nov 25, 2015

cc @behzadnouri

@Sereger13 I don't think there is an easy way around this without resorting to patching DataFrame.groupby to catch this situation (which, while messy and normally not recommended, may work for you temporarily).

@Sereger13
Contributor Author

I see...

We found that this code:
count().iloc[:, 0]
produces very similar results to size() and seems to work for us, but it does not look particularly attractive, so we are still deciding whether to adopt it.

If you do decide to fix size() - is there any idea when the next version/patch is going to be available? Thanks..
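The count().iloc[:, 0] workaround mentioned above could be wrapped roughly as follows. This is a minimal sketch, not part of the original report: the helper name group_sizes_via_count is hypothetical, and it assumes the frame has at least one column besides the grouping key. Note that count() skips missing values, so the result can differ from size() when the first non-key column contains NaN.

```python
import pandas as pd


def group_sizes_via_count(df, key):
    """Approximate df.groupby(key).size() using count().

    Hypothetical helper: takes the per-group non-null count of the
    first non-key column as a stand-in for the group size.
    """
    counts = df.groupby(key).count()
    # Requires at least one non-key column; otherwise iloc[:, 0] raises.
    return counts.iloc[:, 0]


empty = pd.DataFrame({'A': [], 'B': []})
filled = pd.DataFrame({'A': [1, 2, 1], 'B': [1, 2, 3]})

print(len(group_sizes_via_count(empty, 'A')))
print(group_sizes_via_count(filled, 'A').to_dict())
```

The returned Series is named after the counted column (B here) rather than being unnamed as with size(), which may matter if downstream code relies on the name.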

@jreback
Contributor

jreback commented Nov 25, 2015

Will be fixed in 0.18.0, probably later in January.

@Sereger13
Contributor Author

Thanks.

@jreback
Contributor

jreback commented Nov 25, 2015

@Sereger13 my point about patching is that you can avoid any code changes.

Note again that this is a 'hack', but it will work.

e.g.

In [109]: df1 = pd.DataFrame({'A': [], 'B': []})

In [110]: df2 = pd.DataFrame({'A': [1,2,1], 'B': [1,2,3]})

In [116]: def size(self):
   .....:     try:
   .....:         return self.grouper.size()
   .....:     except ValueError:
   .....:         self._set_selection_from_grouper()
   .....:         return self._selected_obj[0:0]
   .....:     

In [117]: pandas.core.groupby.GroupBy.size = size

In [118]: df1.groupby('A').size()
Out[118]: 
Empty DataFrame
Columns: [B]
Index: []

In [119]: df2.groupby('A').size()
Out[119]: 
A
1    2
2    1
dtype: int64

@Sereger13
Contributor Author

Great - thanks for your help.

@behzadnouri
Contributor

This is more of a bug in np.bincount, because it unnecessarily requires minlength to be strictly positive. Though kind of ugly, the workaround would be simple:

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index e9aa906..d722ef8 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -1439,7 +1439,8 @@ class BaseGrouper(object):
         """
         ids, _, ngroup = self.group_info
         ids = com._ensure_platform_int(ids)
-        out = np.bincount(ids[ids != -1], minlength=ngroup)
+        mask = ids != -1
+        out = np.bincount(ids[mask], minlength=ngroup) if ngroup != 0 else []
         return Series(out, index=self.result_index, dtype='int64')

     @cache_readonly
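
The guard in the diff above can be tried outside pandas internals. The following is only a sketch under stated assumptions: the function name group_sizes is hypothetical, and ids is assumed to use -1 for rows that belong to no group, as in the diff. It returns early when ngroup is 0, so np.bincount is never asked for a non-positive minlength.

```python
import numpy as np


def group_sizes(ids, ngroup):
    """Count rows per group id, tolerating the zero-group case.

    Hypothetical standalone version of the guard in the diff:
    ids holds one group id per row, with -1 marking "no group".
    """
    ids = np.asarray(ids, dtype=np.intp)
    if ngroup == 0:
        # No groups at all: skip np.bincount entirely.
        return np.zeros(0, dtype=np.int64)
    mask = ids != -1
    return np.bincount(ids[mask], minlength=ngroup).astype(np.int64)


print(group_sizes([0, 1, 0, -1], 2).tolist())
print(group_sizes([], 0).tolist())
```

Returning an empty integer array rather than a plain list keeps the dtype consistent between the empty and non-empty paths.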

@Sereger13
Contributor Author

Interesting... thanks for the update. Yes, np.bincount() could indeed have been better: allowing None and 0 to have the same meaning would make it more usable.

So it looks like simply setting ngroup to None should also do the trick:

if not ngroup:
    ngroup=None
out = np.bincount(ids[ids != -1], minlength=ngroup)

Not sure this is more readable than @behzadnouri's solution though. Looking forward to a new pandas with the workaround!
