groupby, as_index=False, with pandas.Series.count() as an agg #8381

marctollin · 2014-09-24T14:19:15Z

Why doesn't the pandas.Series.count() method work as a valid aggregation with groupby when as_index=False?

import numpy as np
import pandas as pd

df=pd.DataFrame([['foo','foo','bar','bar','bar','oats'],[1.0,2.0,3.0,4.0,4.0,5.0],[2.0,3.0,4.0,5.0,1.0,5.0]]).T
df.columns=['mycat','var1','var2']
df.var1=df.var1.astype('int64')
df.var2=df.var2.astype('int64')
df

Now, if I try to do a group by

df.groupby('mycat', as_index=False).var1.count()

Here is the error I get:


ValueError                                Traceback (most recent call last)
<ipython-input-383-27b244bccb8b> in <module>()
----> 1 df.groupby('mycat', as_index=False).var1.count()

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in count(self, axis)
    740 
    741     def count(self, axis=0):
--> 742         return self._count().astype('int64')
    743 
    744     def ohlc(self):

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error)
   2096 
   2097         mgr = self._data.astype(
-> 2098             dtype=dtype, copy=copy, raise_on_error=raise_on_error)
   2099         return self._constructor(mgr).__finalize__(self)
   2100 

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, **kwargs)
   2235 
   2236     def astype(self, dtype, **kwargs):
-> 2237         return self.apply('astype', dtype=dtype, **kwargs)
   2238 
   2239     def convert(self, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
   2190                                                  copy=align_copy)
   2191 
-> 2192             applied = getattr(b, f)(**kwargs)
   2193 
   2194             if isinstance(applied, list):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values)
    319     def astype(self, dtype, copy=False, raise_on_error=True, values=None):
    320         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 321                             values=values)
    322 
    323     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
    337             if values is None:
    338                 # _astype_nansafe works fine with 1-d only
--> 339                 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
    340                 values = values.reshape(self.values.shape)
    341             newb = make_block(values,

/usr/local/lib/python2.7/dist-packages/pandas/core/common.pyc in _astype_nansafe(arr, dtype, copy)
   2410     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
   2411         # work around NumPy brokenness, #1987
-> 2412         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   2413     elif issubclass(dtype.type, compat.string_types):
   2414         return lib.astype_str(arr.ravel()).reshape(arr.shape)

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in pandas.lib.astype_intsafe (pandas/lib.c:13456)()

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in util.set_value_at (pandas/lib.c:55994)()

ValueError: invalid literal for long() with base 10: 'bar'

When I set as_index=True, I get

df.groupby('mycat', as_index=True).var1.count()

When I change the agg function and set as_index=False, I get a weird result too:

df.groupby('mycat', as_index=False).var1.agg(np.count_nonzero)

UPDATE: Realized my last result was not counting correctly and am now thoroughly confused.


jreback · 2014-09-24T14:22:03Z

show pd.show_versions()

marctollin · 2014-09-24T14:23:12Z


INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.1
Cython: None
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

marctollin · 2014-09-24T14:31:08Z

I updated my original comment. I realized my last example didn't make sense: it was counting strangely (according to my intuition).

jreback · 2014-09-24T14:34:03Z

count has a different implementation than most other methods (e.g. they are in Cython or are essentially a Python loop over the groups). You get count for 'free', as it is available by definition when you group. The other routines handle nuisance columns (e.g. trying to perform a numeric operation on a string column) by excluding them; count needs to do the same. This works for as_index=True because by definition the grouping column is excluded.

care to do a pull-request?
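
Following the explanation above, one possible workaround while the as_index=False path is broken is to aggregate with the default as_index=True, where the grouping column is excluded, and then move the group labels back into a column. This is only a sketch: the frame is a dict-built copy of the example data, and the names counts and workaround are illustrative, not part of any fix.

```python
import pandas as pd

# Dict-built copy of the example frame from the original report.
df = pd.DataFrame({
    "mycat": ["foo", "foo", "bar", "bar", "bar", "oats"],
    "var1": [1, 2, 3, 4, 4, 5],
    "var2": [2, 3, 4, 5, 1, 5],
})

# Count with the default as_index=True, so 'mycat' becomes the index
# and is never part of the data being aggregated.
counts = df.groupby("mycat")["var1"].count()

# Turn the group labels back into an ordinary column.
workaround = counts.reset_index()
print(workaround)
```

The result has the same shape one would expect from as_index=False: a 'mycat' column next to the per-group count of var1.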

marctollin · 2014-09-24T14:38:42Z

Sorry, but I have no C programming experience, otherwise I would help. Hopefully others find value in fixing this bug.

jreback · 2014-09-24T14:39:17Z

no c involved

all python

marctollin · 2014-09-24T14:55:27Z

Ok! me and @jcauteru will give it a try

jreback · 2014-09-24T14:56:54Z

gr8!

create a test to compare against an expected result
then step through to figure out where it's going wrong

keep in mind there are lots of other tests which have to pass as well
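
A test along these lines might look like the sketch below. It assumes a recent pandas where assert_frame_equal is importable from pandas.testing (in 2014 the helper lived elsewhere), and the test name test_count_as_index_false is hypothetical rather than one from the pandas test suite.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def test_count_as_index_false():
    # Dict-built copy of the example frame from the report.
    df = pd.DataFrame({
        "mycat": ["foo", "foo", "bar", "bar", "bar", "oats"],
        "var1": [1, 2, 3, 4, 4, 5],
        "var2": [2, 3, 4, 5, 1, 5],
    })

    result = df.groupby("mycat", as_index=False)["var1"].count()

    # Expected: one row per group (sorted by key), with the group label
    # kept as a regular column and the count as an integer column.
    expected = pd.DataFrame({
        "mycat": ["bar", "foo", "oats"],
        "var1": [3, 2, 1],
    })
    assert_frame_equal(result, expected)


test_count_as_index_false()
```

On a version affected by this issue the call to count() raises before the comparison is reached, which is the failure to step through.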

marctollin · 2014-11-06T16:05:46Z

FYI, I haven't forgotten about this. What's going wrong is that astype('int64') is being applied to the nuisance columns (the strings). The bug can be fixed (at least for the small test case originally posted) by removing the requirement that the count be of dtype int64 or, alternatively, by passing the function to _python_agg_general, which iterates through everything except the exclusions in groupby.py.

Both of these fixes fail the nose tests (primarily AssertionError: attr is not equal [dtype]: dtype('float64') != dtype('int64')), so I'm exploring a different method, perhaps requiring int64 at a different point in the routine. @jcauteru
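
The failing step can be reproduced in isolation. The sketch below is not the groupby code path itself; it only shows that casting an object column holding the group labels to int64 raises the same kind of ValueError seen in the traceback. The name labels is illustrative.

```python
import pandas as pd

# An object-dtype column of group labels, like 'mycat' in the report.
labels = pd.Series(["foo", "bar"], dtype=object)

try:
    # The cast that count() effectively applied to the nuisance column.
    labels.astype("int64")
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```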

livia-b · 2015-07-25T12:23:41Z

I am experiencing a similar problem when the column used for groupby is of type float. No exception is raised, but the resulting column is cast to int64:

      bug_dataset = pd.DataFrame(np.array([[ 0.,  0. ], [-0.9, 1]]), columns = ['x','y'])
      print bug_dataset.groupby('x',as_index=False).count()

x y
0 0 1
1 0 1
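
The zeros in the x column are consistent with a float-to-int64 cast, which truncates toward zero. The sketch below shows that cast on the two key values, then runs the same groupby and prints the dtypes. It assumes a pandas version in which this issue is fixed, so that x stays float64; the names keys and truncated are illustrative.

```python
import numpy as np
import pandas as pd

# The two group keys from the example above.
keys = np.array([0.0, -0.9])

# Casting to int64 truncates toward zero, so -0.9 becomes 0.
truncated = keys.astype("int64")
print(truncated)

bug_dataset = pd.DataFrame(np.array([[0.0, 0.0], [-0.9, 1.0]]),
                           columns=["x", "y"])
result = bug_dataset.groupby("x", as_index=False).count()

# On a fixed version the key column keeps its float dtype.
print(result.dtypes)
```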

My version is :
INSTALLED VERSIONS

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-86-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.18
numpy: 1.9.2
scipy: 0.9.0
statsmodels: 0.6.1
IPython: 3.0.0-b1
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)

livia-b · 2015-07-25T20:07:10Z

Is this issue related to #10355 ?

jreback · 2015-07-26T01:56:34Z

they look similar. You want to take a crack at writing some tests and use the fix I suggested in that issue to see if it fixes?

livia-b · 2015-07-26T19:04:49Z

I will try, but it will take some time. I am still having problems with the test environment: I've cloned the repository and run the existing tests before changing any code, and I get FAILED (SKIP=543, errors=1, failures=2). Now I am trying to check out a release tag. I am also using a less powerful computer for development.

stephenpascoe · 2015-09-30T10:46:24Z

I've been stung with this issue too. Running @livia-b's test on latest master gives a different error:

In [17]: df.groupby('mycat', as_index=False).var1.count()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-27b244bccb8b> in <module>()
----> 1 df.groupby('mycat', as_index=False).var1.count()

/Users/spascoe/git/pandas/pandas/core/groupby.pyc in count(self)
   3510         #import pdb; pdb.set_trace()
   3511 
-> 3512         return self._wrap_agged_blocks(data.items, list(blk))
   3513 
   3514 

/Users/spascoe/git/pandas/pandas/lib.pyx in pandas.lib.count_level_2d (pandas/lib.c:22921)()
   1271         with nogil:
   1272             for i from 0 <= i < n:
-> 1273                 for j from 0 <= j < k:
   1274                     counts[labels[i], j] += mask[i, j]
   1275 

TypeError: count_level_2d() got an unexpected keyword argument 'axis'

It works for as_index=True, but the following doesn't. I thought this should return a DataFrame counting var1 and var2:

In [18]: df.groupby('mycat').count()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-921f07800bb6> in <module>()
----> 1 df.groupby('mycat').count()

/Users/spascoe/git/pandas/pandas/core/groupby.pyc in count(self)
   3510         #import pdb; pdb.set_trace()
   3511 
-> 3512         return self._wrap_agged_blocks(data.items, list(blk))
   3513 
   3514 

/Users/spascoe/git/pandas/pandas/lib.pyx in pandas.lib.count_level_2d (pandas/lib.c:22921)()
   1271         with nogil:
   1272             for i from 0 <= i < n:
-> 1273                 for j from 0 <= j < k:
   1274                     counts[labels[i], j] += mask[i, j]
   1275 

TypeError: count_level_2d() got an unexpected keyword argument 'axis'

stephenpascoe · 2015-09-30T10:50:26Z

It seems to work for SeriesGroupBy objects but not DataFrameGroupBy. We only get a SeriesGroupBy if we select a column after using as_index=True:

>>> type(df.groupby('mycat').var1)
<class 'pandas.core.groupby.SeriesGroupBy'>

>>> type(df.groupby('mycat'))
<class 'pandas.core.groupby.DataFrameGroupBy'>
>>> type(df.groupby('mycat', as_index=False))
<class 'pandas.core.groupby.DataFrameGroupBy'>
>>> type(df.groupby('mycat', as_index=False).var1)
<class 'pandas.core.groupby.DataFrameGroupBy'>
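
The same check can be scripted with isinstance. This is a sketch under two assumptions: that the classes remain importable from the internal pandas.core.groupby path, and that only the as_index=True results are stable across versions. The type returned after selecting a column with as_index=False is printed rather than asserted, since it may differ between pandas versions.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy, SeriesGroupBy

# Dict-built copy of the example frame from the report.
df = pd.DataFrame({
    "mycat": ["foo", "foo", "bar", "bar", "bar", "oats"],
    "var1": [1, 2, 3, 4, 4, 5],
    "var2": [2, 3, 4, 5, 1, 5],
})

frame_grouped = df.groupby("mycat")
series_grouped = df.groupby("mycat")["var1"]
print(isinstance(frame_grouped, DataFrameGroupBy))
print(isinstance(series_grouped, SeriesGroupBy))

# With as_index=False the class after column selection is version
# dependent, so it is only reported here.
no_index_selected = df.groupby("mycat", as_index=False)["var1"]
print(type(no_index_selected).__name__)
```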

jreback · 2015-09-30T11:26:27Z

These all are fixed in master (you need a very recent master)

In [11]: df=pd.DataFrame([['foo','foo','bar','bar','bar','oats'],[1.0,2.0,3.0,4.0,4.0,5.0],[2.0,3.0,4.0,5.0,1.0,5.0]]).T

In [12]: df.columns=['mycat','var1','var2']

In [13]: df.var1=df.var1.astype('int64')

In [14]: df.var2=df.var2.astype('int64')

In [15]: df
Out[15]: 
  mycat  var1  var2
0   foo     1     2
1   foo     2     3
2   bar     3     4
3   bar     4     5
4   bar     4     1
5  oats     5     5

In [17]: df.groupby('mycat').var1.count()
Out[17]: 
mycat
bar     3
foo     2
oats    1
Name: var1, dtype: int64

In [18]: pd.__version__
Out[18]: '0.17.0rc1+115.g274abee'

In [19]: df.groupby('mycat').count()
Out[19]: 
       var1  var2
mycat            
bar       3     3
foo       2     2
oats      1     1

xref #11079 (it was fixed in an earlier commit)

jreback added Bug Groupby labels Sep 24, 2014

jreback added this to the 0.15.1 milestone Sep 24, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback closed this as completed Sep 30, 2015

jreback modified the milestones: 0.17.0, Next Major Release Sep 30, 2015
