groupby, as_index=False, with pandas.Series.count() as an agg #8381

Closed
marctollin opened this issue Sep 24, 2014 · 16 comments
@marctollin

Why doesn't the pandas.Series.count() method work as a valid aggregation with groupby when as_index=False?

import numpy as np
import pandas as pd

df = pd.DataFrame([['foo', 'foo', 'bar', 'bar', 'bar', 'oats'], [1.0, 2.0, 3.0, 4.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 1.0, 5.0]]).T
df.columns = ['mycat', 'var1', 'var2']
df.var1 = df.var1.astype('int64')
df.var2 = df.var2.astype('int64')
df

[image: df]

Now, if I try to do a group by

df.groupby('mycat', as_index=False).var1.count() 

Here is the error I get:


ValueError                                Traceback (most recent call last)
<ipython-input-383-27b244bccb8b> in <module>()
----> 1 df.groupby('mycat', as_index=False).var1.count()

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in count(self, axis)
    740 
    741     def count(self, axis=0):
--> 742         return self._count().astype('int64')
    743 
    744     def ohlc(self):

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error)
   2096 
   2097         mgr = self._data.astype(
-> 2098             dtype=dtype, copy=copy, raise_on_error=raise_on_error)
   2099         return self._constructor(mgr).__finalize__(self)
   2100 

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, **kwargs)
   2235 
   2236     def astype(self, dtype, **kwargs):
-> 2237         return self.apply('astype', dtype=dtype, **kwargs)
   2238 
   2239     def convert(self, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
   2190                                                  copy=align_copy)
   2191 
-> 2192             applied = getattr(b, f)(**kwargs)
   2193 
   2194             if isinstance(applied, list):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values)
    319     def astype(self, dtype, copy=False, raise_on_error=True, values=None):
    320         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 321                             values=values)
    322 
    323     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
    337             if values is None:
    338                 # _astype_nansafe works fine with 1-d only
--> 339                 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
    340                 values = values.reshape(self.values.shape)
    341             newb = make_block(values,

/usr/local/lib/python2.7/dist-packages/pandas/core/common.pyc in _astype_nansafe(arr, dtype, copy)
   2410     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
   2411         # work around NumPy brokenness, #1987
-> 2412         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   2413     elif issubclass(dtype.type, compat.string_types):
   2414         return lib.astype_str(arr.ravel()).reshape(arr.shape)

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in pandas.lib.astype_intsafe (pandas/lib.c:13456)()

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in util.set_value_at (pandas/lib.c:55994)()

ValueError: invalid literal for long() with base 10: 'bar'

When I set as_index=True, I get:

df.groupby('mycat', as_index=True).var1.count()

[image: df2]

When I change the agg function and set as_index=False, I get a weird result too:

df.groupby('mycat', as_index=False).var1.agg(np.count_nonzero)

[image: df3]

UPDATE: Realized my last result was not counting correctly and am now thoroughly confused.

@jreback
Contributor

jreback commented Sep 24, 2014

show pd.show_versions()

@marctollin
Author


INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.1
Cython: None
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

@marctollin
Author

I updated my original comment. I realized my last example didn't make sense: it was counting strangely (according to my intuition).

@jreback
Contributor

jreback commented Sep 24, 2014

count has a different implementation than most other methods (which are either in cython or are essentially a python loop over the groups). You get count for 'free', as it is available by definition when you group. The other routines handle nuisance columns (e.g. trying to perform a numeric operation on a string column) by excluding them; count needs to do the same. This works for as_index=True because by definition the grouping column is excluded.

care to do a pull-request?
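
One possible workaround in the meantime, sketched under the assumption that grouping with the default as_index=True and calling reset_index() afterwards is acceptable. The frame is built from a dict for brevity rather than with the transpose used in the report, and the names counts and flat_counts are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "mycat": ["foo", "foo", "bar", "bar", "bar", "oats"],
    "var1": [1, 2, 3, 4, 4, 5],
    "var2": [2, 3, 4, 5, 1, 5],
})

# With as_index=True (the default) the grouping column becomes the index,
# so count() only ever sees the numeric value columns.
counts = df.groupby("mycat")[["var1", "var2"]].count()

# Move the group labels back into an ordinary column afterwards.
flat_counts = counts.reset_index()
print(flat_counts)
```

This sidesteps the string column entirely instead of fixing how count treats it, so it is a stopgap rather than a fix for the bug.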

@jreback added this to the 0.15.1 milestone Sep 24, 2014

@marctollin
Author

Sorry, but I have no C programming experience, otherwise I would help. Hopefully others find value in fixing this bug.

@jreback
Contributor

jreback commented Sep 24, 2014

no c involved

all python

@marctollin
Author

OK! @jcauteru and I will give it a try.

@jreback
Contributor

jreback commented Sep 24, 2014

gr8!

create a test to compare against an expected result
then step through to figure out where it's going wrong

keep in mind there are lots of other tests which have to pass as well
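
A minimal sketch of what such a test might look like. The test name and the expected frame are assumptions on my part (they reflect how a fixed count would presumably behave, not an existing test in the pandas suite), and pandas.testing is the module path in recent pandas releases rather than the one available at the time of this thread:

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def test_count_with_as_index_false():
    # Hypothetical regression test for the failing call in this report.
    df = pd.DataFrame({
        "mycat": ["foo", "foo", "bar", "bar", "bar", "oats"],
        "var1": [1, 2, 3, 4, 4, 5],
        "var2": [2, 3, 4, 5, 1, 5],
    })

    result = df.groupby("mycat", as_index=False)["var1"].count()

    # Assumed expected result: one row per group, with the group label
    # kept as an ordinary column and the count stored as an integer.
    expected = pd.DataFrame({
        "mycat": ["bar", "foo", "oats"],
        "var1": [3, 2, 1],
    })
    assert_frame_equal(result, expected)
```

Comparing whole frames this way also checks dtypes, which matters here because the dtype of the count column is what the existing tests are strict about.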

@marctollin
Author

FYI, I haven't forgotten about this. What's going wrong is that astype('int64') is being applied to the nuisance columns (the strings). The bug can be fixed (at least for the small test case originally posted) by removing the requirement that the count is of dtype int64 or, alternatively, by passing the function to _python_agg_general, which iterates through everything except the exclusions in groupby.py.

Both of these fixes fail the nose tests (primarily AssertionError: attr is not equal [dtype]: dtype('float64') != dtype('int64')), so I'm exploring a different method, perhaps requiring int64 at a different point in the routine. @jcauteru
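
The failing step can be reproduced in isolation, outside of groupby. This is only an illustration of the cast described above, not the code path inside groupby.py, and the variable names are made up for the example:

```python
import pandas as pd

# An object column of group labels, like 'mycat' in the report.
labels = pd.Series(["foo", "bar", "oats"], dtype=object)

try:
    # The same kind of cast that count() applies to its result.
    labels.astype("int64")
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

Any string that cannot be parsed as an integer triggers the ValueError, which is why the cast has to be kept away from the grouping column.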

@jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

@livia-b

livia-b commented Jul 25, 2015

I am experiencing a similar problem when the column used for groupby is of type float. No exception is raised, but the resulting column is cast to int64:

      bug_dataset = pd.DataFrame(np.array([[0., 0.], [-0.9, 1]]), columns=['x', 'y'])
      print(bug_dataset.groupby('x', as_index=False).count())

   x  y
0  0  1
1  0  1
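
The zeros in the x column are consistent with a plain integer cast of the float keys. A small sketch of that cast on its own, assuming (as the earlier diagnosis in this thread suggests) that astype('int64') is applied to the grouping column as well:

```python
import numpy as np

# The two float group keys from the example above.
keys = np.array([0.0, -0.9])

# Casting to int64 truncates toward zero, so -0.9 becomes 0
# and the two distinct keys become indistinguishable.
as_int = keys.astype("int64")
print(as_int.tolist())  # [0, 0]
```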

My version is:
INSTALLED VERSIONS

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-86-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.18
numpy: 1.9.2
scipy: 0.9.0
statsmodels: 0.6.1
IPython: 3.0.0-b1
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)

@livia-b

livia-b commented Jul 25, 2015

Is this issue related to #10355 ?

@jreback
Contributor

jreback commented Jul 26, 2015

they look similar. You want to take a crack at writing some tests and use the fix I suggested in that issue to see if it fixes this?

@livia-b

livia-b commented Jul 26, 2015

I will try, but it will take some time. I am still having problems with the test environment: I cloned the repository and ran the existing tests before changing any code, and I get FAILED (SKIP=543, errors=1, failures=2). Now I am trying to check out a release tag. I am also using a less powerful computer for development.

@stephenpascoe

I've been stung with this issue too. Running @livia-b's test on latest master gives a different error:

In [17]: df.groupby('mycat', as_index=False).var1.count()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-27b244bccb8b> in <module>()
----> 1 df.groupby('mycat', as_index=False).var1.count()

/Users/spascoe/git/pandas/pandas/core/groupby.pyc in count(self)
   3510         #import pdb; pdb.set_trace()
   3511 
-> 3512         return self._wrap_agged_blocks(data.items, list(blk))
   3513 
   3514 

/Users/spascoe/git/pandas/pandas/lib.pyx in pandas.lib.count_level_2d (pandas/lib.c:22921)()
   1271         with nogil:
   1272             for i from 0 <= i < n:
-> 1273                 for j from 0 <= j < k:
   1274                     counts[labels[i], j] += mask[i, j]
   1275 

TypeError: count_level_2d() got an unexpected keyword argument 'axis'

It works for as_index=True but this doesn't work. I thought this should return a dataframe counting var1 and var2:

In [18]: df.groupby('mycat').count()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-921f07800bb6> in <module>()
----> 1 df.groupby('mycat').count()

/Users/spascoe/git/pandas/pandas/core/groupby.pyc in count(self)
   3510         #import pdb; pdb.set_trace()
   3511 
-> 3512         return self._wrap_agged_blocks(data.items, list(blk))
   3513 
   3514 

/Users/spascoe/git/pandas/pandas/lib.pyx in pandas.lib.count_level_2d (pandas/lib.c:22921)()
   1271         with nogil:
   1272             for i from 0 <= i < n:
-> 1273                 for j from 0 <= j < k:
   1274                     counts[labels[i], j] += mask[i, j]
   1275 

TypeError: count_level_2d() got an unexpected keyword argument 'axis'

@stephenpascoe

It seems to work for SeriesGroupBy objects but not DataFrameGroupBy. We only get a SeriesGroupBy if we select a column after using as_index=True:

>>> type(df.groupby('mycat').var1)
<class 'pandas.core.groupby.SeriesGroupBy'>

>>> type(df.groupby('mycat'))
<class 'pandas.core.groupby.DataFrameGroupBy'>
>>> type(df.groupby('mycat', as_index=False))
<class 'pandas.core.groupby.DataFrameGroupBy'>
>>> type(df.groupby('mycat', as_index=False).var1)
<class 'pandas.core.groupby.DataFrameGroupBy'>

@jreback
Contributor

jreback commented Sep 30, 2015

These all are fixed in master (you need a very recent master)

In [11]: df=pd.DataFrame([['foo','foo','bar','bar','bar','oats'],[1.0,2.0,3.0,4.0,4.0,5.0],[2.0,3.0,4.0,5.0,1.0,5.0]]).T

In [12]: df.columns=['mycat','var1','var2']

In [13]: df.var1=df.var1.astype('int64')

In [14]: df.var2=df.var2.astype('int64')

In [15]: df
Out[15]: 
  mycat  var1  var2
0   foo     1     2
1   foo     2     3
2   bar     3     4
3   bar     4     5
4   bar     4     1
5  oats     5     5

In [17]: df.groupby('mycat').var1.count()
Out[17]: 
mycat
bar     3
foo     2
oats    1
Name: var1, dtype: int64

In [18]: pd.__version__
Out[18]: '0.17.0rc1+115.g274abee'

In [19]: df.groupby('mycat').count()
Out[19]: 
       var1  var2
mycat            
bar       3     3
foo       2     2
oats      1     1

xref #11079 (it was fixed in an earlier commit)

@jreback closed this as completed Sep 30, 2015
@jreback modified the milestones: 0.17.0, Next Major Release Sep 30, 2015