
Bug when combining .groupby() apply with .expanding() apply #12829


Closed
lminer opened this issue Apr 8, 2016 · 9 comments

Comments

@lminer

lminer commented Apr 8, 2016

In this example, the aim is to use an expanding window to create, by group, an expanding count of occurrences of a predetermined set of strings. There seems to be some sort of bug in the behavior of expanding when combined with groupby and apply.

In this case the strings are ['tito', 'bar', 'feep']

              category      group
2000-01-01  'foo'            a
2000-01-02  'tito'           a
2000-01-03  'bar'            a
2000-01-04  'zip'            b
2000-01-05  'zorp'           b
2000-01-03  'feep'           c

So this would become:

              category      group    count
2000-01-01  'foo'            a            0
2000-01-02  'tito'           a            1
2000-01-03  'bar'            a            2
2000-01-04  'zip'            b            0
2000-01-05  'zorp'           b            0
2000-01-03  'feep'           c            1

However, when I run the following code, the category column is simply returned unchanged instead of the counts. The same thing happens when I use a rolling window in place of expanding.

from operator import or_
from functools import reduce  # built in on Python 2; this import is needed on Python 3

import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))


def count_categories(ser):
    categories_to_count = ['tito', 'bar', 'feep']
    conditions = [ser == val for val in categories_to_count]
    mask = reduce(or_, conditions)
    return mask.sum()


def expanding_count_categories(s):
    return s.expanding().apply(count_categories)

df.groupby('group')['category'].apply(expanding_count_categories)

>> '2000-01-01'     foo
>> '2000-01-02'    tito
>> '2000-01-03'     bar
>> '2000-01-04'     zip
>> '2000-01-05'    zorp
>> '2000-01-03'    feep
>> dtype: object
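One workaround that does produce the expected counts is to turn the string test into a numeric 0/1 mask before any group-wise accumulation, since an expanding sum of a membership mask is just a group-wise cumulative sum. A minimal sketch (editor's example, using the same frame as above):

```python
import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

# Build a numeric 0/1 mask first, since the window machinery only handles
# numeric data; the expanding count of matches is then a group-wise cumsum.
mask = df['category'].isin(['tito', 'bar', 'feep']).astype(int)
df['count'] = mask.groupby(df['group']).cumsum().to_numpy()
```

This yields the `count` column from the table above (0, 1, 2, 0, 0, 1).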

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-76-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.1
pip: 8.0.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.2
sphinx: 1.2.2
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

@jreback
Contributor

jreback commented Apr 8, 2016

This is an extremely weird thing to do (and completely non-performant), keeping tuples in columns. Do something like this:

In [37]: cats = df.category.apply(Series)

In [38]: cats.columns = ['one','two']

In [39]: df2 = pd.concat([df[['group']],cats],axis=1)

In [40]: df2
Out[40]: 
           group   one     two
2000-01-01     a   foo     NaN
2000-01-02     a  tito  puente
2000-01-03     a   bar     NaN
2000-01-04     b   zip     NaN
2000-01-05     b  zorp     NaN
2000-01-03     c  feep     NaN

In [41]: df2.set_index('group').unstack().dropna()
Out[41]: 
     group
one  a           foo
     a          tito
     a           bar
     b           zip
     b          zorp
     c          feep
two  a        puente
dtype: object

@lminer
Author

lminer commented Apr 8, 2016

Sorry, there shouldn't have been any tuples in the columns; I've changed them all to strings. The problem is with the attempt to use a window method to count the occurrences of these strings. The code snippet I posted should return counts, not strings.

@jreback
Contributor

jreback commented Apr 8, 2016

In [14]: df['category2'] = df['category'].astype('category').cat.codes

In [15]: df.groupby('group').category2.apply(lambda x: x.expanding().count())
Out[15]: 
2000-01-01    1.0
2000-01-02    2.0
2000-01-03    3.0
2000-01-04    1.0
2000-01-05    2.0
2000-01-03    1.0
Name: category2, dtype: float64

# this will work in 0.18.1
In [16]: df.groupby('group').category2.expanding().count().astype(int)
Out[16]: 
group            
a      2000-01-01    1
       2000-01-02    2
       2000-01-03    3
b      2000-01-04    1
       2000-01-05    2
c      2000-01-03    1
Name: category2, dtype: int64

@lminer
Author

lminer commented Apr 8, 2016

Thanks, but I'm trying to count only the rows whose string is in the list ['tito', 'bar', 'feep']. It still seems like unexpected behavior that the approach I'm using isn't even returning numbers.

@jreback
Contributor

jreback commented Apr 8, 2016

then just pre-filter first.

@lminer
Author

lminer commented Apr 8, 2016

It seems so easy once you say it.

@jreback
Contributor

jreback commented Apr 8, 2016

.expanding does not handle non-numerics ATM #12541
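That limitation is easy to see in isolation: window operations only aggregate numeric data. A small sketch (editor's example; the exact behavior on object dtype has varied by version, so the `except` is deliberately broad):

```python
import pandas as pd

# On numeric data, expanding().apply behaves as expected: a running sum.
s_num = pd.Series([1.0, 2.0, 3.0])
out = s_num.expanding().apply(lambda x: x.sum())  # 1.0, 3.0, 6.0

# On object (string) dtype the same call does not compute anything useful:
# 0.18.0 handed the original values back, and later versions raise instead.
s_obj = pd.Series(['a', 'b', 'c'])
try:
    s_obj.expanding().apply(lambda x: len(x))
except Exception as exc:  # e.g. a "no numeric types to aggregate" error
    print(type(exc).__name__)
```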

@jreback
Contributor

jreback commented Apr 8, 2016

just df[df.category.isin([.....])]
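Spelled out with the example frame from this thread, one sketch of the pre-filter approach (editor's example): after filtering to the matching rows, the expanding count of matches within each group is just the running row number.

```python
import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

# Keep only the rows whose category is in the target list...
hits = df[df['category'].isin(['tito', 'bar', 'feep'])]

# ...then count matches within each group as a running row number
# (cumcount is 0-based, so add 1 to get 1, 2, ... per group).
running = hits.groupby('group').cumcount() + 1
```

Here `running` is 1, 2 for group a's matches ('tito', 'bar') and 1 for group c's ('feep'); the filtered-out rows keep their implicit count of 0.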

@lminer
Author

lminer commented Apr 8, 2016

Ah! All is clear. Thanks!
