
Bug when combining .groupby() apply with .expanding() apply #12829


Closed
lminer opened this issue Apr 8, 2016 · 9 comments

Comments

@lminer

lminer commented Apr 8, 2016

In this example, the aim is to use an expanding window to create, by group, an expanding count of occurrences of a predetermined set of strings. There seems to be some sort of bug in the behavior of expanding when combined with groupby and apply.

In this case the strings are ['tito', 'bar', 'feep']

              category      group
2000-01-01  'foo'            a
2000-01-02  'tito'           a
2000-01-03  'bar'            a
2000-01-04  'zip'            b
2000-01-05  'zorp'           b
2000-01-03  'feep'           c

So this would become:

              category      group    count
2000-01-01  'foo'            a            0
2000-01-02  'tito'           a            1
2000-01-03  'bar'            a            2
2000-01-04  'zip'            b            0
2000-01-05  'zorp'           b            0
2000-01-03  'feep'           c            1

However, when I run the following code, the category column is simply returned unchanged instead of the counts. The same thing happens when I use a rolling window in place of expanding.

from operator import or_
from functools import reduce  # built in on Python 2; this import is needed on Python 3

import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))


def count_categories(ser):
    categories_to_count = ['tito', 'bar', 'feep']
    conditions = [ser == val for val in categories_to_count]
    mask = reduce(or_, conditions)
    return mask.sum()


def expanding_count_categories(s):
    return s.expanding().apply(count_categories)

df.groupby('group')['category'].apply(expanding_count_categories)

>> '2000-01-01'     foo
>> '2000-01-02'    tito
>> '2000-01-03'     bar
>> '2000-01-04'     zip
>> '2000-01-05'    zorp
>> '2000-01-03'    feep
>> dtype: object
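One workaround that does produce the expected counts is to turn the string test into a numeric 0/1 mask before any group-wise accumulation, since an expanding sum of a membership mask is just a group-wise cumulative sum. A minimal sketch (editor's example, using the same frame as above):

```python
import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

# Build a numeric 0/1 mask first, since the window machinery only handles
# numeric data; the expanding count of matches is then a group-wise cumsum.
mask = df['category'].isin(['tito', 'bar', 'feep']).astype(int)
df['count'] = mask.groupby(df['group']).cumsum().to_numpy()
```

This yields the `count` column from the table above (0, 1, 2, 0, 0, 1).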

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-76-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.1
pip: 8.0.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.2
sphinx: 1.2.2
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

@jreback
Contributor

jreback commented Apr 8, 2016

This is an extremely weird thing to do (and completely non-performant), keeping tuples in columns. Do something like this:

In [37]: cats = df.category.apply(Series)

In [38]: cats.columns = ['one','two']

In [39]: df2 = pd.concat([df[['group']],cats],axis=1)

In [40]: df2
Out[40]: 
           group   one     two
2000-01-01     a   foo     NaN
2000-01-02     a  tito  puente
2000-01-03     a   bar     NaN
2000-01-04     b   zip     NaN
2000-01-05     b  zorp     NaN
2000-01-03     c  feep     NaN

In [41]: df2.set_index('group').unstack().dropna()
Out[41]: 
     group
one  a           foo
     a          tito
     a           bar
     b           zip
     b          zorp
     c          feep
two  a        puente
dtype: object

@lminer
Author

lminer commented Apr 8, 2016

Sorry, there shouldn't have been any tuples in the columns; I've changed them all to strings. The problem is with the attempt to use a window method to count the occurrences of these strings. The code snippet I posted should return counts, not strings.

@jreback
Contributor

jreback commented Apr 8, 2016

In [14]: df['category2'] = df['category'].astype('category').cat.codes

In [15]: df.groupby('group').category2.apply(lambda x: x.expanding().count())
Out[15]: 
2000-01-01    1.0
2000-01-02    2.0
2000-01-03    3.0
2000-01-04    1.0
2000-01-05    2.0
2000-01-03    1.0
Name: category2, dtype: float64

# this will work in 0.18.1
In [16]: df.groupby('group').category2.expanding().count().astype(int)
Out[16]: 
group            
a      2000-01-01    1
       2000-01-02    2
       2000-01-03    3
b      2000-01-04    1
       2000-01-05    2
c      2000-01-03    1
Name: category2, dtype: int64

@lminer
Author

lminer commented Apr 8, 2016

Thanks, but I'm trying to count only the rows whose string is in the list ['tito', 'bar', 'feep']. It still seems like unexpected behavior that the approach I'm using isn't even returning numbers.

@jreback
Contributor

jreback commented Apr 8, 2016

then just pre-filter first.

@lminer
Author

lminer commented Apr 8, 2016

It seems so easy once you say it.

@jreback
Contributor

jreback commented Apr 8, 2016

.expanding does not handle non-numerics ATM #12541
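That limitation is easy to see in isolation: window operations only aggregate numeric data. A small sketch (editor's example; the exact behavior on object dtype has varied by version, so the `except` is deliberately broad):

```python
import pandas as pd

# On numeric data, expanding().apply behaves as expected: a running sum.
s_num = pd.Series([1.0, 2.0, 3.0])
out = s_num.expanding().apply(lambda x: x.sum())  # 1.0, 3.0, 6.0

# On object (string) dtype the same call does not compute anything useful:
# 0.18.0 handed the original values back, and later versions raise instead.
s_obj = pd.Series(['a', 'b', 'c'])
try:
    s_obj.expanding().apply(lambda x: len(x))
except Exception as exc:  # e.g. a "no numeric types to aggregate" error
    print(type(exc).__name__)
```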

@jreback
Contributor

jreback commented Apr 8, 2016

just df[df.category.isin([.....])]
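Spelled out with the example frame from this thread, one sketch of the pre-filter approach (editor's example): after filtering to the matching rows, the expanding count of matches within each group is just the running row number.

```python
import pandas as pd

df = pd.DataFrame({'category': ['foo', 'tito', 'bar', 'zip', 'zorp', 'feep'],
                   'group': ['a', 'a', 'a', 'b', 'b', 'c']},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                        '2000-01-04', '2000-01-05', '2000-01-03']))

# Keep only the rows whose category is in the target list...
hits = df[df['category'].isin(['tito', 'bar', 'feep'])]

# ...then count matches within each group as a running row number
# (cumcount is 0-based, so add 1 to get 1, 2, ... per group).
running = hits.groupby('group').cumcount() + 1
```

Here `running` is 1, 2 for group a's matches ('tito', 'bar') and 1 for group c's ('feep'); the filtered-out rows keep their implicit count of 0.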

@lminer
Author

lminer commented Apr 8, 2016

Ah! All is clear. Thanks!
