Inconsistent result with applying function between dataframe/groupby apply and 0.12/0.13/master #6715

jorisvandenbossche · 2014-03-27T09:28:09Z

I stumbled upon an inconsistent behaviour in (groupby/dataframe) apply when updating some code from 0.12 to 0.13.

Say you have the following custom functions:

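import datetime as dt
import numpy as np
import pandas as pd

def P1(a):
    # 1st percentile of the non-missing values; NaN if the computation fails
    try:
        return np.percentile(a.dropna(), q=1)
    except:
        return np.nan

def P1_withouttry(a):
    # same computation, but errors propagate to the caller
    return np.percentile(a.dropna(), q=1)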

When you apply this function to a dataframe:

In [3]: df = pd.DataFrame({'col1':[1,2,3,4],'col2':[10,25,26,31],                   
     ...:                    'date':[dt.date(2013,2,10),dt.date(2013,2,10),dt.date(2013,2,11),dt.date(2013,2,11)]})

In [4]: df
Out[4]:
   col1  col2        date
0     1    10  2013-02-10
1     2    25  2013-02-10
2     3    26  2013-02-11
3     4    31  2013-02-11

In [136]: df.apply(P1)
Out[136]: 
col1     1.03
col2    10.45
date      NaN
dtype: float64

In [138]: df.apply(P1_withouttry)
Traceback (most recent call last):
  ...
TypeError: ("unsupported operand type(s) for *: 'datetime.date' and 'float'", u'occurred at index date')

This works with P1 but not with P1_withouttry, so I wrote my original function with a try/except to be able to apply it to a dataframe that also contains non-numeric columns.
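A possible alternative to trapping the exception, sketched here under the assumption that only the numeric columns are of interest, is to drop the non-numeric columns before applying the function. This relies on `DataFrame.select_dtypes`, which is only available in later pandas releases than the ones discussed in this issue; the `numeric` and `result` names are purely illustrative.

```python
import datetime as dt

import numpy as np
import pandas as pd


def P1_withouttry(a):
    # 1st percentile of the non-missing values; errors are not trapped
    return np.percentile(a.dropna(), q=1)


df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [10, 25, 26, 31],
    'date': [dt.date(2013, 2, 10), dt.date(2013, 2, 10),
             dt.date(2013, 2, 11), dt.date(2013, 2, 11)],
})

# Keep only the numeric columns, so the object-dtype 'date' column never
# reaches np.percentile and no try/except is needed.
numeric = df.select_dtypes(include=[np.number])
result = numeric.apply(P1_withouttry)
print(result)
```

This should reproduce the col1 and col2 values from `df.apply(P1)` above, without the NaN entry for the date column.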

However, when applying it to a groupby, this no longer works:

In [6]: g = df.groupby('date')

In [7]: g.apply(P1)
Out[7]:
date
2013-02-10   NaN
2013-02-11   NaN
dtype: float64

In [8]: g.apply(P1_withouttry)
Traceback (most recent call last):
   ...
TypeError: can't compare datetime.date to long


In [8]: g.agg(P1)
Out[8]:
            col1  col2
date
2013-02-10   NaN   NaN
2013-02-11   NaN   NaN

In [143]: g.agg(P1_withouttry)
Out[143]: 
            col1   col2
date                   
2013-02-10  1.01  10.15
2013-02-11  3.01  26.05

So with apply it does not work. With aggregate it does, but only for P1_withouttry, the function that failed with df.apply().
Using g.agg([P1]) works again on master, but not with 0.13.1 (where it gives the same result as g.agg(P1)), although it did work in 0.12:

In [11]: g.agg([P1])
Out[11]:
            col1   col2
              P1     P1
date
2013-02-10  1.01  10.15
2013-02-11  3.01  26.05

It was this last pattern that I was using in my code with 0.12 and that no longer works in 0.13.1 (I had something like g.agg([P1, P5, P10, P25, np.median, np.mean])).

jreback · 2014-03-27T12:06:06Z

Why it does this is actually quite complicated. It is trying to detect whether the passed function modifies the input data, which basically means that the returned shape/index is different OR an exception is raised (this is done in cython). In that case it falls back to essentially a loop in which the function is tried. An exception here WILL raise, though.

So the first application succeeds because your try/except traps the error; the second raises at the appropriate place.

I am not sure that I can fix this easily. I have something that sort of works, but it breaks other things which are 'supposed' to raise, so sorting them out will take some effort.
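The fast-path/fallback behaviour described above can be illustrated with a small, pure-Python sketch. This is not the pandas implementation (which lives in cython); `apply_with_fallback`, `total`, and `safe_total` are hypothetical names, and the "non-scalar result" check is only a stand-in for the real shape/index detection.

```python
def apply_with_fallback(func, groups):
    """Try a strict fast path first; fall back to a plain loop if it fails."""
    try:
        results = {}
        for key, values in groups.items():
            out = func(values)
            # Stand-in for "the returned shape/index is different":
            # the fast path only accepts scalar results.
            if isinstance(out, (list, tuple, dict)):
                raise ValueError("non-scalar result")
            results[key] = out
        return "fast", results
    except Exception:
        # Slow path: call the function group by group. An exception raised
        # here is not caught and propagates to the caller.
        return "slow", {key: func(values) for key, values in groups.items()}


def total(values):
    # Raises TypeError when the values mix ints and strings.
    return sum(values)


def safe_total(values):
    # Traps its own error, so the fast path never sees a failure.
    try:
        return sum(values)
    except TypeError:
        return float("nan")


groups = {"a": [1, 2], "b": [3, "x"]}

# The error is swallowed inside safe_total, so the fast path is accepted
# and the bad group silently becomes NaN.
path, result = apply_with_fallback(safe_total, groups)

# total fails on the fast path, falls back, and then raises on the slow path.
try:
    apply_with_fallback(total, groups)
    raised = False
except TypeError:
    raised = True
```

The point of the sketch is that a function which traps its own errors hides the failure from the detection step, so the NaN result is accepted as valid, while a function that raises is retried in the loop and fails there.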

jreback · 2014-03-27T12:16:39Z

@jorisvandenbossche

It appears that the results are identical in master and 0.12.

So what exactly is the issue here?

jorisvandenbossche · 2014-03-27T12:27:19Z

Yes, g.agg([P1]) works in both 0.12 and master (but not in 0.13, which is why I was looking at this). But I would also expect it to work with g.agg(P1) (without the square brackets) and with g.apply(P1).

E.g. with apply it does not work, but on a specific group it does:

In [12]: g.apply(P1)
Out[12]: 
date
2013-02-10   NaN
2013-02-11   NaN
dtype: float64

In [13]: g.get_group(dt.date(2013,2,10)).apply(P1)
Out[13]: 
col1     1.01
col2    10.15
date      NaN
dtype: float64

And on one column it also works:

In [21]: g['col1'].apply(P1)
Out[21]: 
date
2013-02-10    1.01
2013-02-11    3.01
Name: col1, dtype: float64

Also, with aggregate you get a strange result:

In [20]: g.agg(P1)
Out[20]: 
            col1  col2
date                  
2013-02-10   NaN   NaN
2013-02-11   NaN   NaN
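Since the single-column call `g['col1'].apply(P1)` gives the expected numbers, one way to sidestep the inconsistency is to apply the function column by column and reassemble the result. This is only a sketch of a workaround, not something proposed in this thread; the `per_column` name and the explicit column list are illustrative assumptions.

```python
import datetime as dt

import numpy as np
import pandas as pd


def P1(a):
    # 1st percentile of the non-missing values; NaN if the computation fails
    try:
        return np.percentile(a.dropna(), q=1)
    except Exception:
        return np.nan


df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [10, 25, 26, 31],
    'date': [dt.date(2013, 2, 10), dt.date(2013, 2, 10),
             dt.date(2013, 2, 11), dt.date(2013, 2, 11)],
})
g = df.groupby('date')

# Apply P1 to each numeric column separately, then combine the resulting
# Series (all indexed by the group key) into one DataFrame.
per_column = pd.DataFrame({col: g[col].apply(P1) for col in ['col1', 'col2']})
print(per_column)
```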

jreback · 2014-03-27T12:52:54Z

IIRC there were some bugs fixed since 0.13.1 related to this.

The last one is a bug, I think... ok.

jorisvandenbossche · 2014-03-27T13:21:43Z

The output I showed was all from current master.

jreback added Bug labels Mar 27, 2014

jreback added this to the 0.14.0 milestone Mar 27, 2014

jreback mentioned this issue Mar 27, 2014

BUG: Bug in consistency of groupby aggregation when passing a custom function (GH6715) #6718

Merged

jreback closed this as completed in #6718 Mar 28, 2014
