
Inconsistent result with applying function between dataframe/groupby apply and 0.12/0.13/master #6715

Closed
jorisvandenbossche opened this issue Mar 27, 2014 · 5 comments · Fixed by #6718

@jorisvandenbossche
Member

I stumbled upon an inconsistent behaviour in (groupby/dataframe) apply when updating some code from 0.12 to 0.13.

Say you have the following custom functions:

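import datetime as dt
import numpy as np
import pandas as pd

def P1(a):
    # 1st percentile of the non-missing values; NaN if the computation fails
    try:
        return np.percentile(a.dropna(), q=1)
    except:
        return np.nan

def P1_withouttry(a):
    # Same computation, but errors propagate to the caller
    return np.percentile(a.dropna(), q=1)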

When you apply these functions to a dataframe:

In [3]: df = pd.DataFrame({'col1':[1,2,3,4],'col2':[10,25,26,31],                   
     ...:                    'date':[dt.date(2013,2,10),dt.date(2013,2,10),dt.date(2013,2,11),dt.date(2013,2,11)]})

In [4]: df
Out[4]:
   col1  col2        date
0     1    10  2013-02-10
1     2    25  2013-02-10
2     3    26  2013-02-11
3     4    31  2013-02-11

In [136]: df.apply(P1)
Out[136]: 
col1     1.03
col2    10.45
date      NaN
dtype: float64

In [138]: df.apply(P1_withouttry)
Traceback (most recent call last):
  ...
TypeError: ("unsupported operand type(s) for *: 'datetime.date' and 'float'", u'occurred at index date')

This works with P1, but not with P1_withouttry. That is why I wrote my original function with a try/except: to be able to apply it to a dataframe that also contains non-numeric columns.
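As an aside, a possible alternative to trapping the error inside the function is to drop the non-numeric columns before calling apply. The sketch below assumes a pandas version that provides DataFrame.select_dtypes (which may not be available in the versions discussed in this issue) and rebuilds the example frame so that it runs on its own:

```python
import datetime as dt

import numpy as np
import pandas as pd


def P1_withouttry(a):
    # 1st percentile of the non-missing values; raises on non-numeric data.
    return np.percentile(a.dropna(), q=1)


df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [10, 25, 26, 31],
    'date': [dt.date(2013, 2, 10), dt.date(2013, 2, 10),
             dt.date(2013, 2, 11), dt.date(2013, 2, 11)],
})

# Keep only the numeric columns, so the function never sees 'date'.
numeric = df.select_dtypes(include='number')
result = numeric.apply(P1_withouttry)
print(result)
```

For this frame the result should contain only col1 and col2 (1.03 and 10.45, matching the df.apply(P1) output above), with no NaN entry for date.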

However, when applying P1 to a groupby, this no longer works:

In [6]: g = df.groupby('date')

In [7]: g.apply(P1)
Out[7]:
date
2013-02-10   NaN
2013-02-11   NaN
dtype: float64

In [8]: g.apply(P1_withouttry)
Traceback (most recent call last):
   ...
TypeError: can't compare datetime.date to long


In [8]: g.agg(P1)
Out[8]:
            col1  col2
date
2013-02-10   NaN   NaN
2013-02-11   NaN   NaN

In [143]: g.agg(P1_withouttry)
Out[143]: 
            col1   col2
date                   
2013-02-10  1.01  10.15
2013-02-11  3.01  26.05

So with apply it does not work; with aggregate it does, but only for P1_withouttry, the function that did not work with df.apply().
Using g.agg([P1]) does work again on master, but not with 0.13.1 (where it gives the same result as g.agg(P1)), although it did work in 0.12:

In [11]: g.agg([P1])
Out[11]:
            col1   col2
              P1     P1
date
2013-02-10  1.01  10.15
2013-02-11  3.01  26.05

It was this last pattern that I was using in my code with 0.12 and that no longer works in 0.13.1 (I had something like g.agg([P1, P5, P10, P25, np.median, np.mean])).
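A possible workaround, sketched here under the assumption that the numeric columns are known up front, is to select them on the groupby object before aggregating, so the function is never handed the date column. The column list ['col1', 'col2'] is taken from the example frame; this has not been checked against 0.12 or 0.13.1 and is only meant to illustrate the idea:

```python
import datetime as dt

import numpy as np
import pandas as pd


def P1_withouttry(a):
    # 1st percentile of the non-missing values; raises on non-numeric data.
    return np.percentile(a.dropna(), q=1)


df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [10, 25, 26, 31],
    'date': [dt.date(2013, 2, 10), dt.date(2013, 2, 10),
             dt.date(2013, 2, 11), dt.date(2013, 2, 11)],
})

g = df.groupby('date')

# Restrict the groupby to the numeric columns, then aggregate per group.
result = g[['col1', 'col2']].agg(P1_withouttry)
print(result)
```

The values should match the g.agg(P1_withouttry) output above (1.01 and 10.15 for the first date, 3.01 and 26.05 for the second).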

@jreback
Contributor

jreback commented Mar 27, 2014

Why it does this is actually quite complicated. It is trying to detect whether the passed function modifies the input data, which basically means the returned shape/index is different OR an exception is raised (this is done in cython). In that case it falls back to essentially a loop in which the function is tried again. An exception here WILL raise, though.

So the first application succeeds because of the try/except, as you trap the error; the second raises at the appropriate place.

I am not sure that I can fix this easily. I have something that sort of works, but it breaks other things which are 'supposed' to raise, so sorting them out will take some effort.
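The interaction described above can be illustrated with a toy model in plain Python. This is not the pandas or cython code; aggregate, mean and safe_mean are hypothetical names, and the group is a plain dict of lists. It only shows why a function that swallows its own errors never triggers the fallback:

```python
import math


def mean(values):
    # Raises TypeError when values contains non-numeric items.
    return sum(values) / len(values)


def safe_mean(values):
    # Same computation, but every error is swallowed and turned into NaN.
    try:
        return mean(values)
    except Exception:
        return math.nan


def aggregate(group, func, key='date'):
    """Toy aggregation of one group, given as {column name: list of values}."""
    try:
        # First attempt: hand func the whole group at once (all columns).
        everything = [v for col in group.values() for v in col]
        return func(everything)
    except Exception:
        # Fallback: apply func column by column, skipping the key column.
        # An exception raised here propagates to the caller.
        return {name: func(col) for name, col in group.items() if name != key}


group = {'col1': [1, 2], 'col2': [10, 25],
         'date': ['2013-02-10', '2013-02-10']}

print(aggregate(group, mean))       # first attempt raises, fallback runs
print(aggregate(group, safe_mean))  # first attempt "succeeds" with NaN
```

In this toy, mean reaches the per-column fallback because its error is visible to the caller, whereas safe_mean hides the error and its NaN is accepted as the result, which mirrors the NaN output reported for P1.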

@jreback
Contributor

jreback commented Mar 27, 2014

@jorisvandenbossche

It appears that the results are identical in master and 0.12.

So what exactly is the issue here?

@jorisvandenbossche
Member Author

Yes, g.agg([P1]) does work in both 0.12 and master (but not in 0.13, which is why I was looking at this). But I would also expect it to work with g.agg(P1) (without the square brackets) and with g.apply(P1)?

E.g., with apply it does not work on the groupby, but on a specific group it does:

In [12]: g.apply(P1)
Out[12]: 
date
2013-02-10   NaN
2013-02-11   NaN
dtype: float64

In [13]: g.get_group(dt.date(2013,2,10)).apply(P1)
Out[13]: 
col1     1.01
col2    10.15
date      NaN
dtype: float64

And on a single column it also works:

In [21]: g['col1'].apply(P1)
Out[21]: 
date
2013-02-10    1.01
2013-02-11    3.01
Name: col1, dtype: float64

With aggregate you also get a strange result:

In [20]: g.agg(P1)
Out[20]: 
            col1  col2
date                  
2013-02-10   NaN   NaN
2013-02-11   NaN   NaN

@jreback
Contributor

jreback commented Mar 27, 2014

IIRC there were some bugs fixed since 0.13.1 related to this.

The last one is a bug, I think. OK.

@jorisvandenbossche
Member Author

The output I showed was all from current master.
