ENH: groupby().apply(f) accepts combine=0 arg, to return results unmolested #3241 #3242

Closed
wants to merge 1 commit into from

@ghost ghost commented Apr 2, 2013

#3241

Right now:

In [16]: df=mkdf(10,2,data_gen_f=lambda x,y: randint(1,10))
    ...: df
    ...: 
    ...: 
    ...: 
Out[16]: 
C0       C_l0_g0  C_l0_g1
R0                       
R_l0_g0        9        1
R_l0_g1        3        7
R_l0_g2        8        1
R_l0_g3        4        3
R_l0_g4        5        3
R_l0_g5        7        2
R_l0_g6        4        1
R_l0_g7        5        4
R_l0_g8        9        7
R_l0_g9        4        8

In [17]: def f1(g):
    ...:     return g.sort('C_l0_g0')
    ...: # group on the suffix of the running index 
    ...: g=df.groupby(lambda key: int(key.split("g")[-1]) >= 5)
    ...: r=g.apply(f1)
    ...: 

# we want to return each group dataframe sorted, but we get concatted against our will
In [18]: r
Out[18]: 
C0             C_l0_g0  C_l0_g1
      R0                       
False R_l0_g1        3        7
      R_l0_g3        4        3
      R_l0_g4        5        3
      R_l0_g2        8        1
      R_l0_g0        9        1
True  R_l0_g6        4        1
      R_l0_g9        4        8
      R_l0_g7        5        4
      R_l0_g5        7        2
      R_l0_g8        9        7

# what we want really, is a couple of sorted dataframes:

In [20]: map(lambda r: r[1].sort('C_l0_g0'),g)
Out[20]: 
[C0       C_l0_g0  C_l0_g1
R0                       
R_l0_g1        3        7
R_l0_g3        4        3
R_l0_g4        5        3
R_l0_g2        8        1
R_l0_g0        9        1,
 C0       C_l0_g0  C_l0_g1
R0                       
R_l0_g6        4        1
R_l0_g9        4        8
R_l0_g7        5        4
R_l0_g5        7        2
R_l0_g8        9        7]
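For readers on a current pandas, where `DataFrame.sort` has been replaced by `sort_values`, the "couple of sorted dataframes" result can be sketched by iterating over the groupby object directly. This is only an illustration of the desired output, not the code from this PR; the frame is rebuilt by hand from the values shown above, and the helper name `in_upper_half` is made up for the example.

```python
import pandas as pd

# Rebuild the example frame by hand (values copied from Out[16] above).
df = pd.DataFrame(
    {
        "C_l0_g0": [9, 3, 8, 4, 5, 7, 4, 5, 9, 4],
        "C_l0_g1": [1, 7, 1, 3, 3, 2, 1, 4, 7, 8],
    },
    index=["R_l0_g%d" % i for i in range(10)],
)


def in_upper_half(key):
    """Hypothetical helper: group on the numeric suffix of the index label."""
    return int(key.split("g")[-1]) >= 5


grouped = df.groupby(in_upper_half)

# Iterating a groupby yields (group key, sub-frame) pairs, so each group can
# be sorted on its own and kept separate instead of being concatenated.
sorted_groups = [(name, group.sort_values("C_l0_g0")) for name, group in grouped]

for name, frame in sorted_groups:
    print(name)
    print(frame)
```

The list of `(key, frame)` pairs has the same shape as the `r1` result the PR proposes to return.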

With this PR:

In [21]: def f1(g): # same f1 as above
    ...:     return g.sort('C_l0_g0')
    ...: def f2(g,raw=None):
    ...:     return g.sort('C_l0_g0')
    ...: def f3(g,**kwds):
    ...:     return g.sort('C_l0_g0')
    ...: # the  `raw` keyword is the new bit
    ...: r1=g.apply(f1,raw=True)
    ...: r2=g.apply(f2,raw=True)
    ...: r3=g.apply(f3,raw=True)
    ...: 
# a bunch of sorted frames
In [22]: print r1
[(False, C0       C_l0_g0  C_l0_g1
R0                       
R_l0_g1        3        7
R_l0_g3        4        3
R_l0_g4        5        3
R_l0_g2        8        1
R_l0_g0        9        1), (True, C0       C_l0_g0  C_l0_g1
R0                       
R_l0_g6        4        1
R_l0_g9        4        8
R_l0_g7        5        4
R_l0_g5        7        2
R_l0_g8        9        7)]

# but not if the transformer function signature uses **kwds, or 'raw' already
In [23]: print r2
C0             C_l0_g0  C_l0_g1
      R0                       
False R_l0_g1        3        7
      R_l0_g3        4        3
      R_l0_g4        5        3
      R_l0_g2        8        1
      R_l0_g0        9        1
True  R_l0_g6        4        1
      R_l0_g9        4        8
      R_l0_g7        5        4
      R_l0_g5        7        2
      R_l0_g8        9        7

In [24]: print r3
C0             C_l0_g0  C_l0_g1
      R0                       
False R_l0_g1        3        7
      R_l0_g3        4        3
      R_l0_g4        5        3
      R_l0_g2        8        1
      R_l0_g0        9        1
True  R_l0_g6        4        1
      R_l0_g9        4        8
      R_l0_g7        5        4
      R_l0_g5        7        2
      R_l0_g8        9        7
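The fallback shown for `r2` and `r3` follows from `apply` forwarding all keyword arguments to the user's function: if that function could itself receive `raw`, the keyword has to go to the function. A minimal sketch of that decision, using plain lists in place of DataFrames, might look like the following. The names `accepts_keyword` and `apply_groups` are hypothetical and are not taken from the PR's diff.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func could receive `name` as a keyword argument."""
    for param in inspect.signature(func).parameters.values():
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True  # **kwds swallows any keyword
        if param.name == name and param.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        ):
            return True
    return False


def apply_groups(groups, func, **kwargs):
    """Hypothetical stand-in for groupby().apply over (key, list) pairs."""
    raw = False
    if "raw" in kwargs and not accepts_keyword(func, "raw"):
        # The function cannot take `raw`, so treat it as our own option.
        raw = kwargs.pop("raw")
    results = [(key, func(values, **kwargs)) for key, values in groups]
    if raw:
        return results  # per-group results, left uncombined
    # Default behaviour: combine everything into one flat result.
    return [item for _, result in results for item in result]


def f1(g):
    return sorted(g)


def f2(g, raw=None):
    return sorted(g)


def f3(g, **kwds):
    return sorted(g)


groups = [(False, [9, 3, 8]), (True, [7, 4])]
r1 = apply_groups(groups, f1, raw=True)
r2 = apply_groups(groups, f2, raw=True)
r3 = apply_groups(groups, f3, raw=True)
print(r1)
print(r2)
print(r3)
```

Only `f1` leaves the keyword free for the wrapper, so only `r1` comes back as separate per-group results.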

@jreback
Contributor

jreback commented Apr 2, 2013

doesn't apply already have a raw argument?

    def apply(self, func, axis=0, broadcast=False, raw=False,
              args=(), **kwds):
        """
        Applies function along input axis of DataFrame. Objects passed to
        functions are Series objects having index either the DataFrame's index
        (axis=0) or the columns (axis=1). Return type depends on whether passed
        function aggregates

        Parameters
        ----------
        func : function
            Function to apply to each column
        axis : {0, 1}
            0 : apply function to each column
            1 : apply function to each row
        broadcast : bool, default False
            For aggregation functions, return object of same size with values
            propagated
        raw : boolean, default False
            If False, convert each row or column into a Series. If raw=True the
            passed function will receive ndarray objects instead. If you are
            just applying a NumPy reduction function this will achieve much
            better performance
        args : tuple
            Positional arguments to pass to function in addition to the
            array/series
        Additional keyword arguments will be passed as keywords to the function
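To make the existing meaning of `raw` in `DataFrame.apply` concrete, here is a small sketch based on the docstring quoted above: with `raw=True` the function receives NumPy arrays rather than Series. The frame and the helper `total` are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

seen_default = []
seen_raw = []


def total(col, log):
    """Record the type of object passed in, then reduce it."""
    log.append(type(col).__name__)
    return col.sum()


# Default: each column arrives as a Series.
default_result = df.apply(total, args=(seen_default,))

# raw=True: each column arrives as a plain ndarray.
raw_result = df.apply(total, raw=True, args=(seen_raw,))

print(seen_default, seen_raw)
print(raw_result)
```

Both calls reduce each column to the same sums; only the type handed to the function differs, which is unrelated to whether group results get combined.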

@jreback
Contributor

jreback commented Apr 2, 2013

sorry....you mean the groupby apply....retract my comment

@ghost
Author

ghost commented Apr 2, 2013

No, that's a good comment, I wasn't aware of that arg which has a different meaning.
Would prefer to have consistent arg names across pandas, if raw means something else in
another apply function it's probably better to find another name. Are you aware of similar
functionality + name elsewhere in the API?

@jreback
Contributor

jreback commented Apr 2, 2013

NTMK, maybe rename yours to combine=True? (and the behavior you exhibit would be combine=False)

@ghost
Author

ghost commented Apr 2, 2013

sold.

@jreback
Contributor

jreback commented Apr 2, 2013

this looks good....maybe add to whatsnew/docs? (as it's an interesting case)

@ghost
Author

ghost commented Apr 2, 2013

will do, but in 0.12. I'm not putting in anything new at this point in the release cycle.

@ghost
Author

ghost commented Apr 2, 2013

Now that I think of it, would be partly mitigated by a df.split_on(nlevels=1) or something similar
(related #3066), although this is still useful I think.

@jreback
Contributor

jreback commented Apr 2, 2013

yep....

@jreback
Contributor

jreback commented Sep 21, 2013

@y-p forgot you did this....hmm....let's resurrect in 0.14.....

@@ -307,6 +307,9 @@ def apply(self, func, *args, **kwargs):
Parameters
----------
func : function
combine : (default: True), You may pass in a combine=True argument to get back
Author

should be combine=False

@ghost
Author

ghost commented Dec 19, 2013

like #5655 I'm hesitant because this jumps through hoops to compensate for a fundamentally problematic
choice that's too established to correct (opinionated apply - meh, but capturing all kwds to prevent extensions
to the signature is nasty).

Still, I think this is solid and i'm for brushing off the dust and merging in 0.14.
Perhaps a nice synergy with #4059 (comment) if it makes it in as well.

@ghost ghost closed this Jan 26, 2014
@ghost ghost deleted the feature/groupby_apply_raw_mode branch January 26, 2014 21:32
This pull request was closed.