Skip to content

BUG: DataFrame.agg({'col': 'size'}) not working #16405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
chmp opened this issue May 21, 2017 · 4 comments · Fixed by #16458
Closed

BUG: DataFrame.agg({'col': 'size'}) not working #16405

chmp opened this issue May 21, 2017 · 4 comments · Fixed by #16458
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@chmp
Copy link

chmp commented May 21, 2017

size aggregation does not seem to work properly when used with ungrouped dataframes. When replacing size with count the examples run through. However, this change null semantics, counting nulls (size) in contrast to skipping nulls (count). The examples below are for Pandas version 0.20.1.

When used with .groupby(...), size aggregation works as expected:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).groupby('g').agg({'v': 'size'})
   v
g   
0  2
1  1

Without .groupby, it raises an recursion error:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).agg({'v': 'size'})
---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-15-5fc5c38cdb5f> in <module>()
----> 1 pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, 3]}).agg({'v': 'size'})

/Volumes/Home/venvs/py3-data-science/lib/python3.5/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
   4250                 pass
   4251         if result is None:
-> 4252             return self.apply(func, axis=axis, args=args, **kwargs)
   4253         return result
   4254 

/Volumes/Home/venvs/py3-data-science/lib/python3.5/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4322         # dispatch to agg
   4323         if axis == 0 and isinstance(func, (list, dict)):
-> 4324             return self.aggregate(func, axis=axis, *args, **kwds)
   4325 
   4326         if len(self.columns) == 0 and len(self.index) == 0:

... last 2 frames repeated, from the frame below ...

/Volumes/Home/venvs/py3-data-science/lib/python3.5/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
   4250                 pass
   4251         if result is None:
-> 4252             return self.apply(func, axis=axis, args=args, **kwargs)
   4253         return result
   4254 

RecursionError: maximum recursion depth exceeded
@jreback
Copy link
Contributor

jreback commented May 21, 2017

hmm, that doesn't seem friendly. I think the issue is that size needs to be special cases in .agg because its a property on a Series/DataFarme and a function in .groupby.

@jreback jreback added Bug Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 21, 2017
@jreback jreback added this to the Next Major Release milestone May 21, 2017
@jreback jreback changed the title DataFrame.agg({'col': 'size'}) not working BUG: DataFrame.agg({'col': 'size'}) not working May 21, 2017
@pvomelveny
Copy link
Contributor

pvomelveny commented May 22, 2017

Hey there, I started looking into this as part of the PyCon sprint.

Using the example:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).agg({'v': 'size'})

I tracked down where the error that gets passed by is. In pandas/core/base.py:

def _try_aggregate_string_function(self, arg, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, compat.string_types)

        f = getattr(self, arg, None)
        if f is not None:
            return f(*args, **kwargs)

        f = getattr(np, arg, None)
        if f is not None:
            return f(self, *args, **kwargs)

        raise ValueError("{} is an unknown string function".format(arg))

The arg in this case is 'size', which is an actual attribute of the Series, just not, as expected, a callable one. A TypeError then bubbles up back to the original pandas/core/frame.py's aggregate, which is caught and passed over. Frame.aggregate then calls Frame.apply, which immediately deploys back to Frame.aggregate (and thus, our eventual recursion error):

 def apply(self, func, axis=0, broadcast=False, raw=False, reduce=None,
              args=(), **kwds):
        """
        Applies function along input axis of DataFrame.

        ...

        """
        axis = self._get_axis_number(axis)
        ignore_failures = kwds.pop('ignore_failures', False)

        # dispatch to agg
        if axis == 0 and isinstance(func, (list, dict)):
            return self.aggregate(func, axis=axis, *args, **kwds)

        ...

It seems like a good place to cut this off may be either _try_aggregate_string_function in base.py or in aggregate in series.py. Any executive decisions on this @TomAugspurger ? Left entierly to my own devices I'd probably do something along the lines:

def _try_aggregate_string_function(self, arg, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, compat.string_types)

        f = getattr(self, arg, None)
        if f is not None:
            return f(*args, **kwargs) if callable(f) else f  # Do people still hate the ternary operator?

        ...

But since this is actually in the base class, there could be some unforeseen consequences of this I can't see? (And barring those consequences, should this should just be restricted to calls of 'size'? Otherwise people could do potentially-unexpected-but-also-kinda-cool things like call df.agg({v:'ndim', w:'name', x:'data'}))

Appreciate any thoughts, thanks!

@jreback
Copy link
Contributor

jreback commented May 22, 2017

@pvomelveny what you did is reasonable, but let's write it out (yes hate the ternary operator :>)

maybe

if f is not None:
    if callable(f):
        return f(*args, **kwargs)
   assert len(args) == 0 and len(kwargs) == 0
   return f

and see if things pass (and add a comment about why doing this).

@TomAugspurger
Copy link
Contributor

@pvomelveny can you submit a pull request with those changes? I'm curious to see if the tests pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants