BUG: DataFrame.agg - why numpy.size doesn't work? #42203

Open
1 task
aliceliu9988 opened this issue Jun 24, 2021 · 7 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug

Comments

@aliceliu9988

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  columns=['A', 'B', 'C'])

# String aliases work as expected
df.agg({'A': ['mean', 'std', 'size']})

# Somehow this just doesn't work with DataFrame.agg but works with DataFrameGroupBy.agg
df.agg({'A': [np.mean, np.std, np.size]})

Problem description

Intuitively, I assumed df.agg({'A':[np.mean,np.std,np.size]}) should work as df.agg({'A':['mean','std','size']}) does, but it doesn't. I wonder why? Looked through docs like the below but still didn't get it:

Expected Output

A

4.0
3.0
4.0

Output of df.agg({'A':[np.mean,np.std,np.size]})


TypeError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\base.py in _aggregate_multiple_funcs(self, arg, _axis)
553 try:
--> 554 return concat(results, keys=keys, axis=1, sort=False)
555 except TypeError:

~\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
280 copy=copy,
--> 281 sort=sort,
282 )

~\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
356 )
--> 357 raise TypeError(msg)
358

TypeError: cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
1 import numpy as np
----> 2 df.agg({'A':[np.mean,np.std,np.size]})

~\anaconda3\lib\site-packages\pandas\core\frame.py in aggregate(self, func, axis, *args, **kwargs)
6704 result = None
6705 try:
-> 6706 result, how = self._aggregate(func, axis=axis, *args, **kwargs)
6707 except TypeError:
6708 pass

~\anaconda3\lib\site-packages\pandas\core\frame.py in _aggregate(self, arg, axis, *args, **kwargs)
6718 result = result.T if result is not None else result
6719 return result, how
-> 6720 return super()._aggregate(arg, *args, **kwargs)
6721
6722 agg = aggregate

~\anaconda3\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
426
427 try:
--> 428 result = _agg(arg, _agg_1dim)
429 except SpecificationError:
430

~\anaconda3\lib\site-packages\pandas\core\base.py in _agg(arg, func)
393 result = {}
394 for fname, agg_how in arg.items():
--> 395 result[fname] = func(fname, agg_how)
396 return result
397

~\anaconda3\lib\site-packages\pandas\core\base.py in _agg_1dim(name, how, subset)
377 "nested dictionary is ambiguous in aggregation"
378 )
--> 379 return colg.aggregate(how)
380
381 def _agg_2dim(name, how):

~\anaconda3\lib\site-packages\pandas\core\series.py in aggregate(self, func, axis, *args, **kwargs)
3686 # Validate the axis parameter
3687 self._get_axis_number(axis)
-> 3688 result, how = self._aggregate(func, *args, **kwargs)
3689 if result is None:
3690

~\anaconda3\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
484 elif is_list_like(arg):
485 # we require a list, but not an 'str'
--> 486 return self._aggregate_multiple_funcs(arg, _axis=_axis), None
487 else:
488 result = None

~\anaconda3\lib\site-packages\pandas\core\base.py in _aggregate_multiple_funcs(self, arg, _axis)
562 result = Series(results, index=keys, name=self.name)
563 if is_nested_object(result):
--> 564 raise ValueError("cannot combine transform and aggregation operations")
565 return result
566

ValueError: cannot combine transform and aggregation operations

@aliceliu9988 aliceliu9988 added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jun 24, 2021
@attack68
Contributor

Actually this works:

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  columns=['A', 'B', 'C'])

df.agg({'A':['mean','std']})
df.agg({'A':[np.mean,np.std]})

The only thing relevant to your issue is:

df.agg({'A':['size']})
df.agg({'A':[np.size]})

@attack68 attack68 added the Apply (Apply, Aggregate, Transform, Map) label and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Jun 24, 2021
@attack68
Contributor

@aliceliu9988 can you please add a descriptive title for the issue?

@aliceliu9988 aliceliu9988 changed the title from "BUG:" to "BUG: DataFrame.agg - why numpy.size doesn't work?" Jun 24, 2021
@aliceliu9988
Author

aliceliu9988 commented Jun 24, 2021

Hi @attack68,

Thanks for helping pinpoint the problem and reminding me to update the question title.

I tried this: print("df.agg({'A': [np.size]}) is :",df.agg({'A':[np.size]}))

It did go through, but the output is not the row count of column A (3), but this:

df.agg({'A': [np.size]}) is : A
size
0 1
1 1
2 1
3 1

I hope someone knows why.

@rhshadrach
Member

Internally, np.size is evaluated on a Series. For a UDF, .agg uses .apply which then operates row-by-row. This makes the result a transform, and the others being aggregations caused the error mentioned in the OP.

.agg using .apply being undesirable is mentioned as part of #41112, but I think this issue should stay open.
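
The element-by-element behaviour described above can be seen without going through .agg at all. The following is only a minimal sketch: the Series s is a stand-in for column A of the example frame, and the exact internal path .agg takes may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Stand-in for column A of the example DataFrame
s = pd.Series([1, 4, 7], name="A")

# Called on the whole Series, np.size counts its elements.
whole = np.size(s)

# Series.apply calls a plain (non-ufunc) function once per element.
# np.size of a scalar is 1, so the result is as long as the input:
# a transform rather than a reduction.
per_element = s.apply(np.size)

print(whole)                 # 3
print(per_element.tolist())  # [1, 1, 1]
```

A list of functions in which some reduce to a scalar and one returns a full-length Series is what triggers "cannot combine transform and aggregation operations".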

@aliceliu9988
Author

Hi Richard @rhshadrach,

Thanks for your explanation. I am a beginner in Python, so I do appreciate your hints and will look more into the difference between 'transform' and 'aggregation'. Ah... I have to say as a beginner I didn't expect the syntax to behave inconsistently like this.

@rhshadrach
Member

@aliceliu9988

I have to say as a beginner I didn't expect the syntax to behave inconsistently like this.

I realize now I wasn't very clear, but I was trying to say the same thing! Thanks for raising this issue.

@kwhkim
Contributor

kwhkim commented Aug 30, 2021

Wow, this looks serious. I have another example.

>>> df.agg(np.size)
A    3
B    3
C    3
dtype: int64
>>> df.agg({'A':np.size})
   A
0  1
1  1
2  1

so df.agg({'A':}) is more like df.A.agg()?

>>> df.A.agg(np.size)
0    1
1    1
2    1
Name: A, dtype: int64

It gets weirder

>>> df.groupby([1]*len(df)).agg({'A':np.size})  # [1]*len(df) puts every row into a single group labelled 1
   A
1  3
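
For anyone who just needs the element count, a few spellings avoid passing np.size to .agg. This is a sketch rather than a fix for the inconsistency; it reuses the example frame from the original report and the string aliases that the report says already work.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  columns=['A', 'B', 'C'])

# Plain attribute and built-in: no dispatch through .agg involved
n_attr = df['A'].size
n_len = len(df['A'])

# String aliases, as in the first call of the original report
summary = df.agg({'A': ['mean', 'std', 'size']})

print(n_attr, n_len)
print(summary)
```

The string alias 'size' is resolved by pandas itself, whereas np.size is passed around as a user-defined function, which is where the two code paths diverge.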

4 participants