BUG: Problem with count aggregation of a boolean column #3752
Comments
This is a 'side' effect of pandas trying to coerce the output of groupby back to the same data type as the input (if possible); since you are using a user-defined function, it is impossible to disambiguate this case. You are, after all, grouping on a boolean column. I agree this is a somewhat degenerate case, and I suppose we maybe should not coerce on a boolean column at all (in a groupby). Can you give me some more context on what you are trying to do? Is the following what you are after?
The main issue I am running into is that I am doing multiple aggregations, similar to the following. (Note: I did not compile or run this. If you want me to make sure this is a working example, let me know.)

from collections import OrderedDict, namedtuple

import numpy as np
from pandas import DataFrame, Series

MyInputTuple = namedtuple('MyInputTuple', 'attr_0, attr_1, attr_2, success, value_average')
MyOutputTuple = namedtuple('MyOutputTuple', 'attr_1, attr_2, num_tests, num_failed, min_value_avg, max_value_avg, avg_value_avg')

data_frame = DataFrame.from_records([MyInputTuple(0, 1, 2, True, 4.7)], columns=MyInputTuple._fields)

# Group on the three attributes and aggregate two columns, each with several named functions.
result = (data_frame.groupby(['attr_0', 'attr_1', 'attr_2'], as_index=False)
          .agg(OrderedDict([
              ('success', OrderedDict([
                  ('num_tests', Series.count),
                  ('num_failed', lambda x: x.count() - np.count_nonzero(x)),
              ])),
              ('value_average', OrderedDict([
                  ('min', np.min),
                  ('max', np.max),
                  ('avg', np.mean),
              ])),
          ])))

# Generator wrapper (the name is illustrative) so that the yield is valid.
def iter_output(result):
    for row in result.itertuples():
        attr_1 = row[1]
        output_tuple = row[2:]
        yield attr_1, MyOutputTuple._make(output_tuple)
I would just or you could easily handle this in a single function, e.g.
easier to read/understand too, my 2c
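As a rough illustration of the "single function" idea, here is a minimal sketch that computes all the statistics for a group in one function and returns them as a Series. It is not the snippet originally posted in this comment; the frame, the column values, and the helper name `summarize` are assumptions made for the example, and it groups on one attribute instead of three to stay short.

```python
import pandas as pd

# Hypothetical data: group 0 has two tests (one failed), group 1 has one test.
df = pd.DataFrame({
    "attr_0": [0, 0, 1],
    "success": [True, False, True],
    "value_average": [4.7, 5.3, 6.0],
})


def summarize(group):
    """Hypothetical helper: compute every statistic for one group at once."""
    success = group["success"]
    values = group["value_average"]
    return pd.Series({
        "num_tests": len(success),
        "num_failed": len(success) - int(success.sum()),
        "min": values.min(),
        "max": values.max(),
        "avg": values.mean(),
    })


# Select the value columns explicitly so the grouping key is not passed to the helper.
summary = df.groupby("attr_0")[["success", "value_average"]].apply(summarize)
print(summary)
```

Because each group returns one Series mixing integers and floats, the counts come back as floats here; cast them afterwards if integer columns are needed.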
closing in favor of #7001
The issue I am having is that in pandas 0.10.0, a count of a boolean column with a single item in it would be 1, while a count of multiple items would of course be the number of items. In 0.11.0, the result is now True if there is a single item, and the count otherwise.
Result in 0.10.0:
In 0.11.0, this is now the result:
Test Code:
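The reporter's test code is not preserved in this copy. As a stand-in, the sketch below shows the kind of check described above: counting a boolean column per group where one group contains a single row. The frame, the column names, and the explicit `astype("int64")` are assumptions for illustration; the cast is a defensive step so the counts keep an integer dtype even if a given pandas version coerces an aggregation result back to the input dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" has a single row, group "b" has two rows.
df = pd.DataFrame({
    "key": ["a", "b", "b"],
    "success": [True, True, False],
})

grouped = df.groupby("key")["success"]

# Built-in count: number of non-null entries per group.
num_tests = grouped.count().astype("int64")

# User-defined aggregation: failed tests per group. The explicit cast keeps
# the result an integer count regardless of any coercion to bool.
num_failed = grouped.agg(
    lambda s: int(s.count() - np.count_nonzero(s.to_numpy()))
).astype("int64")

print(num_tests)
print(num_failed)
```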