Skip to content

BUG: GroupBy.count() with float32 data type does not exclude nan #8171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 6, 2014
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/v0.15.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -669,3 +669,4 @@ Bug Fixes
was a tuple (:issue:`8121`).

- Bug with kde plot and NaNs (:issue:`8182`)
- Bug in ``GroupBy.count`` with float32 data type were nan values were not excluded (:issue:`8169`).
12 changes: 5 additions & 7 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -147,10 +147,7 @@ def _last(x):


def _count_compat(x, axis=0):
try:
return x.size
except:
return x.count()
return x.count() # .size != .count(); count excludes nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems strange that this used to try size first (I thought I commented earlier but guess not), how did this work before??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only gets hit (the compat) if we don't have the cython defined (eg it's not object/int64/float64) - so rarely gets hit (maybe just should always upcast) instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, have a feeling this exception would never raise (if it did wouldn't there be inf. recursion?? as count calls _count_compat...), I don't follow this code at all (need tests + coverage!).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens is that this line produces completely valid result, but with float32 type, which causes type error in this line which expects float64, but the catch phrase also throws because float32 array is not of type object either. i.e. neither of lib.row_bool_subset and lib.row_bool_subset_object can handle an array of type float32, and even though the result array is already correct, there will be an exception here.

This ultimately falls back on _count_compat which produces the wrong result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behzadnouri just for completeness: I was asking if you had just changed this line what would happen? (I'll just try it - I have a few pandas PRs to do anyway!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hayd if i had just changed that line, the result would be correct, but recalculated twice, once in here and once in _count_compat


class Grouper(object):
"""
Expand Down Expand Up @@ -1527,14 +1524,15 @@ def aggregate(self, values, how, axis=0):

result = self._aggregate(result, counts, values, how, is_numeric)

if self._filter_empty_groups:
if self._filter_empty_groups and not counts.all():
if result.ndim == 2:
try:
result = lib.row_bool_subset(
result, (counts > 0).view(np.uint8))
except ValueError:
result = lib.row_bool_subset_object(
result, (counts > 0).view(np.uint8))
com._ensure_object(result),
(counts > 0).view(np.uint8))
else:
result = result[counts > 0]

Expand Down Expand Up @@ -2477,7 +2475,7 @@ def _cython_agg_blocks(self, how, numeric_only=True):
values = block._try_operate(block.values)

if block.is_numeric:
values = com.ensure_float(values)
values = _algos.ensure_float64(values)

result, _ = self.grouper.aggregate(values, how, axis=agg_axis)

Expand Down
14 changes: 14 additions & 0 deletions pandas/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2214,6 +2214,20 @@ def test_count_object(self):
expected = pd.Series([1, 3], index=[2, 3], name='a')
tm.assert_series_equal(result, expected)

def test_count_cross_type(self): # GH8169
vals = np.hstack((np.random.randint(0,5,(100,2)),
np.random.randint(0,2,(100,2))))

df = pd.DataFrame(vals, columns=['a', 'b', 'c', 'd'])
df[df==2] = np.nan
expected = df.groupby(['c', 'd']).count()

for t in ['float32', 'object']:
df['a'] = df['a'].astype(t)
df['b'] = df['b'].astype(t)
result = df.groupby(['c', 'd']).count()
tm.assert_frame_equal(result, expected)

def test_non_cython_api(self):

# GH5610
Expand Down