BUG: Fix segfault in GroupBy.count and DataFrame.count #32842

Merged: 14 commits, Apr 4, 2020
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v1.1.0.rst
@@ -271,7 +271,7 @@ Numeric
 - Bug in :meth:`to_numeric` with string argument ``"uint64"`` and ``errors="coerce"`` silently fails (:issue:`32394`)
 - Bug in :meth:`to_numeric` with ``downcast="unsigned"`` fails for empty data (:issue:`32493`)
 - Bug in :meth:`DataFrame.mean` with ``numeric_only=False`` and either ``datetime64`` dtype or ``PeriodDtype`` column incorrectly raising ``TypeError`` (:issue:`32426`)
--
+- Bug in :meth:`DataFrame.count` with ``level="foo"`` and index level ``"foo"`` containing NaNs causing a segmentation fault (:issue:`21824`)

 Conversion
 ^^^^^^^^^^
@@ -361,6 +361,7 @@ Groupby/resample/rolling

 - Bug in :meth:`GroupBy.apply` raises ``ValueError`` when the ``by`` axis is not sorted and has duplicates and the applied ``func`` does not mutate passed in objects (:issue:`30667`)
 - Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
+- Bug in :meth:`GroupBy.count` causing a segmentation fault when the grouped-by column contains NaNs (:issue:`32841`)

 Reshaping
 ^^^^^^^^^
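For context, here is a minimal reproduction of both crashes, distilled from the tests added later in this PR (before this fix, each call below could segfault):

import numpy as np
import pandas as pd

# GH 32841: GroupBy.count when the grouped-by column contains NaN
df = pd.DataFrame({"A": [1, 1, 1, 1, 1], "B": [5, 4, np.nan, 3, 0]})
df.groupby(["B"]).count()  # the NaN row is simply dropped from the groups

# GH 21824: DataFrame.count(level=...) when the counted index level contains NaN
df = pd.DataFrame(
    {
        "Person": ["John", "Myla", None, "John", "Myla"],
        "Age": [24.0, 5, 21.0, 33, 26],
        "Single": [False, True, True, True, False],
    }
)
df.set_index(["Person", "Single"]).count(level="Person")  # rows with a NaN level are excluded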
6 changes: 4 additions & 2 deletions pandas/_libs/lib.pyx
@@ -793,14 +793,16 @@ def count_level_2d(ndarray[uint8_t, ndim=2, cast=True] mask,
         with nogil:
             for i in range(n):
                 for j in range(k):
-                    counts[labels[i], j] += mask[i, j]
+                    if mask[i, j]:
+                        counts[labels[i], j] += 1

     else:  # axis == 1
         counts = np.zeros((n, max_bin), dtype='i8')
         with nogil:
             for i in range(n):
                 for j in range(k):
-                    counts[i, labels[j]] += mask[i, j]
+                    if mask[i, j]:
+                        counts[i, labels[j]] += 1

     return counts
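Why the guard fixes the crash: missing values in the level are factorized to the sentinel label -1, and this loop runs with Cython bounds checking and wraparound disabled, so the old unconditional counts[labels[i], j] += mask[i, j] still indexed counts with -1 for those rows, an out-of-bounds write even when it only added 0. A pure-Python sketch of the axis=0 branch (the function name is ours, for illustration; it assumes the caller zeroes the mask wherever the label is -1, which the frame.py change below guarantees):

import numpy as np

def count_level_2d_sketch(mask, labels, max_bin):
    # Plain-Python sketch of the axis=0 branch of lib.count_level_2d.
    # Assumption: mask[i, :] is all False whenever labels[i] == -1
    # (a NaN level value), so counts is never indexed with -1.
    n, k = mask.shape
    counts = np.zeros((max_bin, k), dtype="i8")
    for i in range(n):
        for j in range(k):
            if mask[i, j]:  # the fix: skip the cell instead of adding 0
                counts[labels[i], j] += 1
    return counts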

28 changes: 12 additions & 16 deletions pandas/core/frame.py
@@ -7820,34 +7820,30 @@ def _count_level(self, level, axis=0, numeric_only=False):
f"Can only count levels on hierarchical {self._get_axis_name(axis)}."
)

if frame._is_mixed_type:
# Since we have mixed types, calling notna(frame.values) might
# upcast everything to object
mask = notna(frame).values
else:
# But use the speedup when we have homogeneous dtypes
mask = notna(frame.values)
# Mask NaNs: Mask rows or columns where the index level is NaN, and all
# values in the DataFrame that are NaN
values_mask = notna(frame.values)
Member:
is the existing version with notna(frame).values wrong? avoiding .values with is_mixed_type is worthwhile

Contributor Author:
@jbrockmendel I understood @jreback's earlier comment to mean that notna(frame.values) and notna(frame).values are the same.

Member:
those two are in fact the same, but performance should be better for notna(frame).values

Contributor Author:
Running some tests shows that notna(df).values is indeed much faster on larger mixed-type DataFrames. I will restore the original code. I guess this optimisation would be better placed in notna.

In [1]: import pandas as pd                                                                                                                                                                   
In [2]: from pandas.core.dtypes.missing import notna                                                                                                                               
In [3]: df = pd.DataFrame({'a': range(10), 'b': ['foo'] * 10})                                                                                                                                     
In [4]: %timeit notna(df.values)                                                                                                                                                              
113 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit notna(df).values                                                                                                                                                              
546 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: df = pd.DataFrame({'a': range(1000000), 'b': ['foo'] * 1000000})                                                                                                                      
In [7]: %timeit notna(df.values)                                                                                                                                                              
163 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit notna(df).values                                                                                                                                                              
40.7 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit notna(df.values)                                                                                                                                                              
158 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit notna(df).values                                                                                                                                                             
39.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
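A small illustration of the dtype effect behind these timings (the usual explanation: mixed-dtype .values materializes one object array, while notna(df) works column by column on native dtypes):

import pandas as pd
from pandas.core.dtypes.missing import notna

df = pd.DataFrame({"a": range(10), "b": ["foo"] * 10})

# Mixed dtypes force .values to a single object array, which notna()
# then has to check elementwise ...
print(df.values.dtype)         # object
# ... whereas notna(df) evaluates per column and only the resulting
# boolean frame is exported as one array.
print(notna(df).values.dtype)  # bool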


+        index_mask = notna(count_axis.get_level_values(level=level))
         if axis == 1:
-            # We're transposing the mask rather than frame to avoid potential
-            # upcasts to object, which induces a ~20x slowdown
-            mask = mask.T
+            mask = index_mask & values_mask
+        else:
+            mask = index_mask.reshape(-1, 1) & values_mask

         if isinstance(level, str):
             level = count_axis._get_level_number(level)

         level_name = count_axis._names[level]
         level_index = count_axis.levels[level]._shallow_copy(name=level_name)
         level_codes = ensure_int64(count_axis.codes[level])
-        counts = lib.count_level_2d(mask, level_codes, len(level_index), axis=0)
-
-        result = DataFrame(counts, index=level_index, columns=agg_axis)
+        counts = lib.count_level_2d(mask, level_codes, len(level_index), axis=axis)

         if axis == 1:
-            # Undo our earlier transpose
-            return result.T
+            result = DataFrame(counts, index=agg_axis, columns=level_index)
         else:
-            return result
+            result = DataFrame(counts, index=level_index, columns=agg_axis)
+
+        return result
Contributor Author:
I amended the masking logic in this function to mask all rows/columns that lib.count_level_2d requires to be masked, to avoid out-of-bounds access.
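A sketch of what the amended masking does on a tiny frame (hypothetical example data; the point is that a row whose level value is NaN factorizes to code -1 and must contribute nothing):

import numpy as np
import pandas as pd
from pandas.core.dtypes.missing import notna

idx = pd.MultiIndex.from_arrays([["a", None, "b"], [0, 1, 2]], names=["key", "pos"])
df = pd.DataFrame({"value": [1.0, np.nan, 3.0]}, index=idx)

values_mask = notna(df.values)                   # per-cell validity, shape (3, 1)
index_mask = notna(idx.get_level_values("key"))  # per-row validity of the level
# idx.codes[0] is [0, -1, 1]; the None row must be zeroed out entirely
# before the mask reaches lib.count_level_2d:
mask = index_mask.reshape(-1, 1) & values_mask   # the axis=0 branch
print(mask.ravel())                              # [ True False  True]

Passing axis through to count_level_2d (instead of transposing the mask and then the result) is what lets the axis=1 path build the result frame directly.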


     def _reduce(
         self, op, name, axis=0, skipna=True, numeric_only=None, filter_type=None, **kwds
11 changes: 10 additions & 1 deletion pandas/tests/groupby/test_counting.py
@@ -3,7 +3,7 @@
 import numpy as np
 import pytest

-from pandas import DataFrame, MultiIndex, Period, Series, Timedelta, Timestamp
+from pandas import DataFrame, Index, MultiIndex, Period, Series, Timedelta, Timestamp
 import pandas._testing as tm


@@ -220,3 +220,12 @@ def test_count_with_only_nans_in_first_group(self):
         mi = MultiIndex(levels=[[], ["a", "b"]], codes=[[], []], names=["A", "B"])
         expected = Series([], index=mi, dtype=np.int64, name="C")
         tm.assert_series_equal(result, expected, check_index_type=False)
+
+    def test_count_groupby_column_with_nan_in_groupby_column(self):
+        # https://github.com/pandas-dev/pandas/issues/32841
+        df = DataFrame({"A": [1, 1, 1, 1, 1], "B": [5, 4, np.NaN, 3, 0]})
+        res = df.groupby(["B"]).count()
+        expected = DataFrame(
+            index=Index([0.0, 3.0, 4.0, 5.0], name="B"), data={"A": [1, 1, 1, 1]}
+        )
+        tm.assert_frame_equal(expected, res)
28 changes: 28 additions & 0 deletions pandas/tests/test_multilevel.py
@@ -248,6 +248,34 @@ def _check_counts(frame, axis=0):
         result = self.frame.count(level=0, numeric_only=True)
         tm.assert_index_equal(result.columns, Index(list("ABC"), name="exp"))

+    def test_count_index_with_nan(self):
+        # https://github.com/pandas-dev/pandas/issues/21824
+        df = DataFrame(
+            {
+                "Person": ["John", "Myla", None, "John", "Myla"],
+                "Age": [24.0, 5, 21.0, 33, 26],
+                "Single": [False, True, True, True, False],
+            }
+        )
+
+        # count on row labels
+        res = df.set_index(["Person", "Single"]).count(level="Person")
+        expected = DataFrame(
+            index=Index(["John", "Myla"], name="Person"),
+            columns=Index(["Age"]),
+            data=[2, 2],
+        )
+        tm.assert_frame_equal(res, expected)
+
+        # count on column labels
+        res = df.set_index(["Person", "Single"]).T.count(level="Person", axis=1)
+        expected = DataFrame(
+            columns=Index(["John", "Myla"], name="Person"),
+            index=Index(["Age"]),
+            data=[[2, 2]],
+        )
+        tm.assert_frame_equal(res, expected)
+
     def test_count_level_series(self):
         index = MultiIndex(
             levels=[["foo", "bar", "baz"], ["one", "two", "three", "four"]],