PERF: improves performance in SeriesGroupBy.count #10946

Closed · wants to merge 1 commit
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.17.0.txt
@@ -814,6 +814,8 @@ Bug Fixes
- Bug in ``BinGrouper.group_info`` where returned values are not compatible with base class (:issue:`10914`)
- Bug in clearing the cache on ``DataFrame.pop`` and a subsequent inplace op (:issue:`10912`)
- Bug in indexing with a mixed-integer ``Index`` causing an ``ImportError`` (:issue:`10610`)
- Bug in ``Series.count`` when index has nulls (:issue:`10946`)

- Bug causing ``DataFrame.where`` to not respect the ``axis`` parameter when the frame has a symmetric shape. (:issue:`9736`)

- Bug in ``Table.select_column`` where name is not preserved (:issue:`10392`)
9 changes: 9 additions & 0 deletions pandas/core/groupby.py
@@ -2684,6 +2684,15 @@ def value_counts(self, normalize=False, sort=True, ascending=False,

        return Series(out, index=mi)

    def count(self):
        ids, _, ngroups = self.grouper.group_info
        val = self.obj.get_values()

        mask = (ids != -1) & ~isnull(val)
        out = np.bincount(ids[mask], minlength=ngroups) if ngroups != 0 else []

        return Series(out, index=self.grouper.result_index, name=self.name)

    def _apply_to_column_groupbys(self, func):
        """ return a pass thru """
        return func(self)
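
As context for the diff above: the new SeriesGroupBy.count avoids any Python-level loop by combining the integer group ids with a null mask and handing the tally to np.bincount. A minimal standalone sketch of that idea, using illustrative names and plain NumPy rather than the pandas internals:

import numpy as np

def group_count(ids, values, ngroups):
    # ids: one integer group label per row; -1 marks rows outside any group
    # values: the data being counted; NaN entries are excluded from the count
    mask = (ids != -1) & ~np.isnan(values)
    # bincount tallies how many surviving rows fall into each group id;
    # minlength keeps trailing empty groups in the output
    return np.bincount(ids[mask], minlength=ngroups)

ids = np.array([0, 0, 1, 1, 2, -1])
values = np.array([1.0, np.nan, 2.0, 3.0, np.nan, 4.0])
print(group_count(ids, values, ngroups=3))  # -> [1 2 0]
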
29 changes: 13 additions & 16 deletions pandas/core/series.py
@@ -1117,27 +1117,24 @@ def count(self, level=None):
-------
nobs : int or Series (if level specified)
"""
if level is not None:
mask = notnull(self.values)
from pandas.core.index import _get_na_value

if isinstance(level, compat.string_types):
level = self.index._get_level_number(level)
if level is None:
return notnull(_values_from_object(self)).sum()

level_index = self.index.levels[level]
if isinstance(level, compat.string_types):
level = self.index._get_level_number(level)

if len(self) == 0:
return self._constructor(0, index=level_index)\
.__finalize__(self)
lev = self.index.levels[level]
lab = np.array(self.index.labels[level], subok=False, copy=True)

# call cython function
max_bin = len(level_index)
labels = com._ensure_int64(self.index.labels[level])
counts = lib.count_level_1d(mask.view(np.uint8),
labels, max_bin)
return self._constructor(counts,
index=level_index).__finalize__(self)
mask = lab == -1
if mask.any():
lab[mask] = cnt = len(lev)
Contributor:

rather than adding this almost identical code, doesn't it make sense to:

return self.groupby(level=level).count().__finalize__(self) ?

for the level not None case

Contributor Author:

this is just removing the bug from the current implementation. I don't think having an optional level argument would be useful here, as it is basically equivalent to a groupby on the level. That said, groupby removes nulls from keys:

In [5]: ts
Out[5]: 
a  1      0
   2      1
b  2      2
   NaN    3
c  1      4
   2      5
dtype: int64

In [6]: ts.groupby(level=1).count()
Out[6]: 
1    2
2    3
dtype: int64

Contributor:

not sure I understand. The expression I gave above is equivalent to what you wrote, yes? It is just dispatching to the groupby impl (which is how all of the other stat functions that accept level work).

Contributor Author:

> the expression I gave above is equivalent to what you wrote, yes?

Yes, if the index does not include nan.

Contributor:

my point is: why do we have a special-case implementation when simply using s.groupby(level=level).count() is acceptable?

And this is what all of the other make_stat_* functions do.

Since this was a bug, nothing is even being eliminated.

lev = lev.insert(cnt, _get_na_value(lev.dtype.type))

return notnull(_values_from_object(self)).sum()
out = np.bincount(lab[notnull(self.values)], minlength=len(lev))
return self._constructor(out, index=lev).__finalize__(self)

def mode(self):
"""Returns the mode(s) of the dataset.
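
The subtle part of the new level-aware Series.count path is how missing index labels are handled: rows whose label code is -1 (i.e. NaN in that level) are remapped to one extra bin appended to the level values, so the NaN key survives in the result instead of being dropped the way a plain groupby would drop it. A small pure-NumPy sketch of that remapping, with illustrative names (not the pandas code itself):

import numpy as np

def count_by_level(codes, level_values, data):
    # codes: integer positions into level_values; -1 means the label is missing (NaN)
    # data: the series values; NaN entries are not counted
    codes = codes.copy()
    keys = list(level_values)
    missing = codes == -1
    if missing.any():
        # give the missing label its own bin at the end, keyed by NaN
        codes[missing] = len(keys)
        keys.append(np.nan)
    counts = np.bincount(codes[~np.isnan(data)], minlength=len(keys))
    return dict(zip(keys, counts.tolist()))

# second level of the index used in the new test: [1, 2, 2, nan, 1, 2]
codes = np.array([0, 1, 1, -1, 0, 1])
data = np.arange(6, dtype=float)
print(count_by_level(codes, [1, 2], data))  # -> {1: 2, 2: 3, nan: 1}

This matches the expected result in the new test below, Series([2, 3, 1], index=[1, 2, nan]).
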
17 changes: 0 additions & 17 deletions pandas/lib.pyx
@@ -1253,23 +1253,6 @@ def lookup_values(ndarray[object] values, dict mapping):
return maybe_convert_objects(result)


def count_level_1d(ndarray[uint8_t, cast=True] mask,
                   ndarray[int64_t] labels, Py_ssize_t max_bin):
    cdef:
        Py_ssize_t i, n
        ndarray[int64_t] counts

    counts = np.zeros(max_bin, dtype='i8')

    n = len(mask)

    for i from 0 <= i < n:
        if mask[i]:
            counts[labels[i]] += 1

    return counts


def count_level_2d(ndarray[uint8_t, ndim=2, cast=True] mask,
                   ndarray[int64_t] labels, Py_ssize_t max_bin):
    cdef:
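
With the bincount-based paths above, Series.count no longer needs this Cython helper: its whole job, counting the non-null rows that fall into each of max_bin bins, collapses into a single np.bincount call. A rough NumPy equivalent, shown only for comparison (not part of the patch):

import numpy as np

def count_level_1d_equivalent(mask, labels, max_bin):
    # mask: True where the value is non-null; labels: bin index per row in [0, max_bin)
    return np.bincount(labels[mask], minlength=max_bin)

mask = np.array([True, False, True, True])
labels = np.array([0, 0, 1, 1], dtype=np.int64)
print(count_level_1d_equivalent(mask, labels, max_bin=3))  # -> [1 2 0]
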
10 changes: 10 additions & 0 deletions pandas/tests/test_series.py
@@ -4740,6 +4740,16 @@ def test_count(self):

        self.assertEqual(self.ts.count(), np.isfinite(self.ts).sum())

        mi = MultiIndex.from_arrays([list('aabbcc'), [1, 2, 2, nan, 1, 2]])
        ts = Series(np.arange(len(mi)), index=mi)

        left = ts.count(level=1)
        right = Series([2, 3, 1], index=[1, 2, nan])
        assert_series_equal(left, right)

        ts.iloc[[0, 3, 5]] = nan
        assert_series_equal(ts.count(level=1), right - 1)

    def test_dtype(self):

        self.assertEqual(self.ts.dtype, np.dtype('float64'))