
PERF: improves performance in SeriesGroupBy.count #10946


Closed
behzadnouri wants to merge 1 commit from the grby-count branch

Conversation

behzadnouri (Contributor):

BUG: closes bug in Series.count when index has nulls

In [4]: ts
Out[4]:
a  1      0
   2      1
b  2      2
   NaN    3
c  1      4
   2      5
dtype: int64

In [5]: ts.count(level=1)
Out[5]:
1    2
2    4          # <<< BUG!
dtype: int64
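
The construction of ts is not shown in the report; below is a minimal sketch that reproduces the example, assuming a two-level MultiIndex with a NaN label (Series.count(level=...) was the API at the time, later deprecated in favor of groupby):

import numpy as np
import pandas as pd

# Two-level MultiIndex with a NaN label in the second level
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 2), ('b', np.nan), ('c', 1), ('c', 2)])
ts = pd.Series(range(6), index=idx)

# Before the fix, the NaN entry was folded into the count for level
# value 2 (hence the 4 above); after the fix it gets its own NaN row.
ts.count(level=1)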

In [6]: from string import ascii_lowercase

In [7]: np.random.seed(2718281)

In [8]: n = 1 << 21

In [9]: df = DataFrame({
   ...:     '1st':np.random.choice(list(ascii_lowercase), n),
   ...:     '2nd':np.random.randint(0, n // 100, n),
   ...:     '3rd':np.random.randn(n).round(3)})

In [10]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan


In [11]: gr = df.groupby(['1st', '2nd'])['3rd']

In [12]: %timeit gr.count()
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 86.4 ms per loop

In [13]: %timeit gr.count()
10 loops, best of 3: 87 ms per loop

on this branch:

In [5]: ts.count(level=1)
Out[5]:
 1     2
 2     3
NaN    1
dtype: int64

...

In [12]: %timeit gr.count()
The slowest run took 12.29 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 43.1 ms per loop

In [13]: %timeit gr.count()
10 loops, best of 3: 43.5 ms per loop

jreback added the Groupby and Performance (memory or execution speed) labels on Aug 31, 2015
jreback added this to the 0.17.0 milestone on Aug 31, 2015
index=level_index).__finalize__(self)
mask = lab == -1                 # -1 codes mark NaN labels in the index level
if mask.any():
    lab[mask] = cnt = len(lev)   # remap NaN codes to an extra bin past the last level value
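
In outline, the hunk above handles the factorized level codes, where -1 marks a NaN label. A standalone numpy sketch of the remapping follows (the surrounding pandas internals are elided; the wrap-around shown is one way a -1 code mis-attributes the NaN entry to the last level value, matching the [2, 4] output above):

import numpy as np

# Factorized codes for the second index level ([1, 2]); -1 marks NaN
lab = np.array([0, 1, 1, -1, 0, 1])
lev_size = 2  # number of non-null level values

# A -1 code used as an array index wraps to the last bin, crediting
# the NaN entry to level value 2:
buggy = np.zeros(lev_size, dtype=int)
np.add.at(buggy, lab, 1)
print(buggy)  # [2 4] -- level value 2 is over-counted

# The fix remaps -1 to a fresh bin (len(lev)) so NaN counts separately:
fixed_lab = np.where(lab == -1, lev_size, lab)
counts = np.bincount(fixed_lab, minlength=lev_size + 1)
print(counts)  # [2 3 1] -> counts for 1, 2, NaN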
jreback (Contributor):

Rather than adding this almost identical code, doesn't it make sense to do:

return self.groupby(level=level).count().__finalize__(self)

for the level-not-None case?

behzadnouri (Contributor, Author):

This is just removing the bug from the current implementation. I don't think having an optional level argument would be useful here, as it is basically equivalent to a groupby on the level. That said, groupby removes nulls from the keys:

In [5]: ts
Out[5]: 
a  1      0
   2      1
b  2      2
   NaN    3
c  1      4
   2      5
dtype: int64

In [6]: ts.groupby(level=1).count()
Out[6]: 
1    2
2    3
dtype: int64
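
For comparison, one way to keep the null keys visible with plain groupby is to substitute a sentinel before grouping (a workaround sketch, not what the PR does; it assumes the sentinel does not collide with a real level value):

import numpy as np

keys = ts.index.get_level_values(1).values  # float array containing NaN
sentinel = -1.0                             # assumed free value
ts.groupby(np.where(np.isnan(keys), sentinel, keys)).count()
# -1.0    1
#  1.0    2
#  2.0    3
# dtype: int64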

jreback (Contributor):

Not sure I understand. The expression I gave above is equivalent to what you wrote, yes? It is just dispatching to the groupby implementation (which is how all of the other stat functions that accept level work).

behzadnouri (Contributor, Author):

> the expression I gave above is equivalent to what you wrote, yes?

Yes, if the index does not include NaN.

jreback (Contributor):

My point is: why have a special-case implementation when simply using s.groupby(level=level).count() is acceptable? That is what all of the other make_stat_* functions do. Since this was a bug, nothing is even being eliminated.

jreback (Contributor) commented Sep 2, 2015:

@behzadnouri please rebase and make the changes as above.

Commit: BUG: closes bug in Series.count when index has nulls
jreback (Contributor) commented Sep 5, 2015:

@behzadnouri can you update according to comments

behzadnouri (Contributor, Author):

@jreback the added test will fail if I change it to what you suggest.

jreback (Contributor) commented Sep 5, 2015:

@behzadnouri I just don't think it's worth it to support this kind of behavior in Series.count, which is inconsistent with how all the other stat functions handle levels; they just dispatch to groupby.

behzadnouri (Contributor, Author):

Then please go ahead and change it.

jreback (Contributor) commented Sep 5, 2015:

#443 would fix this, e.g. we need an option like:

s.count(level=1, skipna=False)
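
The skipna flag above was a proposal at the time, not an existing option. For reference, the NaN group handling jreback anticipates below eventually landed as groupby's dropna flag (pandas 1.1+), which gives these semantics:

# pandas >= 1.1 only; neither this nor skipna existed when this PR was open
ts.groupby(level=1, dropna=False).count()
# 1.0    2
# 2.0    3
# NaN    1
# dtype: int64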

jreback (Contributor) commented Sep 5, 2015:

Hmm, looks like my way would break some existing tests... OK, will merge as is; when we eventually add NaN group handling in groupby, we can simplify this code.

behzadnouri (Contributor, Author):

need rebase?

jreback pushed a commit that referenced this pull request Sep 5, 2015
BUG: closes bug in Series.count when index has nulls
jreback (Contributor) commented Sep 5, 2015:

I just did it, thanks.

jreback (Contributor) commented Sep 5, 2015:

merged via 33723f9

jreback closed this on Sep 5, 2015
behzadnouri deleted the grby-count branch on September 5, 2015 at 16:46
nickeubank pushed a commit to nickeubank/pandas that referenced this pull request Sep 29, 2015
BUG: closes bug in Series.count when index has nulls