
PERF: improves performance in SeriesGroupBy.count #10946


Closed
behzadnouri wants to merge 1 commit from the grby-count branch

Conversation

behzadnouri (Contributor):

BUG: closes bug in Series.count when index has nulls

In [4]: ts
Out[4]:
a  1      0
   2      1
b  2      2
   NaN    3
c  1      4
   2      5
dtype: int64

In [5]: ts.count(level=1)
Out[5]:
1    2
2    4          # <<< BUG!
dtype: int64
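
The construction of ts is not shown in the report; below is a minimal sketch that reproduces the example, assuming a two-level MultiIndex with a NaN label (Series.count(level=...) was the API at the time, later deprecated in favor of groupby):

import numpy as np
import pandas as pd

# Two-level MultiIndex with a NaN label in the second level
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 2), ('b', np.nan), ('c', 1), ('c', 2)])
ts = pd.Series(range(6), index=idx)

# Before the fix, the NaN entry was folded into the count for level
# value 2 (hence the 4 above); after the fix it gets its own NaN row.
ts.count(level=1)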

In [6]: from string import ascii_lowercase

In [7]: np.random.seed(2718281)

In [8]: n = 1 << 21

In [9]: df = DataFrame({
   ...:     '1st':np.random.choice(list(ascii_lowercase), n),
   ...:     '2nd':np.random.randint(0, n // 100, n),
   ...:     '3rd':np.random.randn(n).round(3)})

In [10]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan


In [11]: gr = df.groupby(['1st', '2nd'])['3rd']

In [12]: %timeit gr.count()
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 86.4 ms per loop

In [13]: %timeit gr.count()
10 loops, best of 3: 87 ms per loop

on this branch:

In [5]: ts.count(level=1)
Out[5]:
 1     2
 2     3
NaN    1
dtype: int64

...

In [12]: %timeit gr.count()
The slowest run took 12.29 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 43.1 ms per loop

In [13]: %timeit gr.count()
10 loops, best of 3: 43.5 ms per loop

jreback added the Groupby and Performance (memory or execution speed) labels on Aug 31, 2015
jreback added this to the 0.17.0 milestone on Aug 31, 2015
index=level_index).__finalize__(self)
mask = lab == -1                 # -1 codes mark NaN labels in the index level
if mask.any():
    lab[mask] = cnt = len(lev)   # remap NaN codes to an extra bin past the last level value
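
In outline, the hunk above handles the factorized level codes, where -1 marks a NaN label. A standalone numpy sketch of the remapping follows (the surrounding pandas internals are elided; the wrap-around shown is one way a -1 code mis-attributes the NaN entry to the last level value, matching the [2, 4] output above):

import numpy as np

# Factorized codes for the second index level ([1, 2]); -1 marks NaN
lab = np.array([0, 1, 1, -1, 0, 1])
lev_size = 2  # number of non-null level values

# A -1 code used as an array index wraps to the last bin, crediting
# the NaN entry to level value 2:
buggy = np.zeros(lev_size, dtype=int)
np.add.at(buggy, lab, 1)
print(buggy)  # [2 4] -- level value 2 is over-counted

# The fix remaps -1 to a fresh bin (len(lev)) so NaN counts separately:
fixed_lab = np.where(lab == -1, lev_size, lab)
counts = np.bincount(fixed_lab, minlength=lev_size + 1)
print(counts)  # [2 3 1] -> counts for 1, 2, NaN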
jreback (Contributor):

Rather than adding this almost identical code, doesn't it make sense to do:

return self.groupby(level=level).count().__finalize__(self)

for the level-not-None case?

behzadnouri (Contributor, Author):

This is just removing the bug from the current implementation. I don't think having an optional level argument would be useful here, as it is basically equivalent to a groupby on the level. That said, groupby removes nulls from the keys:

In [5]: ts
Out[5]: 
a  1      0
   2      1
b  2      2
   NaN    3
c  1      4
   2      5
dtype: int64

In [6]: ts.groupby(level=1).count()
Out[6]: 
1    2
2    3
dtype: int64
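
For comparison, one way to keep the null keys visible with plain groupby is to substitute a sentinel before grouping (a workaround sketch, not what the PR does; it assumes the sentinel does not collide with a real level value):

import numpy as np

keys = ts.index.get_level_values(1).values  # float array containing NaN
sentinel = -1.0                             # assumed free value
ts.groupby(np.where(np.isnan(keys), sentinel, keys)).count()
# -1.0    1
#  1.0    2
#  2.0    3
# dtype: int64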

jreback (Contributor):

Not sure I understand. The expression I gave above is equivalent to what you wrote, yes? It is just dispatching to the groupby implementation (which is how all of the other stat functions that accept level work).

behzadnouri (Contributor, Author):

> the expression I gave above is equivalent to what you wrote, yes?

Yes, if the index does not include NaN.

jreback (Contributor):

My point is: why have a special-case implementation when simply using s.groupby(level=level).count() is acceptable? That is what all of the other make_stat_* functions do. Since this was a bug, nothing is even being eliminated.

jreback (Contributor) commented Sep 2, 2015:

@behzadnouri please rebase and make the changes as above.

Commit: BUG: closes bug in Series.count when index has nulls
jreback (Contributor) commented Sep 5, 2015:

@behzadnouri can you update according to comments

behzadnouri (Contributor, Author):

@jreback the added test will fail if I change it to what you suggest.

jreback (Contributor) commented Sep 5, 2015:

@behzadnouri I just don't think it's worth it to support this kind of behavior in Series.count, which is inconsistent with how all the other stat functions handle levels; they just dispatch to groupby.

behzadnouri (Contributor, Author):

Then please go ahead and change it.

jreback (Contributor) commented Sep 5, 2015:

#443 would fix this, e.g. we need an option like:

s.count(level=1, skipna=False)
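
The skipna flag above was a proposal at the time, not an existing option. For reference, the NaN group handling jreback anticipates below eventually landed as groupby's dropna flag (pandas 1.1+), which gives these semantics:

# pandas >= 1.1 only; neither this nor skipna existed when this PR was open
ts.groupby(level=1, dropna=False).count()
# 1.0    2
# 2.0    3
# NaN    1
# dtype: int64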

jreback (Contributor) commented Sep 5, 2015:

Hmm, looks like my way would break some existing tests... OK, will merge as is; when we eventually add NaN group handling in groupby, we can simplify this code.

behzadnouri (Contributor, Author):

need rebase?

jreback pushed a commit that referenced this pull request Sep 5, 2015
BUG: closes bug in Series.count when index has nulls
jreback (Contributor) commented Sep 5, 2015:

I just did it, thanks.

jreback (Contributor) commented Sep 5, 2015:

merged via 33723f9

jreback closed this on Sep 5, 2015
behzadnouri deleted the grby-count branch on September 5, 2015 at 16:46
nickeubank pushed a commit to nickeubank/pandas that referenced this pull request Sep 29, 2015
BUG: closes bug in Series.count when index has nulls