PERF: improves performance in GroupBy.cumcount #11039
Conversation
indices.append(v)
run = np.r_[True, ids[:-1] != ids[1:]]
rep = np.diff(np.r_[np.nonzero(run)[0], count])
out = (~run).cumsum()
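For context, here is a minimal sketch of what the run/rep/out trick in the diff above computes, assuming ids is an array of group labels already in group order and count is its length (the sample labels are illustrative, not from the PR):

```python
import numpy as np

# Hypothetical group labels, already contiguous per group (assumption of this sketch).
ids = np.array([0, 0, 1, 1, 1, 2])
count = len(ids)

run = np.r_[True, ids[:-1] != ids[1:]]           # True at the start of each group run
rep = np.diff(np.r_[np.nonzero(run)[0], count])  # length of each run
out = (~run).cumsum()                            # global counter that skips run starts
cumcount = out - np.repeat(out[run], rep)        # subtract each run's offset -> 0,1,2,... per group
print(cumcount)  # [0 1 0 1 2 0]
```

Subtracting the run-start offset is what turns the single global cumsum into a per-group counter without any Python-level loop over groups.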
Actually np.intp is ok. It's this:
In [1]: np.array([True,False,True]).cumsum()
Out[1]: array([1, 1, 2])
In [2]: np.array([True,False,True]).cumsum().dtype
Out[2]: dtype('int32')
Must be some weird casting on Windows. Providing an accumulator dtype seems to work.
In [3]: np.array([True,False,True]).cumsum(dtype='int64').dtype
Out[3]: dtype('int64')
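The workaround above can be exercised directly: passing dtype to cumsum pins the accumulator width regardless of the platform's default integer size (the sample array here is illustrative):

```python
import numpy as np

run = np.array([True, False, True, False, False])

# Explicit accumulator dtype makes the result int64 on all platforms,
# instead of the platform default (int32 on Windows).
out = (~run).cumsum(dtype=np.int64)
print(out)        # [0 1 1 2 3]
print(out.dtype)  # int64
```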
Well, I am not adding a new feature or closing a bug, so the tests already there should be fine.
assert_series_equal(g.B.nth(0), df.B.iloc[[0, 2]])
assert_series_equal(g.B.nth(1), df.B.iloc[[1]])
assert_frame_equal(g.nth(-3), df.loc[[]].set_index('A'))
assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]])
You need to document this. It was incorrect before (the index was not set correctly), so it's a bug-fix. Please show an example in the whatsnew.
This is even an API change I would say (or at least mention it there). nth on a SeriesGroupBy has always been like that. The method on DataFrameGroupBy was made a reducing method instead of a filtering method on purpose in 0.14 (#7044), but I think we forgot SeriesGroupBy then?

Also the docstring of nth is still incorrect I see (but for frame, so that is a separate issue).
@jreback I am not sure what you want me to put in the whatsnew; such things would be much more efficient if you take care of it yourself. Even if I do it, it may not be what you had in mind, and then we have to keep going back and forth.

The other thing is that this whole index and dropna handling is broken; in addition to #11038 there is this:
>>> df
1st 2nd
0 a 0
1 a 1
2 a 2
3 b NaN
4 b 4
>>> df.groupby('1st')['2nd'].nth(0, dropna=False)
0 0
3 NaN
Name: 2nd, dtype: float64
>>> df.groupby('1st')['2nd'].nth(0, dropna=True)
1st
a NaN
b NaN
Name: 2nd, dtype: float64
So I am not sure there is a point in documenting it when it is not working.

@behzadnouri can you do a whatsnew note
Are we sure we want to change this? This has always been like that, I think? (but indeed, inconsistent between SeriesGroupBy and DataFrameGroupBy ..)

@hayd this behaviour is already in 0.14.1 (while the list enhancement was only added in 0.15.0)
I think this should be changed as @behzadnouri suggests. The code is MUCH simpler, and everything would be consistent. I think what
This is really a bug-fix, but since a fundamental change (e.g. a position-based index vs the original index) is being returned, it ought to be highlighted to the user. This just needs a simple Previous Behavior / New Behavior section, with a code-block from the previous and the actual output from the new.
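A Previous/New Behavior subsection of the kind being requested might look like the sketch below, following the pandas whatsnew conventions; the example values and entry title here are illustrative, not the actual whatsnew text:

```rst
Groupby ``nth`` on a Series now filters by position (:issue:`11039`)

Previous Behavior:

.. code-block:: ipython

   In [1]: df.groupby('A')['B'].nth(0)
   Out[1]:
   0    0
   1    1
   Name: B, dtype: int64

New Behavior:

.. ipython:: python

   df.groupby('A')['B'].nth(0)
```

The Previous Behavior block is a static `code-block` (the old output can no longer be reproduced), while the New Behavior block is a live `ipython` directive, which is the split asked for later in this thread.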
v0.16.2
This PR
nth was supposed to be of the "filtering" type rather than the aggregating type. Hence the v0.16.2 behaviour.
@hayd right! I forgot about that. So what should we do with all of this? The prior code had quite a number of 'hacks' to support that exact behavior.
Moving to next version. This needs to not change the API. The result index can be set differently at the end.
The API is the function signature, which does not change. If you mean the behaviour, it is inconsistent between Series and DataFrame, and even within Series if there are nulls; it seems to be a bug, not by design. So at some point you need to make Series behave like frames or the other way around.

>>> df
A B
0 a 0
1 b 1
>>> df.groupby('A').nth(0)
B
A
a 0
b 1
>>> df.groupby('A')['B'].nth(0)
0 0
1 1
Name: B, dtype: int64
>>> ts
0 0
1 NaN
dtype: float64
>>> ts.groupby(Series(['a', 'b'])).nth(0, dropna=True)
a NaN
b NaN
dtype: float64
Series should behave like DataFrame (and filter rather than agg). I was sure these worked the same on Series, or that it did behind a flag (which was not able to be the default due to old behaviour).
@hayd I think we are not using the same terminology :-) Currently, the nth DataFrame behaviour is aggregating/reducing (so setting the grouping keys as the index), while the Series behaviour is filtering (keeping the original index labels), so the other way around from what I understand from your comment.

Originally

The introduction of the possibility to pass lists to get multiple values complicated this a bit, as a real aggregating method will only return one value per group. Further, with the introduction of
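The aggregating-vs-filtering terminology in this exchange can be seen side by side in a small sketch (the data is illustrative; note that in recent pandas versions nth itself acts as a filter, keeping the original row labels, while a reducer like first sets the group keys as the index):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3]})

agg = df.groupby('A').first()  # reducing: the group keys become the index
print(agg.index.tolist())      # ['a', 'b']

fil = df.groupby('A').nth(0)   # one row per group; recent pandas keeps original labels
print(len(fil))                # 2
```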
IIRC (and it has been a while), we want to make
@jreback and that is true for a DataFrame, but not Series (so this PR makes that analogy more consistent)
Now, for a solution: I think we just have to make a choice.
If we change something, I would go for 2), as the filtering behaviour is easy to get by using
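The sentence above is truncated in the source, so the specific helper it named is lost; one illustrative way to recover the filtering result by hand is via the grouper's positional indices (names and data here are assumptions of this sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'], 'B': [10, 20, 30]})
g = df.groupby('A')

# GroupBy.indices maps each group key to the positional rows of that group;
# taking the first position of each group gives a filter-style nth(0)
# that keeps the original index labels.
first_positions = sorted(pos[0] for pos in g.indices.values())
filtered = df.iloc[first_positions]
print(filtered.index.tolist())  # [0, 2]
```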
Ahh, ok, this was partially a consistency-fix then. I would agree with 2).
Some other inconsistencies:
These are already noted in #5755 (though not so explicitly).
But here it is possibly on purpose, as for
Can you rebase/update?
e69ded7 to b0c973c
@jreback rebased
@behzadnouri ok, thanks. Can you put in a whatsnew sub-section that shows the changes as a user would care about them, IOW an example (you can take it from the tests) showing what used to happen and the (correct) new way.
29c7fb8 to 6115d4c
@jreback added an example to whatsnew showing the changes
New Behavior:

.. code-block:: ipython
use an ipython block
use ipython blocks in the new
@jreback from what I see in similar cases, this is already in the correct form; see v0.18.1.txt#L159 as an example.
Yes for the Previous Behavior, not the New. Please make the change.
6115d4c to 3b08468
@jreback moved to the API changes section
"""
arr is where cumcount gets its values from
can you add a Parameters section
3b08468 to 1ffab15
@jreback made the changes
Out[5]:
0    1
1    2
Name: B, dtype: int64
Show the same as_index=False here as well (as you are showing below).
The point here is that as_index=True is ignored in the old behaviour; as_index=False is not relevant or informative (and has not changed).

This has been going back and forth too many times. If you would like to add/modify anything, please do so on your end.
If you are showing it in the new, then please show it in the original.

"this has been going back and forth for too many times"

Well, that's just how it is. The docs have to be in the proper format and be consistent.
@jreback please go ahead and make any changes you find necessary
@behzadnouri of course, but that's not the point, is it. Ok, thank you for the PR.
@jorisvandenbossche @TomAugspurger comments?
Looks good on a quick skim. @behzadnouri you seem to have also fixed a bug where sort=False was being ignored:

# master
In [5]: df.groupby('c', sort=True).nth(1)
Out[5]:
a b
c
0 -0.029029 0.565333
1 0.186213 1.110464
2 0.982333 -0.544459
3 -0.626740 -0.541241
In [6]: df.groupby('c', sort=False).nth(1)
Out[6]:
a b
c
0 -0.029029 0.565333
1 0.186213 1.110464
2 0.982333 -0.544459
3 -0.626740 -0.541241

vs.

# your branch
In [1]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])
In [2]: df['c'] = np.random.randint(0, 4, 100)
In [3]: df.groupby('c', sort=True).nth(1)
Out[3]:
a b
c
0 -1.168300 -2.224763
1 -0.562298 -1.262734
2 -0.439613 0.236592
3 -0.499235 -0.808404
In [4]: df.groupby('c', sort=False).nth(1)
Out[4]:
a b
c
1 -0.562298 -1.262734
2 -0.439613 0.236592
3 -0.499235 -0.808404
0 -1.168300 -2.224763

Would be nice to have a release note for that too (could maybe do it on merge?).
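The sort-flag distinction being pointed out can be illustrated with a reducing method whose behaviour is stable across pandas versions (a sketch with made-up data, not the PR's code):

```python
import pandas as pd

df = pd.DataFrame({'c': [2, 0, 2, 0], 'a': [1, 2, 3, 4]})

by_key = df.groupby('c', sort=True).first()     # groups ordered by key value
by_appear = df.groupby('c', sort=False).first() # groups in order of first appearance
print(by_key.index.tolist())     # [0, 2]
print(by_appear.index.tolist())  # [2, 0]
```

The bug on master was that sort=False produced the sorted ordering anyway; on the branch the appearance order is respected, as the two In/Out pairs above show.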
Thanks @behzadnouri. @TomAugspurger, your suggestions were incorporated as well!
closes #12839