BUG: group_shift_indexer checks for null group keys #13819
Conversation
Upon further consideration I came to the conclusion that the fix could also have looked like this (inside the group_shift_indexer loop):

# Skip null keys
if lab == -1:
    out[ii] = -1
    continue
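For concreteness, here is a pure-Python model of what the shift indexer does with such a check in place. This is my own illustration, not the pandas implementation (which is a Cython kernel working on preallocated arrays); the function name and bookkeeping are made up, but rows whose label is -1 (a missing key) would be treated the same way: they never receive and never donate a value.

import numpy as np
from collections import defaultdict, deque

def shift_indexer_py(labels, periods):
    # Toy model: for each row, find the position of the row `periods` steps
    # earlier within the same group, or -1 if there is none.  Assumes
    # periods != 0.  labels[i] == -1 marks a row with a missing group key.
    n = len(labels)
    out = np.full(n, -1, dtype=np.int64)
    # iterate forwards for a positive shift, backwards for a negative one
    order = range(n) if periods > 0 else range(n - 1, -1, -1)
    seen = defaultdict(lambda: deque(maxlen=abs(periods)))
    for ii in order:
        lab = labels[ii]
        if lab == -1:       # skip null keys: leave the row unfilled
            out[ii] = -1
            continue
        q = seen[lab]
        if len(q) == abs(periods):
            out[ii] = q[0]  # the row |periods| steps back in this group
        q.append(ii)
    return out

For example, shift_indexer_py(np.array([0, 1, -1, 0, 1, -1]), 1) gives [-1, -1, -1, 0, 1, -1]: the rows with label -1 neither receive nor provide a fill position.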
Now, this raises the question of how groups with partially missing keys should be handled:
For example, as far as I understand, in SQL WHERE clauses NULL != NULL, whereas in GROUP BY clauses NULL == NULL. This means that if pandas's groupby() were to be compatible with SQL, it should perform partial key matching. What do you think, @jreback?
For example, how should the shifted grouping by ['B', 'C'] of the dataframe with NaN values in column C
differ from the one where C is fully present (both are constructed in the code below)?
The shifted grouping of the latter gives
In the case of partial groupby keys the second alternative could yield:
I used this code (a short note on pandas's current treatment of NaN keys follows it):

import pandas as pd

# dataframe with partially missing group keys (NaN in column C)
df = pd.DataFrame([(i % 12, i % 3 if i % 3 else float("nan"), i)
                   for i in range(17)], dtype=float,
                  columns=["B", "C", "F"], index=None)

# dataframe with fully present group keys
df = pd.DataFrame([(i % 12, i % 3, i) for i in range(17)], dtype=float,
                  columns=["B", "C", "F"], index=None)

result = df.groupby(['B', 'C']).shift(-1)
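As a side note (my own illustration, not from the thread): in the pandas versions discussed here, and by default in current ones, rows whose group key contains NaN are simply dropped from the grouping, which is why their shifted values come back as NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({"C": [1.0, np.nan, 1.0, np.nan], "F": [0.0, 1.0, 2.0, 3.0]})

# Rows with a NaN key do not form (or join) any group...
print(df.groupby("C").size())          # only the C == 1.0 group is counted

# ...so a within-group shift leaves them as NaN:
print(df.groupby("C")["F"].shift(-1))  # [2.0, NaN, NaN, NaN]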
Current coverage is 85.25% (diff: 100%)

@@            master    #13819    diff @@
==========================================
  Files          140       140
  Lines        50449     50449
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
  Hits         43008     43008
  Misses        7441      7441
  Partials         0         0
pls add some tests
Ok, I will make a concise test and add it to test_groupby.py.
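Purely for illustration, a minimal sketch of what such a test could look like (this is not the test that was actually added; the expected values are taken from the Out[3] shown further below):

import numpy as np
import pandas as pd
import pandas.util.testing as tm   # 0.18-era helpers; pandas.testing in current versions

def test_group_shift_with_null_keys():
    # Column C is NaN whenever i % 3 == 0, so the (B, C) group labels
    # contain -1 exactly at those rows.
    df = pd.DataFrame([(i % 12, i % 3 if i % 3 else np.nan, i)
                       for i in range(17)], dtype=float,
                      columns=["B", "C", "F"], index=None)
    result = df.groupby(["B", "C"]).shift(-1)

    # Only the groups that actually contain two rows get a shifted value;
    # NaN-key rows and singleton groups stay NaN.
    expected = pd.Series(np.nan, index=df.index, name="F")
    expected.iloc[[1, 2, 4]] = [13.0, 14.0, 16.0]
    tm.assert_series_equal(result["F"], expected)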
Note that the file you're editing is generated from a template, so you will need to make the change here:
I think your second approach is correct (or at least consistent) - in fact the current impl seems to handle it?
In [1]: pd.__version__
Out[1]: u'0.18.1'
In [2]: df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
...: for i in range(17)], dtype=float,
...: columns=["B", "C", "F"], index=None)
In [3]: df.groupby(['B', 'C']).shift(-1)
Out[3]:
F
0 NaN
1 13.0
2 14.0
3 NaN
4 16.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN

edit: ok, actually it seems to be dumb luck that the current version (which I had written) gives this answer - it has bounds issues, but just sometimes doesn't crash (see the short illustration after the 0.16.1 output below). But it is also what 0.16.1 gives:

In [7]: pd.__version__
Out[7]: '0.16.1'
In [8]: df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
...: for i in range(17)], dtype=float,
...: columns=["B", "C", "F"], index=None)
In [9]: df.groupby(['B', 'C']).shift(-1)
Out[9]:
F
0 NaN
1 13
2 14
3 NaN
4 16
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
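To illustrate the kind of bounds problem mentioned above (my own toy example, not the pandas code): a missing group key is encoded as the label -1, and using that label to index a per-group bookkeeping array silently touches the last group's slot in plain NumPy/Python, while in a Cython kernel where wraparound and bounds checks are typically turned off the same access goes out of bounds - which is why the unpatched code only sometimes fails.

import numpy as np

label_seen = np.zeros(3, dtype=np.int64)   # one bookkeeping slot per real group
lab = -1                                   # label of a row whose group key is NaN

# Without an explicit `lab == -1` check this "works" in NumPy by wrapping
# around to the last slot; with checks disabled in Cython the same access
# would read or write past the end of the buffer.
label_seen[lab] += 1
print(label_seen)                          # [0 0 1]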
@@ -1356,6 +1356,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
    ## reverse iterator if shifting backwards
    ii = offset + sign * i
    lab = labels[ii]
do not modify pxi directly. It is generated from corresponding pxi.in.
# values in column `B`, and then group by [`A`, `B`]. This should
# force `-1` in the `labels` array of `gr_.grouper.group_info` exactly
# at those places where the group-by key is partially missing.
df = pd.DataFrame([(i % 12, i % 3 if i % 3 else float("nan"), i) |
use np.nan
@ivannz Something went wrong with that build, but probably not your fault. I restarted the build
Great! Thank you, @jorisvandenbossche.
thanks @ivannz
git diff upstream/master | flake8 --diff