BUG: group_shift_indexer checks for null group keys #13819

ivannz · 2016-07-27T13:07:58Z

closes Groupby and shift causing coredump #13813
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

ivannz · 2016-07-27T14:01:31Z

On further consideration, I concluded that the fix could also have looked like this:

...
                # Skip null keys
                if lab == -1:
                    out[ii] = -1
                    continue
...
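
To make the effect of this check concrete, here is a minimal pure-Python sketch of a shift indexer built around the same skip. It is an illustration only, not the Cython code from the pandas template: the function name `shift_indexer_sketch`, the list-based buffers, and the sample labels are all assumptions made for this example. It assumes `labels` holds one integer group code per row, with -1 marking a null key.

```python
def shift_indexer_sketch(labels, ngroups, periods):
    """For each row, return the position of the row that supplies its
    shifted value within the same group, or -1 if there is none."""
    n = len(labels)
    if periods == 0:
        return list(range(n))
    out = [-1] * n
    k = abs(periods)
    # Walk the rows in reverse when shifting backwards (periods < 0).
    order = range(n) if periods > 0 else range(n - 1, -1, -1)
    seen = [0] * ngroups                         # rows seen so far per group
    recent = [[0] * k for _ in range(ngroups)]   # last k positions per group
    for ii in order:
        lab = labels[ii]
        # Skip null keys: they belong to no group, so the result is missing.
        if lab == -1:
            out[ii] = -1
            continue
        seen[lab] += 1
        slot = seen[lab] % k
        if seen[lab] > k:
            out[ii] = recent[lab][slot]
        recent[lab][slot] = ii
    return out


labels = [0, 1, -1, 0, 1, -1]
forward = shift_indexer_sketch(labels, ngroups=2, periods=1)
backward = shift_indexer_sketch(labels, ngroups=2, periods=-1)
print(forward)   # [-1, -1, -1, 0, 1, -1]
print(backward)  # [3, 4, -1, -1, -1, -1]
```

Without the check, a label of -1 would be used as an index into the per-group buffers. In plain Python that silently wraps around to the last group; in compiled code without bounds checking it would presumably touch memory outside the buffers, which would be consistent with the crash reported in #13813.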

Now, this raises the question of how groups with partially missing keys should be handled:

either partial matching of group keys is used, in which case each -1 in labels should be converted to a proper group identifier (probably by using get_group_index with xnull=False in _get_compressed_labels);
or incomplete keys are ignored (as in nunique() and value_counts()), and in shift() specifically the corresponding rows are filled with missing values.

As far as I understand, in SQL a WHERE comparison between two NULLs never matches (NULL = NULL is not true), whereas GROUP BY places all NULL keys in the same group. This means that if pandas's groupby() were to be compatible with SQL, it should perform partial key matching.
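
The SQL behaviour described here can be checked with a small sketch using Python's standard sqlite3 module. The table `t` and its columns `k` and `v` are made up for this example, and only SQLite is exercised; other engines are assumed, not shown, to behave the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(None, 1), (None, 2), (1, 3)],
)

# In WHERE, comparing with NULL is never true, so no row matches.
where_rows = conn.execute("SELECT v FROM t WHERE k = NULL").fetchall()

# In GROUP BY, all NULL keys fall into a single group.
group_rows = conn.execute("SELECT k, COUNT(*) FROM t GROUP BY k").fetchall()

print(where_rows)
print(group_rows)
conn.close()
```

The first query returns no rows, while the second returns one group for the two NULL keys and one group for the key 1.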

What do you think, @jreback?

For example, how should the shifted grouping by ['B', 'C'] of this

       B    C     F
0    0.0  NaN   0.0
1    1.0  1.0   1.0
2    2.0  2.0   2.0
3    3.0  NaN   3.0
4    4.0  1.0   4.0
5    5.0  2.0   5.0
6    6.0  NaN   6.0
7    7.0  1.0   7.0
8    8.0  2.0   8.0
9    9.0  NaN   9.0
10  10.0  1.0  10.0
11  11.0  2.0  11.0
12   0.0  NaN  12.0
13   1.0  1.0  13.0
14   2.0  2.0  14.0
15   3.0  NaN  15.0
16   4.0  1.0  16.0

differ from that of this?

       B    C     F
0    0.0  0.0   0.0
1    1.0  1.0   1.0
2    2.0  2.0   2.0
3    3.0  0.0   3.0
4    4.0  1.0   4.0
5    5.0  2.0   5.0
6    6.0  0.0   6.0
7    7.0  1.0   7.0
8    8.0  2.0   8.0
9    9.0  0.0   9.0
10  10.0  1.0  10.0
11  11.0  2.0  11.0
12   0.0  0.0  12.0
13   1.0  1.0  13.0
14   2.0  2.0  14.0
15   3.0  0.0  15.0
16   4.0  1.0  16.0

The shifted grouping of the latter gives

       F
0   12.0
1   13.0
2   14.0
3   15.0
4   16.0
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN

For the frame with partially missing group keys, the second alternative (ignoring incomplete keys) could yield:

       F
0    NaN
1   13.0
2   14.0
3    NaN
4   16.0
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN

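I used this code (the two DataFrame definitions are alternatives; run one or the other before the groupby):

import pandas as pd

# Variant with partially missing keys: NaN in column C wherever i % 3 == 0
df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
                   for i in range(17)], dtype=float,
                  columns=["B", "C", "F"], index=None)

# Variant with complete keys: 0.0 in column C instead of NaN
df = pd.DataFrame([(i%12, i%3, i) for i in range(17)], dtype=float,
                  columns=["B", "C", "F"], index=None)

result = df.groupby(['B', 'C']).shift(-1)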
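
The two candidate behaviours can also be compared without pandas. The sketch below is plain Python with a hypothetical helper, `grouped_shift_back`, that shifts F by -1 within (B, C) groups and either lets NaN key components match each other or leaves rows with incomplete keys missing. It is not how pandas implements shift(); missing results are represented by None rather than NaN.

```python
import math


def grouped_shift_back(rows, null_keys_match):
    """Shift F by -1 within groups keyed by (B, C).

    rows: list of (B, C, F) tuples.
    null_keys_match: if True, NaN key components compare equal to each
    other; if False, rows with an incomplete key stay missing (None).
    """
    def normalize(key):
        # Replace NaN by None so that missing components hash and compare equal.
        return tuple(
            None if isinstance(x, float) and math.isnan(x) else x for x in key
        )

    out = [None] * len(rows)
    next_in_group = {}  # key -> F of the nearest later row in that group
    for i in range(len(rows) - 1, -1, -1):
        b, c, f = rows[i]
        key = normalize((b, c))
        if None in key and not null_keys_match:
            continue  # incomplete key: leave the result missing
        out[i] = next_in_group.get(key)
        next_in_group[key] = f
    return out


rows = [
    (float(i % 12), float(i % 3) if i % 3 else float("nan"), float(i))
    for i in range(17)
]
ignored = grouped_shift_back(rows, null_keys_match=False)
matched = grouped_shift_back(rows, null_keys_match=True)
print(ignored[:5])  # [None, 13.0, 14.0, None, 16.0]
print(matched[:5])  # [12.0, 13.0, 14.0, 15.0, 16.0]
```

With `null_keys_match=False` the first five results agree with the second table above (NaN at rows 0 and 3); with `null_keys_match=True` they agree with the first, which is what the frame with 0.0 in place of NaN produces.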

codecov-io · 2016-07-27T20:28:07Z

Current coverage is 85.25% (diff: 100%)

Merging #13819 into master will not change coverage

@@             master     #13819   diff @@
==========================================
  Files           140        140          
  Lines         50449      50449          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43008      43008          
  Misses         7441       7441          
  Partials          0          0

Powered by Codecov. Last update 2c55f28...bddf799

jreback · 2016-07-27T21:26:41Z

pls add some tests

ivannz · 2016-07-27T21:46:59Z

Ok, I will make a concise test and add it to test_groupby.py.

chris-b1 · 2016-07-27T23:52:48Z

Note that the file you're editing is generated from a template; you will need to make the change here:
https://github.com/pydata/pandas/blob/master/pandas/src/algos_groupby_helper.pxi.in

I think your second approach is correct (or at least consistent) - in fact the current impl seems to handle it?

In [1]: pd.__version__
Out[1]: u'0.18.1'

In [2]: df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
   ...:                    for i in range(17)], dtype=float,
   ...:                   columns=["B", "C", "F"], index=None)

In [3]: df.groupby(['B', 'C']).shift(-1)
Out[3]:
       F
0    NaN
1   13.0
2   14.0
3    NaN
4   16.0
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN

edit: OK, actually it seems to be dumb luck that the current version (which I wrote) gives this answer - it has bounds issues, but sometimes just doesn't crash. It is, however, also what shift() did before this bug was introduced.

In [7]: pd.__version__
Out[7]: '0.16.1'

In [8]: df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
   ...:                    for i in range(17)], dtype=float,
   ...:                   columns=["B", "C", "F"], index=None)

In [9]: df.groupby(['B', 'C']).shift(-1)
Out[9]: 
     F
0  NaN
1   13
2   14
3  NaN
4   16
5  NaN
6  NaN
7  NaN
8  NaN
9  NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN

sinhrks · 2016-07-28T00:23:56Z

pandas/src/algos_groupby_helper.pxi

@@ -1356,6 +1356,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
                ## reverse iterator if shifting backwards
                ii = offset + sign * i
                lab = labels[ii]
+


Do not modify the pxi file directly. It is generated from the corresponding pxi.in.

jreback · 2016-07-28T17:53:58Z

pandas/tests/test_groupby.py

+        # values in column `B`, and then group by [`A`, `B`]. This should
+        # force `-1` in `labels` array of `gr_.grouper.group_info` exactly
+        # at those places, where the group-by key is partially missing.
+        df = pd.DataFrame([(i % 12, i % 3 if i % 3 else float("nan"), i)


use np.nan

ivannz · 2016-07-29T07:56:53Z

Could somebody please tell me what I should do when Travis CI job #21159 for this PR fails, while the build for the same commit on my forked (and rebased) repository succeeded?

jorisvandenbossche · 2016-07-29T08:05:23Z

@ivannz Something went wrong with that build, but it is probably not your fault. I restarted the build.

ivannz · 2016-07-29T08:11:58Z

Great! Thank you, @jorisvandenbossche.

jreback · 2016-07-29T10:23:06Z

thanks @ivannz

jreback added Bug Groupby labels Jul 27, 2016

sinhrks reviewed Jul 28, 2016

ivannz force-pushed the issue13813fix branch from 92035f8 to 7adc9e2 Compare July 28, 2016 05:50

ivannz mentioned this pull request Jul 28, 2016

Segmentation fault or UnicodeDecodeError when reading csv-file depending on chunksize. #5291

Closed

jreback reviewed Jul 28, 2016

ivannz added 5 commits July 29, 2016 10:07

BUG: group_shift_indexer checks for null group keys

966d5c6

Treat incomplete group keys as distinct when shifting

fe2f0ec

Patched the template, and added a test for '.shift()'

94bae0b

minor flake8 style corrections

d92cf3c

Added bugfix description [ci skip]

eab8038

ivannz force-pushed the issue13813fix branch from 0798ee7 to bddf799 Compare July 29, 2016 07:25

Switched from float('nan') to np.nan

bddf799

jreback added this to the 0.19.0 milestone Jul 29, 2016

jreback closed this in 54b2777 Jul 29, 2016

ivannz deleted the issue13813fix branch July 29, 2016 11:12

jreback mentioned this pull request Aug 12, 2016

BUG: Kernel crashing when using shift with groupby when there are NaNs in group column #13976

Closed
