Skip to content

BUG: group_shift_indexer checks for null group keys #13819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

ivannz
Copy link
Contributor

@ivannz ivannz commented Jul 27, 2016

@ivannz
Copy link
Contributor Author

ivannz commented Jul 27, 2016

Upon further consideration I came to the conclusion that the FIX could also have looked like this:

...
                # Skip null keys
                if lab == -1:
                    out[ii] = -1
                    continue
...

Now, this raises the question of how groups with partially missing keys should be handled:

  • either partial matching of group keys is used, in which case the -1 in labels should be converted to proper group identifiers (probably, use get_group_index with xnull=False in _get_compressed_labels);
  • or the incomplete keys are ignored (like in nunique(), value_counts()), and specifically in shift() should be filled with missing values.

For example, as far as I understand, in SQL WHERE clauses NULL != NULL, but in GROUP BY clauses NULL == NULL. This means that if pandas's groupby() were to be compatible with SQL, it should perform partial key matching.

What do you think, @jreback?

For example how should shifted groupings by ['B', 'C'] of this

       B    C     F
0    0.0  NaN   0.0
1    1.0  1.0   1.0
2    2.0  2.0   2.0
3    3.0  NaN   3.0
4    4.0  1.0   4.0
5    5.0  2.0   5.0
6    6.0  NaN   6.0
7    7.0  1.0   7.0
8    8.0  2.0   8.0
9    9.0  NaN   9.0
10  10.0  1.0  10.0
11  11.0  2.0  11.0
12   0.0  NaN  12.0
13   1.0  1.0  13.0
14   2.0  2.0  14.0
15   3.0  NaN  15.0
16   4.0  1.0  16.0

differ from this

       B    C     F
0    0.0  0.0   0.0
1    1.0  1.0   1.0
2    2.0  2.0   2.0
3    3.0  0.0   3.0
4    4.0  1.0   4.0
5    5.0  2.0   5.0
6    6.0  0.0   6.0
7    7.0  1.0   7.0
8    8.0  2.0   8.0
9    9.0  0.0   9.0
10  10.0  1.0  10.0
11  11.0  2.0  11.0
12   0.0  0.0  12.0
13   1.0  1.0  13.0
14   2.0  2.0  14.0
15   3.0  0.0  15.0
16   4.0  1.0  16.0

The shifted grouping of the latter gives

       F
0   12.0
1   13.0
2   14.0
3   15.0
4   16.0
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN

In the case of partial groupby keys the second alternative could yield:

       F
0    NaN
1   13.0
2   14.0
3    NaN
4   16.0
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN

I used this code

import pandas as pd
df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
                   for i in range(17)], dtype=float,
                  columns=["B", "C", "F"], index=None)

df = pd.DataFrame([(i%12, i%3, i) for i in range(17)], dtype=float,
                  columns=["B", "C", "F"], index=None)

result = df.groupby(['B', 'C']).shift(-1)

@codecov-io
Copy link

codecov-io commented Jul 27, 2016

Current coverage is 85.25% (diff: 100%)

Merging #13819 into master will not change coverage

@@             master     #13819   diff @@
==========================================
  Files           140        140          
  Lines         50449      50449          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43008      43008          
  Misses         7441       7441          
  Partials          0          0          

Powered by Codecov. Last update 2c55f28...bddf799

@jreback
Copy link
Contributor

jreback commented Jul 27, 2016

pls add some tests

@ivannz
Copy link
Contributor Author

ivannz commented Jul 27, 2016

Ok, I will make a concise test and add it to test_groupby.py.

@chris-b1
Copy link
Contributor

chris-b1 commented Jul 27, 2016

Note that the file you're editing is from a template, you will need to make the change here:
https://github.com/pydata/pandas/blob/master/pandas/src/algos_groupby_helper.pxi.in

I think your second approach is correct (or at least consistent) - in fact the current impl seems to handle it?

In [1]: pd.__version__
Out[1]: u'0.18.1'

In [2]: df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
   ...:                    for i in range(17)], dtype=float,
   ...:                   columns=["B", "C", "F"], index=None)

In [3]: df.groupby(['B', 'C']).shift(-1)
Out[3]:
       F
0    NaN
1   13.0
2   14.0
3    NaN
4   16.0
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN

edit: ok, actually it seem to be dumb luck the current version (which I had written) gives this answer - it has bounds issues, but just sometimes doesn't crash. But, it is also what shift() would have done before this bug was introduced.

In [7]: pd.__version__
Out[7]: '0.16.1'

In [8]: df = pd.DataFrame([(i%12, i%3 if i%3 else float("nan"), i)
   ...:                    for i in range(17)], dtype=float,
   ...:                   columns=["B", "C", "F"], index=None)

In [9]: df.groupby(['B', 'C']).shift(-1)
Out[9]: 
     F
0  NaN
1   13
2   14
3  NaN
4   16
5  NaN
6  NaN
7  NaN
8  NaN
9  NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN

@@ -1356,6 +1356,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
## reverse iterator if shifting backwards
ii = offset + sign * i
lab = labels[ii]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not modify pxi directly. It is generated from corresponding pxi.in.

# values in column `B`, and then group by [`A`, `B`]. This should
# force `-1` in `labels` array of `gr_.grouper.group_info` exactly
# at those places, where the group-by key is partilly missing.
df = pd.DataFrame([(i % 12, i % 3 if i % 3 else float("nan"), i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use np.nan

@ivannz
Copy link
Contributor Author

ivannz commented Jul 29, 2016

Can somebody tell me, please, what my actions should be when the Travis CI job #21159 of this PR fails, whilst the build job for the same commit on my forked (and rebased) repository succeeded.

@jorisvandenbossche
Copy link
Member

@ivannz Something went wrong with that build, but probably not your fault. I restarted the build

@ivannz
Copy link
Contributor Author

ivannz commented Jul 29, 2016

Great! Thank you, @jorisvandenbossche .

@jreback jreback added this to the 0.19.0 milestone Jul 29, 2016
@jreback jreback closed this in 54b2777 Jul 29, 2016
@jreback
Copy link
Contributor

jreback commented Jul 29, 2016

thanks @ivannz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby and shift causing coredump
6 participants