Skip to content

BUG: group_shift_indexer checks for null group keys #13819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.19.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -735,6 +735,7 @@ Performance Improvements
Bug Fixes
~~~~~~~~~

- Bug in ``groupby().shift()``, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (:issue:`13813`)
- Bug in ``pd.read_csv()``, which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (:issue:`13703`)
- Bug in ``io.json.json_normalize()``, where non-ascii keys raised an exception (:issue:`13213`)
- Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing may raise ``IndexError`` (:issue:`13144`)
Expand Down
6 changes: 6 additions & 0 deletions pandas/src/algos_groupby_helper.pxi
Original file line number Diff line number Diff line change
Expand Up @@ -1356,6 +1356,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
## reverse iterator if shifting backwards
ii = offset + sign * i
lab = labels[ii]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not modify pxi directly. It is generated from corresponding pxi.in.

# Skip null keys
if lab == -1:
out[ii] = -1
continue

label_seen[lab] += 1

idxer_slot = label_seen[lab] % periods
Expand Down
6 changes: 6 additions & 0 deletions pandas/src/algos_groupby_helper.pxi.in
Original file line number Diff line number Diff line change
Expand Up @@ -700,6 +700,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
## reverse iterator if shifting backwards
ii = offset + sign * i
lab = labels[ii]

# Skip null keys
if lab == -1:
out[ii] = -1
continue

label_seen[lab] += 1

idxer_slot = label_seen[lab] % periods
Expand Down
25 changes: 25 additions & 0 deletions pandas/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -6560,6 +6560,31 @@ def test_grouping_string_repr(self):
expected = "Grouping(('A', 'a'))"
tm.assert_equal(result, expected)

def test_group_shift_with_null_key(self):
# This test is designed to replicate the segfault in issue #13813.
n_rows = 1200

# Generate a moderately large dataframe with occasional missing
# values in column `B`, and then group by [`A`, `B`]. This should
# force `-1` in `labels` array of `gr_.grouper.group_info` exactly
# at those places, where the group-by key is partilly missing.
df = pd.DataFrame([(i % 12, i % 3 if i % 3 else np.nan, i)
for i in range(n_rows)], dtype=float,
columns=["A", "B", "Z"], index=None)
gr_ = df.groupby(["A", "B"])

# Generate teh expected dataframe
expected = pd.DataFrame([(i % 12, i % 3 if i % 3 else np.nan,
i + 12 if i % 3 and i < n_rows - 12
else np.nan)
for i in range(n_rows)], dtype=float,
columns=["A", "B", "Z"], index=None)
result = gr_.shift(-1)

# Check for data grabbed from beyond the acceptable array bounds
# in case there was no segfault.
tm.assert_frame_equal(result, expected[["Z"]])


def assert_fp_equal(a, b):
assert (np.abs(a - b) < 1e-12).all()
Expand Down