
Commit 54b2777

ivannz authored and jreback committed
BUG: group_shift_indexer checks for null group keys
closes pandas-dev#13813

Author: Ivan Nazarov <[email protected]>

Closes pandas-dev#13819 from ivannz/issue13813fix and squashes the following commits:

bddf799 [Ivan Nazarov] Switched from float('nan') to np.nan
eab8038 [Ivan Nazarov] Added bugfix description [ci skip]
d92cf3c [Ivan Nazarov] minor flake8 style corrections
94bae0b [Ivan Nazarov] Patched the template, and added a test for '.shift()'
fe2f0ec [Ivan Nazarov] Treat incomplete group keys as distinct when shifting
966d5c6 [Ivan Nazarov] BUG: group_shift_indexer checks for null group keys
1 parent 748787d commit 54b2777

4 files changed, +54 -17 lines changed
doc/source/whatsnew/v0.19.0.txt

+21-17
@@ -266,37 +266,40 @@ New Index methods
 
 Following methods and options are added to ``Index`` to be more consistent with ``Series`` and ``DataFrame``.
 
-- ``Index`` now supports the ``.where()`` function for same shape indexing (:issue:`13170`)
+``Index`` now supports the ``.where()`` function for same shape indexing (:issue:`13170`)
 
-  .. ipython:: python
+.. ipython:: python
 
-     idx = pd.Index(['a', 'b', 'c'])
-     idx.where([True, False, True])
+   idx = pd.Index(['a', 'b', 'c'])
+   idx.where([True, False, True])
 
 
-- ``Index`` now supports ``.dropna`` to exclude missing values (:issue:`6194`)
+``Index`` now supports ``.dropna`` to exclude missing values (:issue:`6194`)
 
-  .. ipython:: python
+.. ipython:: python
 
-     idx = pd.Index([1, 2, np.nan, 4])
-     idx.dropna()
+   idx = pd.Index([1, 2, np.nan, 4])
+   idx.dropna()
 
 For ``MultiIndex``, values are dropped if any level is missing by default. Specifying
 ``how='all'`` only drops values where all levels are missing.
 
-     midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4],
+.. ipython:: python
+
+   midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4],
                                      [1, 2, np.nan, np.nan]])
-     midx
-     midx.dropna()
-     midx.dropna(how='all')
+   midx
+   midx.dropna()
+   midx.dropna(how='all')
 
-- ``Index.astype()`` now accepts an optional boolean argument ``copy``, which allows optional copying if the requirements on dtype are satisfied (:issue:`13209`)
-- ``Index`` now supports ``.str.extractall()`` which returns a ``DataFrame``, see :ref:`docs here <text.extractall>` (:issue:`10008`, :issue:`13156`)
+``Index`` now supports ``.str.extractall()`` which returns a ``DataFrame``, see :ref:`docs here <text.extractall>` (:issue:`10008`, :issue:`13156`)
 
-  .. ipython:: python
+.. ipython:: python
+
+   idx = pd.Index(["a1a2", "b1", "c1"])
+   idx.str.extractall("[ab](?P<digit>\d)")
 
-     idx = pd.Index(["a1a2", "b1", "c1"])
-     idx.str.extractall("[ab](?P<digit>\d)")
+``Index.astype()`` now accepts an optional boolean argument ``copy``, which allows optional copying if the requirements on dtype are satisfied (:issue:`13209`)
 
 .. _whatsnew_0190.enhancements.other:
 

@@ -736,6 +739,7 @@ Performance Improvements
 Bug Fixes
 ~~~~~~~~~
 
+- Bug in ``groupby().shift()``, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (:issue:`13813`)
 - Bug in ``pd.read_csv()``, which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (:issue:`13703`)
 - Bug in ``io.json.json_normalize()``, where non-ascii keys raised an exception (:issue:`13213`)
 - Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing may raise ``IndexError`` (:issue:`13144`)

pandas/src/algos_groupby_helper.pxi

+6
@@ -1356,6 +1356,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
                 ## reverse iterator if shifting backwards
                 ii = offset + sign * i
                 lab = labels[ii]
+
+                # Skip null keys
+                if lab == -1:
+                    out[ii] = -1
+                    continue
+
                 label_seen[lab] += 1
 
                 idxer_slot = label_seen[lab] % periods

pandas/src/algos_groupby_helper.pxi.in

+6
@@ -700,6 +700,12 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
                 ## reverse iterator if shifting backwards
                 ii = offset + sign * i
                 lab = labels[ii]
+
+                # Skip null keys
+                if lab == -1:
+                    out[ii] = -1
+                    continue
+
                 label_seen[lab] += 1
 
                 idxer_slot = label_seen[lab] % periods
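
To trace the patched loop without building the Cython template, here is a rough pure-Python sketch of what `group_shift_indexer` appears to do. Only the lines visible in the hunk (the reversed iteration, the `lab == -1` guard, `label_seen`, and `idxer_slot`) come from the commit. The setup of `offset` and `sign`, the `label_indexer` ring buffer, and the final write to `out` are assumptions reconstructed for illustration and may differ from the real helper. Note that in plain Python a label of `-1` would silently index the last group instead of crashing, so the sketch shows the control flow of the fix, not the original memory error.

```python
def group_shift_indexer(labels, ngroups, periods):
    """Illustrative sketch: for each row, return the position of the row that
    sits `periods` steps earlier within the same group, or -1 if none exists.

    `labels[i]` is the group id of row i; -1 marks a null group key.
    """
    n = len(labels)
    if periods == 0:
        return list(range(n))

    # Assumed setup: walk backwards through the rows when shifting backwards.
    if periods < 0:
        periods, offset, sign = -periods, n - 1, -1
    else:
        offset, sign = 0, 1

    out = [-1] * n
    label_seen = [0] * ngroups
    # Per-group ring buffer holding the last `periods` row positions seen.
    label_indexer = [[0] * periods for _ in range(ngroups)]

    for i in range(n):
        ii = offset + sign * i
        lab = labels[ii]

        # Skip null keys (the guard added by this commit).
        if lab == -1:
            out[ii] = -1
            continue

        label_seen[lab] += 1
        idxer_slot = label_seen[lab] % periods
        idxer = label_indexer[lab][idxer_slot]

        # A source row exists only once the group has more than `periods` rows.
        out[ii] = idxer if label_seen[lab] > periods else -1
        label_indexer[lab][idxer_slot] = ii

    return out
```

With `labels = [0, 1, -1, 0, 1, -1, 0]` and two groups, rows 2 and 5 carry a null key, so they are assigned `-1` and never touch the per-group bookkeeping.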

pandas/tests/test_groupby.py

+21
@@ -6560,6 +6560,27 @@ def test_grouping_string_repr(self):
         expected = "Grouping(('A', 'a'))"
         tm.assert_equal(result, expected)
 
+    def test_group_shift_with_null_key(self):
+        # This test is designed to replicate the segfault in issue #13813.
+        n_rows = 1200
+
+        # Generate a moderately large dataframe with occasional missing
+        # values in column `B`, and then group by [`A`, `B`]. This should
+        # force `-1` in `labels` array of `g.grouper.group_info` exactly
+        # at those places, where the group-by key is partially missing.
+        df = DataFrame([(i % 12, i % 3 if i % 3 else np.nan, i)
+                        for i in range(n_rows)], dtype=float,
+                       columns=["A", "B", "Z"], index=None)
+        g = df.groupby(["A", "B"])
+
+        expected = DataFrame([(i + 12 if i % 3 and i < n_rows - 12
+                               else np.nan)
+                              for i in range(n_rows)], dtype=float,
+                             columns=["Z"], index=None)
+        result = g.shift(-1)
+
+        assert_frame_equal(result, expected)
+
 
 def assert_fp_equal(a, b):
     assert (np.abs(a - b) < 1e-12).all()
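
The closed-form `expected` frame in the new test can be checked by hand. Rows `i` and `i + 12` share both `A` (`i % 12`) and `B` (`i % 3`, since 12 is a multiple of 3), and no row in between has the same `A`, so the next row of each group is always 12 positions later. Rows with `i % 3 == 0` have a missing `B` and belong to no group. The following is a small standard-library sketch of that reasoning, not part of the commit. The helper names `key`, `has_null`, `brute_force_shift_minus_one`, and `closed_form` are hypothetical.

```python
import math

N_ROWS = 1200


def key(i):
    """Group key (A, B) of row i, mirroring the test's DataFrame construction."""
    b = i % 3 if i % 3 else math.nan
    return (i % 12, b)


def has_null(k):
    """True if any component of the key is NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in k)


def brute_force_shift_minus_one(n_rows):
    """For each row, find Z (equal to the row number) of the next row in the
    same group, treating rows with a null key as belonging to no group."""
    result = []
    for i in range(n_rows):
        k = key(i)
        nxt = math.nan
        if not has_null(k):
            for j in range(i + 1, n_rows):
                if key(j) == k:
                    nxt = float(j)
                    break
        result.append(nxt)
    return result


def closed_form(n_rows):
    """The expression used for `expected` in the test."""
    return [float(i + 12) if i % 3 and i < n_rows - 12 else math.nan
            for i in range(n_rows)]
```

Comparing the two lists element by element (treating NaN as equal to NaN) should show that they agree for every row.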
