BUG: Fix segfault in GroupBy.count and DataFrame.count #32842

Merged
merged 14 commits into pandas-dev:master on Apr 4, 2020

Conversation

tv3141
Contributor

@tv3141 tv3141 commented Mar 19, 2020

@pep8speaks

pep8speaks commented Mar 19, 2020

Hello @tv3141! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-29 16:22:27 UTC

@tv3141
Contributor Author

tv3141 commented Mar 19, 2020

So far, this PR does not fix the related #21824.

One option would be to mask rows labelled -1 in the calling code as well. But I think it would be better to prevent segfaults in count_level_2d itself, especially if bounds checking is to be disabled for performance.
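For context, here is a minimal illustration (not the PR's code) of where the -1 labels come from: pandas encodes group keys as integer codes, and missing keys get the sentinel code -1. With `@cython.boundscheck(False)` and wraparound disabled, using -1 as a row index into the counts array is an out-of-bounds write in C rather than a Python-style "last row" access.

```python
import numpy as np
import pandas as pd

# factorize is how pandas turns group keys into integer codes;
# a NaN key is encoded as the sentinel code -1
codes, uniques = pd.factorize(np.array([1.0, np.nan, 2.0]))
print(codes.tolist())    # [0, -1, 1] -- the NaN maps to -1
print(uniques.tolist())  # [1.0, 2.0]
```

In pure Python the -1 would silently wrap to the last row; in unchecked Cython it writes outside the array, hence the random segfaults and bus errors.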

Member

@WillAyd WillAyd left a comment

Thanks for taking a look; on first pass I think this code is essentially the same thing though?

Do you know which array access / index is actually causing the OOB error?

@@ -775,7 +775,7 @@ def get_level_sorter(const int64_t[:] label, const int64_t[:] starts):
return out


-@cython.boundscheck(False)
+@cython.boundscheck(True)
Member

Not sure if you disabled this temporarily for debugging purposes, but we want to revert this; keeping bounds checking enabled has performance implications.

Contributor Author

I disabled this to get a reproducible error and traceback, rather than random segfaults/bus errors.
On the other hand, it is still very easy to trigger out-of-bounds access in count_level_2d; #21824 still does, and this PR does not fix that issue.

Ideally, the negative indices should be handled in count_level_2d itself, so that disabling bounds checking does not lead to random segfaults.

@WillAyd WillAyd added Groupby Segfault Non-Recoverable Error labels Mar 19, 2020
@tv3141
Contributor Author

tv3141 commented Mar 20, 2020

Thanks for taking a look; on first pass I think this code is essentially the same thing though?

Do you know which array access / index is actually causing the OOB error?

labels contains -1s when the groupby column contains NaNs.

The calling function count() creates a mask for the -1s, and count_level_2d used to ignore the masked values. This changed with d968aab#diff-8fa7422077eecd07a4006c11c78fc93aL1265-L1268; since then, out-of-bounds access on counts can happen.

Contributor

@jreback jreback left a comment

looks fine. merge master to fix the doc build failures, and address the comment. ping on green.

@@ -7820,13 +7820,21 @@ def _count_level(self, level, axis=0, numeric_only=False):
f"Can only count levels on hierarchical {self._get_axis_name(axis)}."
)

+# Mask NaNs: Mask rows where the index level is NaN and all values in
+# the DataFrame that are NaN
Contributor

you can remove this if/else as they are the same

Contributor

@jreback jreback left a comment

actually need a whatsnew note, bug fixes for groupby.

@jreback jreback changed the title Fix count segfault Fix groupby.count segfault Mar 21, 2020
@tv3141 tv3141 force-pushed the fix_count_segfault branch from 27b1ca5 to 80ae327 Compare March 23, 2020 22:39
@tv3141 tv3141 changed the title Fix groupby.count segfault BUG: Fix segfault in GroupBy.count and DataFrame.count Mar 23, 2020
@tv3141 tv3141 force-pushed the fix_count_segfault branch from 80ae327 to a660d0f Compare March 24, 2020 19:10
-return result
+result = DataFrame(counts, index=level_index, columns=agg_axis)
+
+return result
Contributor Author

I amended the masking logic in this function to mask all rows/columns that lib.count_level_2d requires to be masked to avoid out-of-bounds access.
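An illustrative sketch of that masking idea (hypothetical example data, not the PR's _count_level code): a NaN in an index level shows up as code -1, so those rows have to be folded into the mask before the counting loop ever sees them.

```python
import numpy as np
import pandas as pd

# a NaN in a MultiIndex level is stored as the level code -1
idx = pd.MultiIndex.from_arrays([["a", np.nan, "b"]], names=["key"])
frame = pd.DataFrame({"x": [1.0, 2.0, np.nan]}, index=idx)

level_codes = np.asarray(idx.codes[0])       # [0, -1, 1]
# combine the NaN-value mask with a mask for NaN index-level rows
mask = pd.notna(frame).values & (level_codes != -1)[:, None]
print(mask.ravel().tolist())  # [True, False, False]
```

Only the first row survives: the second is dropped for its NaN key, the third for its NaN value, so nothing with a -1 label reaches the unchecked loop.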

@tv3141
Contributor Author

tv3141 commented Mar 26, 2020

ping @jreback

Contributor

@jreback jreback left a comment

lgtm. @jbrockmendel if any comments.

@WillAyd
Member

WillAyd commented Mar 27, 2020

@tv3141 can you fix merge conflict in whatsnew? Otherwise lgtm as well

@WillAyd WillAyd added this to the 1.1 milestone Mar 27, 2020
-mask = notna(frame.values)
+# Mask NaNs: Mask rows or columns where the index level is NaN, and all
+# values in the DataFrame that are NaN
+values_mask = notna(frame.values)
Member

is the existing version with notna(frame).values wrong? avoiding .values with is_mixed_type is worthwhile

Contributor Author

@jbrockmendel I understood @jreback's earlier comment to mean that notna(frame.values) and notna(frame).values are the same.

Member

those two are in fact the same, but performance should be better for notna(frame).values

Contributor Author

Running some tests shows that notna(df).values is indeed much faster on larger mixed-type DataFrames. I will restore the original code.
I guess this optimisation would be better placed in notna itself.

In [1]: import pandas as pd
In [2]: from pandas.core.dtypes.missing import notna
In [3]: df = pd.DataFrame({'a': range(10), 'b': ['foo'] * 10})
In [4]: %timeit notna(df.values)
113 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit notna(df).values
546 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: df = pd.DataFrame({'a': range(1000000), 'b': ['foo'] * 1000000})
In [7]: %timeit notna(df.values)
163 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit notna(df).values
40.7 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit notna(df.values)
158 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit notna(df).values
39.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@WillAyd
Member

WillAyd commented Apr 3, 2020

@jbrockmendel any comments outstanding on this?

@jreback
Contributor

jreback commented Apr 3, 2020

this looks good. i think ok to merge.

@jbrockmendel
Member

no objections here

@WillAyd WillAyd merged commit d88b90d into pandas-dev:master Apr 4, 2020
@WillAyd
Member

WillAyd commented Apr 4, 2020

Great thanks @tv3141

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Apr 6, 2020
Labels
Groupby Segfault Non-Recoverable Error
Successfully merging this pull request may close these issues.

SegmentationFault/BusError: Groupby + count Segfault on clean-up with count example from docstrings
5 participants