BUG: Fix segfault in GroupBy.count and DataFrame.count #32842

Merged
merged 14 commits into pandas-dev:master on Apr 4, 2020

Conversation

tv3141
Contributor

@tv3141 tv3141 commented Mar 19, 2020

@pep8speaks

pep8speaks commented Mar 19, 2020

Hello @tv3141! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-29 16:22:27 UTC

@tv3141
Contributor Author

tv3141 commented Mar 19, 2020

So far, this PR does not fix the related #21824.

One option would be to mask rows labelled -1 in the calling code as well. But I think it would be better to prevent segfaults in count_level_2d itself, especially if bounds checking is to be disabled for performance.
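For context, here is a minimal illustration (not the PR's code) of where the -1 labels come from: pandas encodes group keys as integer codes, and missing keys get the sentinel code -1. With `@cython.boundscheck(False)` and wraparound disabled, using -1 as a row index into the counts array is an out-of-bounds write in C rather than a Python-style "last row" access.

```python
import numpy as np
import pandas as pd

# factorize is how pandas turns group keys into integer codes;
# a NaN key is encoded as the sentinel code -1
codes, uniques = pd.factorize(np.array([1.0, np.nan, 2.0]))
print(codes.tolist())    # [0, -1, 1] -- the NaN maps to -1
print(uniques.tolist())  # [1.0, 2.0]
```

In pure Python the -1 would silently wrap to the last row; in unchecked Cython it writes outside the array, hence the random segfaults and bus errors.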

Member

@WillAyd WillAyd left a comment

Thanks for taking a look; on first pass I think this code is essentially the same thing though?

Do you know which array access / index is actually causing the OOB error?

@@ -775,7 +775,7 @@ def get_level_sorter(const int64_t[:] label, const int64_t[:] starts):
return out


-@cython.boundscheck(False)
+@cython.boundscheck(True)
Member

Not sure if you disabled this temporarily for debugging purposes, but we want to revert this; keeping bounds checking enabled has performance implications.

Contributor Author

I disabled this to get a reproducible error and traceback, rather than random segfaults/bus errors.
On the other hand, it is still very easy to trigger out-of-bounds access in count_level_2d; #21824 still does, and this PR does not fix that issue.

Ideally, the negative indices should be handled in count_level_2d itself, so that disabling bounds checking does not lead to random segfaults.

@WillAyd WillAyd added Groupby Segfault Non-Recoverable Error labels Mar 19, 2020
@tv3141
Contributor Author

tv3141 commented Mar 20, 2020

Thanks for taking a look; on first pass I think this code is essentially the same thing though?

Do you know which array access / index is actually causing the OOB error?

labels contains -1s when the groupby column contains NaNs.

The calling function count() creates a mask for the -1s, and count_level_2d used to ignore the masked values. This changed with d968aab#diff-8fa7422077eecd07a4006c11c78fc93aL1265-L1268; since then, out-of-bounds access on counts can happen.

Contributor

@jreback jreback left a comment

looks fine. merge master to fix the doc build failures, and address the comment. ping on green.

@@ -7820,13 +7820,21 @@ def _count_level(self, level, axis=0, numeric_only=False):
f"Can only count levels on hierarchical {self._get_axis_name(axis)}."
)

+# Mask NaNs: Mask rows where the index level is NaN and all values in
+# the DataFrame that are NaN
Contributor

you can remove this if/else as they are the same

Contributor

@jreback jreback left a comment

actually need a whatsnew note, bug fixes for groupby.

@jreback jreback changed the title Fix count segfault Fix groupby.count segfault Mar 21, 2020
@tv3141 tv3141 force-pushed the fix_count_segfault branch from 27b1ca5 to 80ae327 Compare March 23, 2020 22:39
@tv3141 tv3141 changed the title Fix groupby.count segfault BUG: Fix segfault in GroupBy.count and DataFrame.count Mar 23, 2020
@tv3141 tv3141 force-pushed the fix_count_segfault branch from 80ae327 to a660d0f Compare March 24, 2020 19:10
-return result
+result = DataFrame(counts, index=level_index, columns=agg_axis)
+
+return result
Contributor Author

I amended the masking logic in this function to mask all rows/columns that lib.count_level_2d requires to be masked to avoid out-of-bounds access.
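An illustrative sketch of that masking idea (hypothetical example data, not the PR's _count_level code): a NaN in an index level shows up as code -1, so those rows have to be folded into the mask before the counting loop ever sees them.

```python
import numpy as np
import pandas as pd

# a NaN in a MultiIndex level is stored as the level code -1
idx = pd.MultiIndex.from_arrays([["a", np.nan, "b"]], names=["key"])
frame = pd.DataFrame({"x": [1.0, 2.0, np.nan]}, index=idx)

level_codes = np.asarray(idx.codes[0])       # [0, -1, 1]
# combine the NaN-value mask with a mask for NaN index-level rows
mask = pd.notna(frame).values & (level_codes != -1)[:, None]
print(mask.ravel().tolist())  # [True, False, False]
```

Only the first row survives: the second is dropped for its NaN key, the third for its NaN value, so nothing with a -1 label reaches the unchecked loop.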

@tv3141
Contributor Author

tv3141 commented Mar 26, 2020

ping @jreback

Contributor

@jreback jreback left a comment

lgtm. @jbrockmendel if any comments.

@WillAyd
Member

WillAyd commented Mar 27, 2020

@tv3141 can you fix merge conflict in whatsnew? Otherwise lgtm as well

@WillAyd WillAyd added this to the 1.1 milestone Mar 27, 2020
-mask = notna(frame.values)
+# Mask NaNs: Mask rows or columns where the index level is NaN, and all
+# values in the DataFrame that are NaN
+values_mask = notna(frame.values)
Member

is the existing version with notna(frame).values wrong? avoiding .values with is_mixed_type is worthwhile

Contributor Author

@jbrockmendel I understood @jreback's earlier comment to mean that notna(frame.values) and notna(frame).values are the same.

Member

those two are in fact the same, but performance should be better for notna(frame).values

Contributor Author

Running some tests shows that notna(df).values is indeed much faster on larger mixed-type DataFrames. I will restore the original code.
I guess this optimisation would be better placed in notna itself.

In [1]: import pandas as pd
In [2]: from pandas.core.dtypes.missing import notna
In [3]: df = pd.DataFrame({'a': range(10), 'b': ['foo'] * 10})
In [4]: %timeit notna(df.values)
113 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit notna(df).values
546 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: df = pd.DataFrame({'a': range(1000000), 'b': ['foo'] * 1000000})
In [7]: %timeit notna(df.values)
163 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit notna(df).values
40.7 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit notna(df.values)
158 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit notna(df).values
39.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@WillAyd
Member

WillAyd commented Apr 3, 2020

@jbrockmendel any comments outstanding on this?

@jreback
Contributor

jreback commented Apr 3, 2020

this looks good. i think ok to merge.

@jbrockmendel
Member

no objections here

@WillAyd WillAyd merged commit d88b90d into pandas-dev:master Apr 4, 2020
@WillAyd
Member

WillAyd commented Apr 4, 2020

Great thanks @tv3141

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Apr 6, 2020
Labels
Groupby Segfault Non-Recoverable Error
Successfully merging this pull request may close these issues.

SegmentationFault/BusError: Groupby + count Segfault on clean-up with count example from docstrings
5 participants