BUG: Fix segfault in GroupBy.count and DataFrame.count #32842
Conversation
So far, this PR does not fix the related #21824. One option would be to mask rows where the index level is NaN.
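For illustration, a rough sketch of that masking idea (assuming the problem rows are those whose index-level value is missing; in a `MultiIndex`, missing level values are encoded as `-1` in the level codes — variable names here are illustrative, not from the PR):

```python
import numpy as np
import pandas as pd

# In a MultiIndex, a NaN in a level is stored as code -1. A Cython
# counting routine that indexes with these codes unchecked can read or
# write out of bounds, so such rows would need to be masked out first.
idx = pd.MultiIndex.from_arrays([[np.nan, "a", "a"], [1, 2, 3]])
codes = np.asarray(idx.codes[0])   # array([-1,  0,  0])
valid_rows = codes >= 0            # rows whose level value is not NaN
print(codes, valid_rows)
```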
Thanks for taking a look; on first pass I think this code is essentially the same thing though?
Do you know which array access / index is actually causing the OOB error?
pandas/_libs/lib.pyx (outdated)

```diff
@@ -775,7 +775,7 @@ def get_level_sorter(const int64_t[:] label, const int64_t[:] starts):
     return out


-@cython.boundscheck(False)
+@cython.boundscheck(True)
```
Not sure if you disabled this temporarily for debugging purposes, but you'll want to revert this; re-enabling bounds checking has performance implications.
I disabled this to get a reproducible error + traceback, rather than random segfaults/bus errors.

On the other hand, it is still very easy to trigger out-of-bounds access in `count_level_2d`: #21824 still does, and this PR does not fix that issue. Ideally, negative indexes should be handled in `count_level_2d` itself, so that disabling bounds checking does not lead to random segfaults.
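To make the failure mode concrete, here is a simplified pure-Python sketch of a `count_level_2d`-style loop with an explicit guard; the real function is Cython with a different signature, so treat this only as an illustration of where the negative-index handling would go:

```python
import numpy as np

def count_level_2d_sketch(mask, labels, max_bin):
    """Illustrative stand-in for lib.count_level_2d (not the real signature).

    mask:   (n, k) boolean array of non-null values
    labels: length-n array mapping each row to an output bin
    """
    n, k = mask.shape
    counts = np.zeros((max_bin, k), dtype=np.int64)
    for i in range(n):
        lab = labels[i]
        # Guard against labels the caller failed to mask. In Python a
        # negative index silently wraps around; in Cython compiled with
        # boundscheck(False) it reads/writes outside the buffer, which
        # is exactly the random-segfault behaviour described above.
        if lab < 0 or lab >= max_bin:
            continue
        for j in range(k):
            counts[lab, j] += mask[i, j]
    return counts
```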
The calling function
looks fine; merge master to fix the docs failures, also add a comment. ping on green.
pandas/core/frame.py (outdated)

```diff
@@ -7820,13 +7820,21 @@ def _count_level(self, level, axis=0, numeric_only=False):
                 f"Can only count levels on hierarchical {self._get_axis_name(axis)}."
             )

+        # Mask NaNs: Mask rows where the index level is NaN and all values in
+        # the DataFrame that are NaN
```
you can remove this if/else as they are the same
actually need a whatsnew note, bug fixes for groupby.
(force-pushed from 27b1ca5 to 80ae327, then from 80ae327 to a660d0f)
```python
            return result
        result = DataFrame(counts, index=level_index, columns=agg_axis)

        return result
```
I amended the masking logic in this function to mask all rows/columns that `lib.count_level_2d` requires to be masked, to avoid out-of-bounds access.
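Roughly, the amended masking combines the per-value NaN mask with a per-row mask built from the index-level codes, so no negative label ever reaches `count_level_2d` (a sketch under those assumptions, not the exact code from `_count_level`):

```python
import numpy as np
from pandas.core.dtypes.missing import notna

def build_count_mask(frame, level=0):
    # Boolean (n, k) mask of non-null values in the DataFrame.
    values_mask = notna(frame).values
    # In a MultiIndex, code -1 marks a NaN in the level; zero out those
    # rows so the counting routine never indexes with a negative label.
    codes = np.asarray(frame.index.codes[level])
    return values_mask & (codes >= 0)[:, None]
```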
ping @jreback
lgtm. @jbrockmendel if any comments.
@tv3141 can you fix the merge conflict in whatsnew? Otherwise lgtm as well
pandas/core/frame.py (outdated)

```diff
-        mask = notna(frame.values)
+        # Mask NaNs: Mask rows or columns where the index level is NaN, and all
+        # values in the DataFrame that are NaN
+        values_mask = notna(frame.values)
```
Is the existing version with `notna(frame).values` wrong? Avoiding `.values` with `is_mixed_type` is worthwhile.
@jbrockmendel I understood @jreback's earlier comment to mean that `notna(frame.values)` and `notna(frame).values` are the same.
Those two are in fact the same, but performance should be better for `notna(frame).values`.
Running some tests shows that `notna(df).values` is indeed much faster on larger mixed-type DataFrames. I will restore the original code. I guess this optimisation would be better placed in `notna` itself.
```python
In [1]: import pandas as pd

In [2]: from pandas.core.dtypes.missing import notna

In [3]: df = pd.DataFrame({'a': range(10), 'b': ['foo'] * 10})

In [4]: %timeit notna(df.values)
113 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit notna(df).values
546 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: df = pd.DataFrame({'a': range(1000000), 'b': ['foo'] * 1000000})

In [7]: %timeit notna(df.values)
163 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit notna(df).values
40.7 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit notna(df.values)
158 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit notna(df).values
39.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
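The likely mechanism (an inference from the timings above, not something verified in the pandas internals here): for a mixed-dtype frame, `df.values` materializes a single object-dtype array, forcing element-wise null checks, whereas `notna(df)` can work block-by-block on the native dtypes:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': ['foo'] * 10})
print(df.values.dtype)   # object -- the int column is upcast by .values
print(list(df.dtypes))   # [int64, object] -- notna(df) sees these directly
```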
@jbrockmendel any comments outstanding on this?
this looks good. i think ok to merge.
no objections here
Great, thanks @tv3141