SegmentationFault/BusError: Groupby + count #32841

Closed
tv3141 opened this issue Mar 19, 2020 · 0 comments · Fixed by #32842

tv3141 commented Mar 19, 2020

Code Sample

import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [0, 1, np.NaN], "B": [1, 2, 3]})
df.groupby(["A"]).count()

Problem description

Running this snippet intermittently crashes the interpreter with a segmentation fault or a bus error.

With bounds checking enabled, I get the following IndexError in count_level_2d:

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L778

@cython.boundscheck(True)
@cython.wraparound(False)
def count_level_2d(ndarray[uint8_t, ndim=2, cast=True] mask,
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pandas/pandas/core/groupby/generic.py", line 1784, in count
    blocks = [make_block(val, placement=loc) for val, loc in zip(counted, locs)]
  File "/pandas/pandas/core/groupby/generic.py", line 1784, in <listcomp>
    blocks = [make_block(val, placement=loc) for val, loc in zip(counted, locs)]
  File "/pandas/pandas/core/groupby/generic.py", line 1782, in <genexpr>
    lib.count_level_2d(x, labels=ids, max_bin=ngroups, axis=1) for x in vals
  File "pandas/_libs/lib.pyx", line 803, in pandas._libs.lib.count_level_2d
    counts[i, labels[j]] += mask[i, j]
IndexError: Out of bounds on buffer access (axis 1)
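
My guess at the root cause (an assumption from reading count_level_2d, not something I have verified): the NaN group key is encoded as the sentinel label -1, and because count_level_2d is compiled with @cython.wraparound(False), that -1 does not wrap around but is written out of bounds. A minimal sketch of the sentinel behaviour:

import numpy as np
import pandas as pd

# NaN keys are encoded as the sentinel code -1; codes like these end up in
# the `labels` argument of count_level_2d, where -1 is an out-of-bounds
# index once wraparound is disabled.
codes, uniques = pd.factorize(np.array([0.0, 1.0, np.nan]))
print(codes)    # [ 0  1 -1]
print(uniques)  # [0. 1.]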

Expected Output

     B
A     
0.0  1
1.0  1

Note

This issue is related to #21824. The segfault happens in the same function, count_level_2d, but the call stack is different.

Code sample from #21824:

import numpy as  np
import pandas as pd

df = pd.DataFrame({"Person": ["John", "Myla", None, "John", "Myla"],
                   "Age": [24., np.nan, 21., 33, 26],
                   "Single": [False, True, True, True, False]})
                  
res = df.set_index(["Person", "Single"]).count(level="Person")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pandas/pandas/core/frame.py", line 7786, in count
    return self._count_level(level, axis=axis, numeric_only=numeric_only)
  File "/pandas/pandas/core/frame.py", line 7842, in _count_level
    counts = lib.count_level_2d(mask, level_codes, len(level_index), axis=0)
  File "pandas/_libs/lib.pyx", line 796, in pandas._libs.lib.count_level_2d
    counts[labels[i], j] += mask[i, j]
IndexError: Out of bounds on buffer access (axis 0)
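
The path through DataFrame.count(level=...) presumably hits the same sentinel: missing entries in a MultiIndex level are stored as code -1 in the level codes that get passed to count_level_2d (again my reading of the code, not verified). A short illustration:

import pandas as pd

# None/NaN entries in an index level are encoded as -1 in the level codes.
idx = pd.MultiIndex.from_arrays(
    [["John", "Myla", None], [False, True, True]],
    names=["Person", "Single"],
)
print(idx.codes[0])  # [ 0  1 -1] -- the None in "Person" becomes -1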

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 981252e7adb74b3aca2d9dfd075b0b64c3552975
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.0.dev0+870.g981252e7a
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200311
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.6.0
sphinx : 2.4.4
blosc : 1.8.3
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : 0.3.3
gcsfs : None
matplotlib : 3.2.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.15.0
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0
