ERR: Segfault with df.astype(category) on empty dataframe #18004

Closed
jcrist opened this issue Oct 27, 2017 · 5 comments · Fixed by #18015

jcrist (Contributor) commented Oct 27, 2017

Pandas segfaults when calling DataFrame.astype('category') on an empty dataframe. This fails in 0.21.0rc1, 0.20.3, and probably previous versions as well.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': ['a', 'b', 'c'], 'z': ['a', 'b', 'c']})

In [3]: empty = df.iloc[:0].copy()  # copy is necessary, no segfault without it

In [4]: empty.astype('category')
Segmentation fault: 11

For non-empty frames, an error is raised saying this operation isn't supported yet. Note that the copy is needed to trigger the segfault; without it, that error message is raised instead.
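
A possible workaround, not from the original report: cast each column individually through Series.astype, which sidesteps the whole-frame astype path. A minimal sketch, assuming Series.astype('category') handles empty columns on the affected versions:

import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': ['a', 'b', 'c']})
empty = df.iloc[:0].copy()

# Cast column-by-column via Series.astype instead of DataFrame.astype
for col in empty.columns:
    empty[col] = empty[col].astype('category')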

Output of pd.show_versions():
In [5]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.1.0
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.5.3
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.1
pandas_gbq: None
pandas_datareader: None
jreback added the Error Reporting and Categorical labels Oct 27, 2017
jreback changed the title from "Segfault with astype(category) on empty dataframe" to "ERR: Segfault with df.astype(category) on empty dataframe" Oct 27, 2017
jreback added this to the 0.21.1 milestone Oct 27, 2017
jreback (Contributor) commented Oct 27, 2017

Thanks @jcrist. Yep, this should raise.

jschendel (Member) commented

I can confirm this on master with some slightly simpler code:

pd.DataFrame(columns=['x', 'y']).astype('category')

I get the segfault when two or more columns are specified in the example above. If only one column is specified, I get a NotImplementedError instead.
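
For contrast, both paths side by side, with the behavior as described above (taken from the report, not re-verified here):

import pandas as pd

pd.DataFrame(columns=['x']).astype('category')       # raises NotImplementedError
pd.DataFrame(columns=['x', 'y']).astype('category')  # segfaults on affected versions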

jschendel (Member) commented Oct 27, 2017

Looks like this boils down to factorize in this specific case, and to the hash table code in general.

For this specific issue, what ultimately happens is something like this:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])

In [4]: pd.factorize(arr)
Segmentation fault (core dumped)
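
To connect that array back to the original report: a sketch of how the empty two-column frame plausibly ends up as a 2-D object array (an assumption about the internals, where a consolidated object block stores one row per column):

import numpy as np
import pandas as pd

empty = pd.DataFrame(columns=['x', 'y'])
empty.values.T.shape  # (2, 0): same shape as arr above
np.array([np.array([], dtype=object), np.array([], dtype=object)]).shape  # (2, 0)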

This isn't specific to factorize, though; it seems to affect any function that relies on hash tables, e.g. unique:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])

In [4]: pd.unique(arr)
Segmentation fault (core dumped)

Note that this isn't segfaulting for integer or float dtypes:

In [3]: arr = np.array([np.array([], dtype='int64'), np.array([], dtype='int64')])

In [4]: pd.factorize(arr)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-65d36072b155> in <module>()
----> 1 pd.factorize(arr)

/usr/local/lib/python3.4/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    558     uniques = vec_klass()
    559     check_nulls = not is_integer_dtype(original)
--> 560     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    561
    562     labels = _ensure_platform_int(labels)

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_labels (pandas/_libs/hashtable.c:15265)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Following the error above, it looks like the int/float dtype hash tables are generated from a template, but StringHashTable and PyObjectHashTable have their own custom code. I'm guessing that code will need to be patched to raise a ValueError similar to the one above? That's probably a bit beyond my current knowledge level.
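
A minimal sketch of the kind of guard that could achieve this, using a hypothetical helper (not pandas API): reject multi-dimensional input before it reaches the hash table, mirroring the ValueError the Int64HashTable path already raises.

import numpy as np

def ensure_1d(values):
    # Hypothetical guard: the object/string hash table paths could call this
    # before building labels, so 2-D input fails loudly instead of segfaulting.
    if values.ndim != 1:
        raise ValueError("Buffer has wrong number of dimensions "
                         "(expected 1, got %d)" % values.ndim)
    return values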

cgohlke (Contributor) commented Oct 27, 2017

The crash is at inference.pyx#L361, where values is an empty array that should not be indexed.

Could be due to:

>>> len(np.array([[],[]]))
2
>>> len(np.array([[],[]]).ravel())
0
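
A pure-Python model of the resulting out-of-bounds read (my reconstruction of the logic around inference.pyx#L361, not the actual Cython code):

import numpy as np

values = np.array([[], []], dtype=object)
n = len(values)          # 2: length of the first axis, taken before ravel
values = values.ravel()  # length 0 after flattening
# The loop then reads element i for i in range(n), i.e. past the end of a
# zero-length buffer; in Python this raises IndexError, in C it can segfault.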

Possible fix: move n = len(values) after values = values.ravel():

diff --git a/pandas/_libs/src/inference.pyx b/pandas/_libs/src/inference.pyx
index b0a64e1cc..c340e870e 100644
--- a/pandas/_libs/src/inference.pyx
+++ b/pandas/_libs/src/inference.pyx
@@ -349,13 +349,13 @@ def infer_dtype(object value, bint skipna=False):
     if values.dtype != np.object_:
         values = values.astype('O')

+    # make contiguous
+    values = values.ravel()
+
     n = len(values)
     if n == 0:
         return 'empty'

-    # make contiguous
-    values = values.ravel()
-
     # try to use a valid value
     for i in range(n):
         val = util.get_value_1d(values, i)

With this change, the example raises ValueError: Buffer has wrong number of dimensions (expected 1, got 2) instead of segfaulting.

jschendel (Member) commented

Thanks @cgohlke! Fix looks good and passes all tests locally for me.
