ERR: Segfault with df.astype(category) on empty dataframe #18004

Closed
jcrist opened this issue Oct 27, 2017 · 5 comments · Fixed by #18015

jcrist (Contributor) commented Oct 27, 2017

Pandas segfaults when calling DataFrame.astype('category') on an empty dataframe. This fails in 0.21.0rc1, 0.20.3, and probably previous versions as well.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': ['a', 'b', 'c'], 'z': ['a', 'b', 'c']})

In [3]: empty = df.iloc[:0].copy()  # copy is necessary, no segfault without it

In [4]: empty.astype('category')
Segmentation fault: 11

For non-empty frames, an error is raised saying this operation isn't supported yet. Note that the copy is needed to trigger the segfault; without it, that error message is raised instead.
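
A possible workaround, not from the original report: cast each column individually through Series.astype, which sidesteps the whole-frame astype path. A minimal sketch, assuming Series.astype('category') handles empty columns on the affected versions:

import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': ['a', 'b', 'c']})
empty = df.iloc[:0].copy()

# Cast column-by-column via Series.astype instead of DataFrame.astype
for col in empty.columns:
    empty[col] = empty[col].astype('category')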

Output of pd.show_versions():
In [5]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.1.0
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.5.3
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.1
pandas_gbq: None
pandas_datareader: None
jreback added the Error Reporting and Categorical labels Oct 27, 2017
jreback changed the title from "Segfault with astype(category) on empty dataframe" to "ERR: Segfault with df.astype(category) on empty dataframe" Oct 27, 2017
jreback added this to the 0.21.1 milestone Oct 27, 2017
jreback (Contributor) commented Oct 27, 2017

Thanks @jcrist. Yep, this should raise.

jschendel (Member) commented

I can confirm this on master with some slightly simpler code:

pd.DataFrame(columns=['x', 'y']).astype('category')

I get the segfault when two or more columns are specified in the example above. If only one column is specified, I get a NotImplementedError instead.
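
For contrast, both paths side by side, with the behavior as described above (taken from the report, not re-verified here):

import pandas as pd

pd.DataFrame(columns=['x']).astype('category')       # raises NotImplementedError
pd.DataFrame(columns=['x', 'y']).astype('category')  # segfaults on affected versions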

jschendel (Member) commented Oct 27, 2017

Looks like this boils down to factorize in this specific case, and to the hash table code in general.

For this specific issue, what ultimately happens is something like this:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])

In [4]: pd.factorize(arr)
Segmentation fault (core dumped)
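
To connect that array back to the original report: a sketch of how the empty two-column frame plausibly ends up as a 2-D object array (an assumption about the internals, where a consolidated object block stores one row per column):

import numpy as np
import pandas as pd

empty = pd.DataFrame(columns=['x', 'y'])
empty.values.T.shape  # (2, 0): same shape as arr above
np.array([np.array([], dtype=object), np.array([], dtype=object)]).shape  # (2, 0)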

This isn't specific to factorize, though; it seems to affect any function that relies on hash tables, e.g. unique:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])

In [4]: pd.unique(arr)
Segmentation fault (core dumped)

Note that this isn't segfaulting for integer or float dtypes:

In [3]: arr = np.array([np.array([], dtype='int64'), np.array([], dtype='int64')])

In [4]: pd.factorize(arr)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-65d36072b155> in <module>()
----> 1 pd.factorize(arr)

/usr/local/lib/python3.4/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    558     uniques = vec_klass()
    559     check_nulls = not is_integer_dtype(original)
--> 560     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    561
    562     labels = _ensure_platform_int(labels)

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_labels (pandas/_libs/hashtable.c:15265)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Following the error above, it looks like the int/float dtype hash tables are generated from a template, but StringHashTable and PyObjectHashTable have their own custom code. I'm guessing that code will need to be patched to raise a ValueError similar to the one above? That's probably a bit beyond my current knowledge level.
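
A minimal sketch of the kind of guard that could achieve this, using a hypothetical helper (not pandas API): reject multi-dimensional input before it reaches the hash table, mirroring the ValueError the Int64HashTable path already raises.

import numpy as np

def ensure_1d(values):
    # Hypothetical guard: the object/string hash table paths could call this
    # before building labels, so 2-D input fails loudly instead of segfaulting.
    if values.ndim != 1:
        raise ValueError("Buffer has wrong number of dimensions "
                         "(expected 1, got %d)" % values.ndim)
    return values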

cgohlke (Contributor) commented Oct 27, 2017

The crash is at inference.pyx#L361, where values is an empty array that should not be indexed.

Could be due to:

>>> len(np.array([[],[]]))
2
>>> len(np.array([[],[]]).ravel())
0
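
A pure-Python model of the resulting out-of-bounds read (my reconstruction of the logic around inference.pyx#L361, not the actual Cython code):

import numpy as np

values = np.array([[], []], dtype=object)
n = len(values)          # 2: length of the first axis, taken before ravel
values = values.ravel()  # length 0 after flattening
# The loop then reads element i for i in range(n), i.e. past the end of a
# zero-length buffer; in Python this raises IndexError, in C it can segfault.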

Possible fix: move n = len(values) after values = values.ravel():

diff --git a/pandas/_libs/src/inference.pyx b/pandas/_libs/src/inference.pyx
index b0a64e1cc..c340e870e 100644
--- a/pandas/_libs/src/inference.pyx
+++ b/pandas/_libs/src/inference.pyx
@@ -349,13 +349,13 @@ def infer_dtype(object value, bint skipna=False):
     if values.dtype != np.object_:
         values = values.astype('O')

+    # make contiguous
+    values = values.ravel()
+
     n = len(values)
     if n == 0:
         return 'empty'

-    # make contiguous
-    values = values.ravel()
-
     # try to use a valid value
     for i in range(n):
         val = util.get_value_1d(values, i)

With this change, the example raises ValueError: Buffer has wrong number of dimensions (expected 1, got 2) instead of segfaulting.

jschendel (Member) commented

Thanks @cgohlke! Fix looks good and passes all tests locally for me.
