ERR: Segfault with df.astype(category) on empty dataframe #18004
Thanks @jcrist. Yep, this should raise
I can confirm this on master with some slightly simpler code:

pd.DataFrame(columns=['x', 'y']).astype('category')

I get the segfault when 2+ columns are specified in the example above. If only one column is specified I get a
Looks like this boils down to

In the case of this specific issue, what ultimately is happening is something like this:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])
In [4]: pd.factorize(arr)
Segmentation fault (core dumped)

This isn't specific to factorize though, and seems to impact functions that rely on hash tables, e.g.

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])
In [4]: pd.unique(arr)
Segmentation fault (core dumped)

Note that this isn't segfaulting for integer or float dtypes:

In [3]: arr = np.array([np.array([], dtype='int64'), np.array([], dtype='int64')])
In [4]: pd.factorize(arr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-65d36072b155> in <module>()
----> 1 pd.factorize(arr)
/usr/local/lib/python3.4/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
558 uniques = vec_klass()
559 check_nulls = not is_integer_dtype(original)
--> 560 labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
561
562 labels = _ensure_platform_int(labels)
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_labels (pandas/_libs/hashtable.c:15265)()
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Following the error above, it looks like there's a template approach for int/float dtype hashtables but
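The int64 case above fails cleanly because the typed hash table rejects a 2-D buffer. A minimal sketch of the same kind of check for object arrays, using only NumPy; `require_1d` is a hypothetical helper for illustration, not a pandas function:

```python
import numpy as np


def require_1d(values):
    """Hypothetical guard: reject anything that is not a 1-D array."""
    arr = np.asarray(values)
    if arr.ndim != 1:
        raise ValueError(
            "expected a 1-D array, got {} dimensions".format(arr.ndim)
        )
    return arr


# Stacking two empty object arrays gives a 2-D array of shape (2, 0),
# which is the input that reaches the hash table in this issue.
arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])
print(arr.shape, arr.ndim)

try:
    require_1d(arr)
except ValueError as exc:
    print("rejected:", exc)
```

Raising before the values reach the hash table would turn the crash into an ordinary Python exception, matching the behavior already seen for integer and float dtypes.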
The crash is at inference.pyx#L361, where

Could be due to:

>>> len(np.array([[],[]]))
2
>>> len(np.array([[],[]]).ravel())
0

Possible fix: move

diff --git a/pandas/_libs/src/inference.pyx b/pandas/_libs/src/inference.pyx
index b0a64e1cc..c340e870e 100644
--- a/pandas/_libs/src/inference.pyx
+++ b/pandas/_libs/src/inference.pyx
@@ -349,13 +349,13 @@ def infer_dtype(object value, bint skipna=False):
     if values.dtype != np.object_:
         values = values.astype('O')
 
+    # make contiguous
+    values = values.ravel()
+
     n = len(values)
     if n == 0:
         return 'empty'
 
-    # make contiguous
-    values = values.ravel()
-
     # try to use a valid value
     for i in range(n):
         val = util.get_value_1d(values, i)
Raises |
Thanks @cgohlke! Fix looks good and passes all tests locally for me. |
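To illustrate why the order of the two steps in the diff matters, here is a small NumPy-only sketch. `is_empty_before_ravel` and `is_empty_after_ravel` are hypothetical names that mimic the old and new ordering of the emptiness check; they are not part of pandas.

```python
import numpy as np


def is_empty_before_ravel(values):
    # Old ordering: len() counts rows of the 2-D array, not elements.
    return len(values) == 0


def is_empty_after_ravel(values):
    # New ordering: flatten first, so len() counts actual elements.
    return len(values.ravel()) == 0


values = np.array([[], []], dtype=object)  # shape (2, 0): two rows, no elements

print(values.shape)                   # (2, 0)
print(is_empty_before_ravel(values))  # False
print(is_empty_after_ravel(values))   # True
```

With the old ordering the check sees a length of 2, so the loop that follows runs over a flattened array that has no elements to read.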
Pandas segfaults when calling DataFrame.astype('category') on an empty dataframe. This fails in 0.21.0rc1, 0.20.3, and probably previous versions as well.

For non-empty frames, an error message is raised saying this operation isn't supported yet. Also note that a copy is needed to cause the segfault; without it the error message is still raised.
pd.show_versions: