Fix segfault with printing dataframe #47097

Closed
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.4.4.rst
@@ -14,7 +14,7 @@ including other versions of pandas.

Fixed regressions
~~~~~~~~~~~~~~~~~
-
- Fixed regression in taking NULL :class:`objects` from a :class:`DataFrame` causing a segmentation violation. These NULL values are created by :meth:`numpy.empty_like` (:issue:`46848`)
-

.. ---------------------------------------------------------------------------
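For context, a minimal reproduction of the regression fixed above (the whatsnew entry for :issue:`46848`), mirroring the regression test added later in this PR; on affected 1.4.x builds it crashes the interpreter outright instead of raising:

    import numpy as np
    import pandas as pd

    # np.empty_like on an object array leaves the buffer NULL-filled instead of
    # filling it with references to None (see the discussion further down).
    df = pd.DataFrame(np.empty((1, 1), dtype=object))
    df[0] = np.empty_like(df[0])

    # Column selection (and printing the result) goes through the object take
    # helpers patched here and segfaulted before the fix.
    print(df[[0]])
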
14 changes: 14 additions & 0 deletions pandas/_libs/algos_take_helper.pxi.in
@@ -147,8 +147,15 @@ def take_2d_axis0_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
{{if c_type_in == "uint8_t" and c_type_out == "object"}}
out[i, j] = True if values[idx, j] > 0 else False
{{else}}
{{if c_type_in == "object"}} # GH46848
if values[idx][j] is None:
out[i, j] = None
else:
out[i, j] = values[idx, j]
{{else}}
out[i, j] = values[idx, j]
{{endif}}
{{endif}}


@cython.wraparound(False)
@@ -183,8 +190,15 @@ def take_2d_axis1_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
{{if c_type_in == "uint8_t" and c_type_out == "object"}}
out[i, j] = True if values[i, idx] > 0 else False
{{else}}
{{if c_type_in == "object"}} # GH46848
if values[i][idx] is None:
out[i, j] = None
Member

Any idea why this makes a difference? This looks deeply weird.

Member

@jbrockmendel I've posted the arguments that cause take_2d_axis1_object_object to segfault in #46848 (comment) if that helps.
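(The exact crashing arguments are in the linked comment; purely as an illustration, a direct call into the generated helper could look like the sketch below. The pandas._libs.algos import path and the positional (values, indexer, out) signature are assumed from the template in this diff, not quoted from the linked comment.)

    import numpy as np
    from pandas._libs import algos as libalgos

    # A NULL-filled object array, as np.empty_like produces for object dtype.
    values = np.empty_like(np.empty((1, 1), dtype=object))
    indexer = np.arange(1, dtype=np.intp)
    out = np.empty((1, 1), dtype=object)

    # Before this patch, take_2d_axis1_object_object dereferenced the NULL slot
    # while copying it into `out`; with the fix the slot is read back as None.
    libalgos.take_2d_axis1_object_object(values, indexer, out)
    print(out[0, 0] is None)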

Member

perhaps should explicitly check for NULL for clarity. see #46848 (comment)

Member

If we fix here, probably should also patch take_2d_axis0_, see #46848 (comment)

@roberthdevries would changing L159 as done in https://github.com/pandas-dev/pandas/pull/13764/files also work?

Contributor

Hmmmpff, it seems empty_like does not fill the object array with None, but just leaves the slots NULL'ed. NumPy supports this (mostly?): it is perfectly fine if an array is filled with NULLs; that is treated the same as if the array were filled with None.

It is probably best if you also add support for that in pandas. That said, it would also be better to explicitly fill the array with None in NumPy. (And I do think we should define that as the right thing: even if there are branches where NULLs must be expected, a fully initialized array filled with NULLs should never happen!)
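(Illustratively, the explicit fill suggested here is a one-liner on the consuming side; a minimal sketch, with the template array made up for the example:)

    import numpy as np

    template = np.array([[None]], dtype=object)
    arr = np.empty_like(template)
    arr[...] = None  # explicitly fill the object array so no NULL slots remain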

Contributor
@seberg seberg Jun 12, 2022

To prove the point:

In [7]: arr = np.empty(3, dtype=object)

In [8]: arr.tobytes()
Out[8]: b'`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00'

In [9]: np.empty_like(arr).tobytes()
Out[9]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Contributor Author

@seberg How would you effectively detect NULL values in a numpy array? I assume that @simonjayhawkins prefers a fix somewhere in the setitem method of DataFrame, but I am at a loss how to intercept those values and convert them to None.
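(For illustration only: one blunt way to detect such NULL slots from pure Python builds on the tobytes trick shown above. The helper name has_null_objects is made up here, not a pandas or NumPy API, and this is a sketch rather than a recommended fix.)

    import numpy as np

    def has_null_objects(arr: np.ndarray) -> bool:
        # A NULL slot in an object array serializes as an all-zero pointer,
        # while a None slot serializes as the (non-zero) address of Py_None.
        if arr.dtype != object:
            return False
        itemsize = arr.dtype.itemsize  # size of a PyObject * on this platform
        raw = arr.tobytes()
        return any(
            raw[i:i + itemsize] == b"\x00" * itemsize
            for i in range(0, len(raw), itemsize)
        )

    print(has_null_objects(np.empty(3, dtype=object)))                 # False: filled with None
    print(has_null_objects(np.empty_like(np.empty(3, dtype=object))))  # True: NULL-filled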

Contributor

Honestly, it seems to me that Cython may not know about the possibility of NULLs for object-typed buffers (whether via buffer syntax or typed memoryviews). The generated code doesn't look like it does. So I am not sure what a decent way forward would be.

Contributor
@seberg seberg Jun 22, 2022

So, a work-around that is relatively quick may be the following:

    cdef cnp.intp_t[2] index
    index[0] = i
    index[1] = j

    # PyArray_GetPtr returns a pointer to the element slot; for object dtype
    # that slot holds a (possibly NULL) PyObject *, so dereference it first.
    cdef void *data = (<void **>cnp.PyArray_GetPtr(values, index))[0]
    if data == NULL:
        data = <void *>None
    out[i, j] = <object>data

But maybe it's nicer to just drop that weird PyArray_GetPtr:

    # Same idea without PyArray_GetPtr: compute the slot address by hand,
    # then dereference it to get the stored PyObject *.
    cdef void *data = (<void **>(arr.data + x * arr.strides[0] + y * arr.strides[1]))[0]
    if data == NULL:
        data = <void *>None
    return <object>data

which will be as fast as possible without any micro-optimizations (not quite sure on Cython 3, but maybe it doesn't matter).

I am planning on plugging this hole in NumPy (numpy/numpy#21817), but I am not sure if it is better there or in Cython.
(IMO, it is clearly right to fix this either in NumPy or in Cython, although if we fix it in NumPy, pandas still needs the work-around since you can rely on a new Cython, but not a new NumPy.)

else:
out[i, j] = values[i, idx]
{{else}}
out[i, j] = values[i, idx]
{{endif}}
{{endif}}


@cython.wraparound(False)
24 changes: 24 additions & 0 deletions pandas/tests/test_take.py
@@ -6,8 +6,10 @@

from pandas._libs import iNaT

import pandas as pd
import pandas._testing as tm
import pandas.core.algorithms as algos
from pandas.core.array_algos.take import _take_nd_ndarray


@pytest.fixture(params=[True, False])
@@ -332,3 +334,25 @@ def test_take_coerces_list(self):
result = algos.take(arr, [0, 0])
expected = np.array([1, 1])
tm.assert_numpy_array_equal(result, expected)


class TestTakeNumpyNull:
    # GH 46848

    def test_take_2d_axis0_object_object(self):
        arr = np.array([[None]], dtype=object)
        indexer = np.array([0])
        axis = 1
        fill_value = None
        allow_fill = False

        result = _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
        assert result == [[None]]
        arr2 = np.empty_like(arr)
        result2 = _take_nd_ndarray(arr2, indexer, axis, fill_value, allow_fill)
        assert result2 == [[None]]

    def test_take_2d_axis1_object_object(self):
        df = pd.DataFrame(np.empty((1, 1), dtype=object))
        df[0] = np.empty_like(df[0])
        # smoke test: column selection used to segfault on the NULL-filled block
        df[[0]]