Skip to content

BUG: Convert tuple to list before _list_to_arrays when construct DataFrame. #25731

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 5, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -313,6 +313,7 @@ Reshaping
- :func:`to_records` now accepts dtypes to its `column_dtypes` parameter (:issue:`24895`)
- Bug in :func:`concat` where order of ``OrderedDict`` (and ``dict`` in Python 3.6+) is not respected, when passed in as ``objs`` argument (:issue:`21510`)
- Bug in :func:`concat` where the resulting ``freq`` of two :class:`DatetimeIndex` with the same ``freq`` would be dropped (:issue:`3232`).
- Bug in :class:`DataFrame` constructor when passing non-empty tuples would cause a segmentation fault (:issue:`25691`)

Sparse
^^^^^^
Expand Down
28 changes: 22 additions & 6 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -2245,7 +2245,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(rows: object, int min_width=0):
def to_object_array(rows: list, int min_width=0):
"""
Convert a list of lists into an object array.

Expand All @@ -2266,22 +2266,20 @@ def to_object_array(rows: object, int min_width=0):
cdef:
Py_ssize_t i, j, n, k, tmp
ndarray[object, ndim=2] result
list input_rows
list row

input_rows = <list>rows
n = len(input_rows)
n = len(rows)

k = min_width
for i in range(n):
tmp = len(input_rows[i])
tmp = len(rows[i])
if tmp > k:
k = tmp

result = np.empty((n, k), dtype=object)

for i in range(n):
row = list(input_rows[i])
row = list(rows[i])

for j in range(len(row)):
result[i, j] = row[j]
Expand All @@ -2307,6 +2305,24 @@ def tuples_to_object_array(ndarray[object] tuples):


def to_object_array_tuples(rows: list):
"""
Convert a list of tuples into an object array. Any subclass of
tuple in `rows` will be casted to tuple.

Parameters
----------
rows : 2-d array (N, K)
A list of tuples to be converted into an array
min_width : int
The minimum width of the object array. If a tuple
in `rows` contains fewer than `width` elements,
the remaining elements in the corresponding row
will all be `NaN`.

Returns
-------
obj_array : numpy array of the object dtype
"""
cdef:
Py_ssize_t i, j, n, k, tmp
ndarray[object, ndim=2] result
Expand Down
4 changes: 2 additions & 2 deletions pandas/core/internals/construction.py
Original file line number Diff line number Diff line change
Expand Up @@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None):

def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
if len(data) > 0 and isinstance(data[0], tuple):
content = list(lib.to_object_array_tuples(data).T)
content = list(lib.to_object_array_tuples(list(data)).T)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this down even futher into to_object_array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: to_object_array_tuples is used to convert list of tuples, and to_object_array is used to convert list of lists. How could these two case be unified to to_object_array ?

The tuple (or list) needs to be converted to list before send to to_object_array_tuples, otherwise we will see incorrect type error for pd.DataFrame(((),())).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i mean is push the listifying down to these 2 routines, in fact to_object_array already has this, need to add similar in to_object_array_tuples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the segmentation fault happens in #25691 is caused by to_object_array, rather than to_object_array_tuples. The input_rows = <list>rows does nothing than an assignment:

  /* "pandas/_libs/lib.pyx":2272
 *         list row
 * 
 *     input_rows = <list>rows             # <<<<<<<<<<<<<<
 *     n = len(input_rows)
 * 
 */
  __pyx_t_1 = __pyx_v_rows;
  __Pyx_INCREF(__pyx_t_1);
  __pyx_v_input_rows = ((PyObject*)__pyx_t_1);
  __pyx_t_1 = 0;

The rows and input_rows are both PyObject *. The whole story is, when we invoke to_object_array with ([], []), we use list's [](__Pyx_GetItemInt_List_Fast), it results a null pointer. Then we Py_INCREF it, thus the segmentation fault.

I have revised the patch to remove the unused conversion in to_object_array and restrict its argument type to list. Another possible solution should be change the type of argument rows to object from list the both two methods, but I think lists is used more commonly to construct DataFrame than tuples. __Pyx_GetItemInt_List_Fast should be faster than __Pyx_GetItemInt_Generic and the conversion from tuple to list in python side is fairly ok.

The backtrace of the segmentation fault
>>> lib.to_object_array(([], []))

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
55829           Py_INCREF(r);
(gdb) bt
#0  __Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
#1  __pyx_pf_6pandas_5_libs_3lib_102to_object_array (__pyx_self=<optimized out>, __pyx_v_min_width=<optimized out>,
    __pyx_v_rows=([], [])) at pandas/_libs/lib.c:30958
#2  __pyx_pw_6pandas_5_libs_3lib_103to_object_array (__pyx_self=<optimized out>, __pyx_args=<optimized out>,
    __pyx_kwds=<optimized out>) at pandas/_libs/lib.c:30855

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, IIRC this was put in here pretty specifically, can you see why that was? in any event, I believe you can simply type the input arg to to_object_array_tuples no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In to_object_array_tuples, if the element in rows is not a tuple, it will be cast to a tuple.

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L2325-L2337

However I still think it make sense to treat list of lists and list of tuples differently and dispatch them to different methods to avoid the overhead caused by casting list to tuple. I think list of lists may be used more frequently. I have add a comment to to_object_array_tuples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sighingnow I agree with your point, I am not suggesting combing these routines at all, rather inside each routine the data is first cast to a list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

else:
# list of lists
content = list(lib.to_object_array(data).T)
content = list(lib.to_object_array(list(data)).T)
return _convert_object_array(content, columns, dtype=dtype,
coerce_float=coerce_float)

Expand Down
18 changes: 16 additions & 2 deletions pandas/tests/frame/test_constructors.py
Original file line number Diff line number Diff line change
Expand Up @@ -1187,12 +1187,26 @@ def test_constructor_mixed_type_rows(self):
expected = DataFrame([[1, 2], [3, 4]])
tm.assert_frame_equal(result, expected)

def test_constructor_tuples(self):
@pytest.mark.parametrize("tuples,lists", [
((), []),
((()), []),
(((), ()), [(), ()]),
(((), ()), [[], []]),
(([], []), [[], []]),
(([1, 2, 3], [4, 5, 6]), [[1, 2, 3], [4, 5, 6]])
])
def test_constructor_tuple(self, tuples, lists):
# GH 25691
result = DataFrame(tuples)
expected = DataFrame(lists)
tm.assert_frame_equal(result, expected)

def test_constructor_list_of_tuples(self):
result = DataFrame({'A': [(1, 2), (3, 4)]})
expected = DataFrame({'A': Series([(1, 2), (3, 4)])})
tm.assert_frame_equal(result, expected)

def test_constructor_namedtuples(self):
def test_constructor_list_of_namedtuples(self):
# GH11181
from collections import namedtuple
named_tuple = namedtuple("Pandas", list('ab'))
Expand Down