BUG: Convert tuple to list before `_list_to_arrays` when construct DataFrame. #25731

sighingnow · 2019-03-14T15:01:16Z

_list_to_arrays calls either to_object_array_tuples or to_object_array, and both require their argument to be a list (a list of tuples or a list of lists).
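
As a rough illustration of the idea (not the pandas implementation), the sketch below normalises the outer container to a list before doing any row-wise work. The function name rows_to_columns and the None padding are assumptions made for this example only.

```python
def rows_to_columns(data):
    """Hypothetical pure-Python stand-in for a rows-to-columns helper."""
    # Normalise the outer container first, so a tuple of rows is handled
    # exactly like a list of rows.
    rows = list(data)
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows with None (an illustrative choice), then transpose.
    padded = [list(row) + [None] * (width - len(row)) for row in rows]
    return [list(column) for column in zip(*padded)]


print(rows_to_columns(((1, 2), (3, 4))))  # [[1, 3], [2, 4]]
print(rows_to_columns(([], [])))          # []
```

A tuple of empty lists, the shape of input behind the reported crash, simply yields no columns in this sketch.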

- closes #25691 (BUG: Segmentation fault using tuple as iterator for DataFrame constructor)
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

Signed-off-by: HE, Tao <[email protected]>

jreback · 2019-03-14T15:24:13Z

pandas/core/internals/construction.py

@@ -394,7 +394,7 @@ def to_arrays(data, columns, coerce_float=False, dtype=None):
                return [[]] * len(columns), columns
        return [], []  # columns if columns is not None else []
    if isinstance(data[0], (list, tuple)):
-        return _list_to_arrays(data, columns, coerce_float=coerce_float,
+        return _list_to_arrays(list(data), columns, coerce_float=coerce_float,


better to put this in _list_to_arrays

jreback · 2019-03-14T15:24:24Z

doc/source/whatsnew/v0.25.0.rst

@@ -124,7 +124,7 @@ Bug Fixes
 ~~~~~~~~~
 - Bug in :func:`to_datetime` which would raise an (incorrect) ``ValueError`` when called with a date far into the future and the ``format`` argument specified instead of raising ``OutOfBoundsDatetime`` (:issue:`23830`)
 - Bug in an error message in :meth:`DataFrame.plot`. Improved the error message if non-numerics are passed to :meth:`DataFrame.plot` (:issue:`25481`)
-
+- Segmentation fault when construct :class:`DataFrame` from non-empty tuples (:issue:`25691`)


move this to Bug Fixes reshaping section

codecov · 2019-03-14T15:46:16Z

Codecov Report

Merging #25731 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25731      +/-   ##
==========================================
+ Coverage   91.24%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52973    52973              
==========================================
+ Hits        48337    48338       +1     
+ Misses       4636     4635       -1

Flag	Coverage Δ
#multiple	`89.82% <100%> (ø)`	⬆️
#single	`41.74% <100%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/internals/construction.py	`95.9% <100%> (ø)`	⬆️
pandas/util/testing.py	`89.08% <0%> (+0.09%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a5d251d...2f00c7f.

codecov · 2019-03-14T15:46:19Z

Codecov Report

Merging #25731 into master will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25731      +/-   ##
==========================================
+ Coverage   91.77%   91.81%   +0.03%     
==========================================
  Files         175      175              
  Lines       52606    52580      -26     
==========================================
- Hits        48280    48274       -6     
+ Misses       4326     4306      -20

Flag	Coverage Δ
#multiple	`90.36% <ø> (+0.04%)`	⬆️
#single	`41.9% <ø> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/config_init.py	`96.96% <0%> (-2.24%)`	⬇️
pandas/core/computation/common.py	`89.47% <0%> (-0.53%)`	⬇️
pandas/compat/pickle_compat.py	`69.13% <0%> (-0.38%)`	⬇️
pandas/core/groupby/grouper.py	`98.16% <0%> (-0.37%)`	⬇️
pandas/compat/numpy/__init__.py	`92.85% <0%> (-0.25%)`	⬇️
pandas/plotting/_style.py	`77.17% <0%> (-0.25%)`	⬇️
pandas/core/computation/engines.py	`88.52% <0%> (-0.19%)`	⬇️
pandas/plotting/_timeseries.py	`65.28% <0%> (-0.18%)`	⬇️
pandas/io/excel/_util.py	`87.5% <0%> (-0.18%)`	⬇️
... and 54 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7721f70...4e41de3.

* master: Fixturize tests/frame/test_operators.py (pandas-dev#25641) Update ValueError message in corr (pandas-dev#25729) # Conflicts: # doc/source/whatsnew/v0.25.0.rst

Signed-off-by: HE, Tao <[email protected]>

jreback · 2019-03-14T17:09:22Z

pandas/core/internals/construction.py

@@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None):

 def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
    if len(data) > 0 and isinstance(data[0], tuple):
-        content = list(lib.to_object_array_tuples(data).T)
+        content = list(lib.to_object_array_tuples(list(data)).T)


can you move this down even further into to_object_array

Just to clarify: to_object_array_tuples is used to convert a list of tuples, and to_object_array is used to convert a list of lists. How could these two cases be unified into to_object_array?

The tuple (or list) needs to be converted to a list before it is sent to to_object_array_tuples; otherwise we will see an incorrect type error for pd.DataFrame(((),())).

what i mean is push the listifying down to these 2 routines, in fact to_object_array already has this, need to add similar in to_object_array_tuples

In fact, the segmentation fault in #25691 is caused by to_object_array rather than to_object_array_tuples. The statement input_rows = <list>rows does nothing more than an assignment:

    /* "pandas/_libs/lib.pyx":2272
     *     list row
     *
     *     input_rows = <list>rows             # <<<<<<<<<<<<<<
     *     n = len(input_rows)
     *
     */
    __pyx_t_1 = __pyx_v_rows;
    __Pyx_INCREF(__pyx_t_1);
    __pyx_v_input_rows = ((PyObject*)__pyx_t_1);
    __pyx_t_1 = 0;

Both rows and input_rows are PyObject *. The whole story: when to_object_array is invoked with ([], []), the generated code indexes it with the list-specific accessor (__Pyx_GetItemInt_List_Fast), which yields a null pointer here. Py_INCREF is then applied to that pointer, hence the segmentation fault.

I have revised the patch to remove the unused conversion in to_object_array and restrict its argument type to list. Another possible solution would be to change the type of the rows argument from list to object in both methods, but I think lists are used more commonly than tuples to construct a DataFrame. __Pyx_GetItemInt_List_Fast should be faster than __Pyx_GetItemInt_Generic, and converting a tuple to a list on the Python side is cheap enough.
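
The two options weighed above, rejecting non-list input versus converting it on entry, can be sketched in plain Python. Both helper names are hypothetical, and this only mirrors the trade-off; it is not the Cython code in lib.pyx.

```python
def require_list(rows):
    """Strict option: accept only a list and fail loudly otherwise."""
    if not isinstance(rows, list):
        raise TypeError(f"expected list, got {type(rows).__name__}")
    return rows


def coerce_to_list(rows):
    """Lenient option: copy any iterable of rows into a new list."""
    return list(rows)


print(coerce_to_list(([], [])))  # [[], []]
try:
    require_list(([], []))
except TypeError as exc:
    print(exc)  # expected list, got tuple
```

The strict variant keeps the callee simple but pushes the conversion onto every caller; the lenient variant pays for a shallow copy of the outer container on each call.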

The backtrace of the segmentation fault:

    >>> lib.to_object_array(([], []))

    Thread 1 "python" received signal SIGSEGV, Segmentation fault.
    __Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
    55829 Py_INCREF(r);
    (gdb) bt
    #0 __Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
    #1 __pyx_pf_6pandas_5_libs_3lib_102to_object_array (__pyx_self=<optimized out>, __pyx_v_min_width=<optimized out>, __pyx_v_rows=([], [])) at pandas/_libs/lib.c:30958
    #2 __pyx_pw_6pandas_5_libs_3lib_103to_object_array (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at pandas/_libs/lib.c:30855

hmm, IIRC this was put in here pretty specifically, can you see why that was? in any event, I believe you can simply type the input arg to to_object_array_tuples no?

In to_object_array_tuples, if the element in rows is not a tuple, it will be cast to a tuple.

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L2325-L2337
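
The per-row cast described above can be pictured with a small pure-Python sketch. The helper name rows_as_tuples is hypothetical and the code is only an analogy for the behaviour, not the Cython routine itself.

```python
def rows_as_tuples(rows):
    """Hypothetical helper: return every row as a tuple."""
    # Copy the outer container into a list, then cast any row that is
    # not already a tuple.
    return [row if isinstance(row, tuple) else tuple(row) for row in list(rows)]


print(rows_as_tuples([(1, 2), [3, 4]]))  # [(1, 2), (3, 4)]
```

Rows that are already tuples are passed through unchanged, so the cast only costs something for list rows.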

However, I still think it makes sense to treat a list of lists and a list of tuples differently and dispatch them to different methods, to avoid the overhead of casting each list to a tuple. I think a list of lists may be used more frequently. I have added a comment to to_object_array_tuples.

@sighingnow I agree with your point, I am not suggesting combining these routines at all, rather inside each routine the data is first cast to a list

jreback · 2019-03-15T11:37:48Z

pandas/core/internals/construction.py

what i mean is push the listifying down to these 2 routines, in fact to_object_array already has this, need to add similar in to_object_array_tuples

jreback · 2019-03-18T12:09:21Z

pandas/core/internals/construction.py

hmm, IIRC this was put in here pretty specifically, can you see why that was? in any event, I believe you can simply type the input arg to to_object_array_tuples no?

jreback · 2019-03-18T12:09:46Z

doc/source/whatsnew/v0.25.0.rst

@@ -245,6 +245,7 @@ Reshaping
 - Bug in :func:`merge` when merging by index name would sometimes result in an incorrectly numbered index (:issue:`24212`)
 - :func:`to_records` now accepts dtypes to its `column_dtypes` parameter (:issue:`24895`)
 - Bug in :func:`concat` where order of ``OrderedDict`` (and ``dict`` in Python 3.6+) is not respected, when passed in as  ``objs`` argument (:issue:`21510`)
+- Bug in :class:`DataFrame` construct when passing non-empty tuples would cause segmentation fault (:issue:`25691`)


Bug in :class:`DataFrame` constructor when passing non-empty tuples would cause a segmentation fault

jreback · 2019-03-26T00:22:04Z

can you merge master and update

sighingnow · 2019-03-26T04:02:07Z

Merged master.

jreback · 2019-03-26T12:03:02Z

pandas/core/internals/construction.py

@sighingnow I agree with your point, I am not suggesting combining these routines at all, rather inside each routine the data is first cast to a list

jreback · 2019-03-29T12:12:39Z

pandas/_libs/lib.pyx

        tuple row

-    n = len(rows)
+    input_rows = list(rows)


leave this as rows = list(rows) and remove input_rows

Done, and removed the input_rows in to_object_array as well.

jreback

lgtm ex the doc-string which is incorrect.

jreback · 2019-04-05T00:45:31Z

thanks @sighingnow nice patch!

Convert tuple to list before _list_to_arrays when construct DataFrame.

2f00c7f

Signed-off-by: HE, Tao <[email protected]>

sighingnow changed the title ~~Convert tuple to list before _list_to_arrays when construct DataFrame.~~ BUG: Convert tuple to list before _list_to_arrays when construct DataFrame. Mar 14, 2019

jreback requested changes Mar 14, 2019

jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Mar 14, 2019

sighingnow added 2 commits March 15, 2019 00:12

Merge branch 'master' into fix-25691

17827de

* master: Fixturize tests/frame/test_operators.py (pandas-dev#25641) Update ValueError message in corr (pandas-dev#25729) # Conflicts: # doc/source/whatsnew/v0.25.0.rst

Revise.

e4e68ce

Signed-off-by: HE, Tao <[email protected]>

jreback reviewed Mar 14, 2019

jreback requested changes Mar 15, 2019

Remove the unnecessary conversion.

4aa629e

jreback requested changes Mar 18, 2019

sighingnow added 2 commits March 26, 2019 11:28

Merge branch 'master' into fix-25691

d6c029d

Add comment in to_object_array_tuples.

2c896f9

jreback requested changes Mar 26, 2019

sighingnow added 2 commits March 26, 2019 20:20

Do conversion in cython routines.

d9e3fe7

Merge branch 'master' into fix-25691

24fc8de

jreback requested changes Mar 29, 2019

Remove input_rows.

e9a0ab3

jreback requested changes Mar 30, 2019

jreback added this to the 0.25.0 milestone Mar 30, 2019

Remove the extra docstring.

4e41de3

jreback approved these changes Apr 5, 2019

jreback merged commit 9fbb9e7 into pandas-dev:master Apr 5, 2019

simonjayhawkins mentioned this pull request Apr 1, 2020

Crash with access violation (exit code -1073741819 (0xC0000005)) on Windows 10 #32776

Closed
