BUG: Convert tuple to list before _list_to_arrays when construct DataFrame. #25731

Merged

merged 10 commits into pandas-dev:master on Apr 5, 2019

Conversation

sighingnow
Contributor

In _list_to_arrays, either to_object_array_tuples or to_object_array is called; both require the argument to be a list (a list of tuples or a list of lists).
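
For context, a minimal illustration of the input shape from the linked issue (#25691). This sketch is not part of the PR itself, and the column names are made up:

import pandas as pd

# Input shape from the bug report: a tuple of row-tuples.
rows = ((1, 2), (3, 4))

# Before this fix, pd.DataFrame(rows) was reported to segfault because the
# tuple reached lib.to_object_array* without being converted to a list;
# listifying first (as this PR does internally) builds the expected frame.
df = pd.DataFrame(list(rows), columns=["a", "b"])
print(df)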

@sighingnow sighingnow changed the title Convert tuple to list before _list_to_arrays when construct DataFrame. BUG: Convert tuple to list before _list_to_arrays when construct DataFrame. Mar 14, 2019
@@ -394,7 +394,7 @@ def to_arrays(data, columns, coerce_float=False, dtype=None):
                 return [[]] * len(columns), columns
         return [], []  # columns if columns is not None else []
     if isinstance(data[0], (list, tuple)):
-        return _list_to_arrays(data, columns, coerce_float=coerce_float,
+        return _list_to_arrays(list(data), columns, coerce_float=coerce_float,
Contributor

better to put this in _list_to_arrays

Contributor Author

Done.

@@ -124,7 +124,7 @@ Bug Fixes
 ~~~~~~~~~
 - Bug in :func:`to_datetime` which would raise an (incorrect) ``ValueError`` when called with a date far into the future and the ``format`` argument specified instead of raising ``OutOfBoundsDatetime`` (:issue:`23830`)
 - Bug in an error message in :meth:`DataFrame.plot`. Improved the error message if non-numerics are passed to :meth:`DataFrame.plot` (:issue:`25481`)
--
+- Segmentation fault when construct :class:`DataFrame` from non-empty tuples (:issue:`25691`)
Contributor

Bug in

Contributor

move this to Bug Fixes reshaping section

Contributor Author

Done.

@jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Mar 14, 2019
@codecov

codecov bot commented Mar 14, 2019

Codecov Report

Merging #25731 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25731      +/-   ##
==========================================
+ Coverage   91.24%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52973    52973              
==========================================
+ Hits        48337    48338       +1     
+ Misses       4636     4635       -1
Flag Coverage Δ
#multiple 89.82% <100%> (ø) ⬆️
#single 41.74% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/construction.py 95.9% <100%> (ø) ⬆️
pandas/util/testing.py 89.08% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5d251d...2f00c7f. Read the comment docs.

@codecov

codecov bot commented Mar 14, 2019

Codecov Report

Merging #25731 into master will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25731      +/-   ##
==========================================
+ Coverage   91.77%   91.81%   +0.03%     
==========================================
  Files         175      175              
  Lines       52606    52580      -26     
==========================================
- Hits        48280    48274       -6     
+ Misses       4326     4306      -20
Flag Coverage Δ
#multiple 90.36% <ø> (+0.04%) ⬆️
#single 41.9% <ø> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/config_init.py 96.96% <0%> (-2.24%) ⬇️
pandas/core/computation/common.py 89.47% <0%> (-0.53%) ⬇️
pandas/compat/pickle_compat.py 69.13% <0%> (-0.38%) ⬇️
pandas/core/groupby/grouper.py 98.16% <0%> (-0.37%) ⬇️
pandas/compat/numpy/__init__.py 92.85% <0%> (-0.25%) ⬇️
pandas/plotting/_style.py 77.17% <0%> (-0.25%) ⬇️
pandas/core/computation/engines.py 88.52% <0%> (-0.19%) ⬇️
pandas/plotting/_timeseries.py 65.28% <0%> (-0.18%) ⬇️
pandas/io/excel/_util.py 87.5% <0%> (-0.18%) ⬇️
... and 54 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7721f70...4e41de3. Read the comment docs.

* master:
  Fixturize tests/frame/test_operators.py (pandas-dev#25641)
  Update ValueError message in corr (pandas-dev#25729)

# Conflicts:
#	doc/source/whatsnew/v0.25.0.rst
Signed-off-by: HE, Tao <[email protected]>
@@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None):

 def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
     if len(data) > 0 and isinstance(data[0], tuple):
-        content = list(lib.to_object_array_tuples(data).T)
+        content = list(lib.to_object_array_tuples(list(data)).T)
Contributor

can you move this down even further into to_object_array

Contributor Author

Just to clarify: to_object_array_tuples is used to convert a list of tuples, and to_object_array is used to convert a list of lists. How could these two cases be unified into to_object_array?

The tuple (or list) needs to be converted to a list before being sent to to_object_array_tuples; otherwise we get an incorrect type error for pd.DataFrame(((),())).
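
Roughly, the dispatch under discussion looks like the following simplified sketch (not the actual pandas source; it assumes pandas is importable and uses the public names of the two Cython helpers):

from pandas._libs import lib

def _list_to_arrays_sketch(data):
    # Normalize the outer container to a plain list before calling the
    # Cython helpers, then dispatch on the row type: tuples go to
    # to_object_array_tuples, other sequences to to_object_array.
    data = list(data)
    if len(data) > 0 and isinstance(data[0], tuple):
        return list(lib.to_object_array_tuples(data).T)
    return list(lib.to_object_array(data).T)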

Contributor

what I mean is: push the listifying down into these two routines; in fact to_object_array already has this, we just need to add something similar in to_object_array_tuples

Contributor Author

In fact, the segmentation fault that happens in #25691 is caused by to_object_array, rather than to_object_array_tuples. The input_rows = <list>rows does nothing more than an assignment:

  /* "pandas/_libs/lib.pyx":2272
 *         list row
 * 
 *     input_rows = <list>rows             # <<<<<<<<<<<<<<
 *     n = len(input_rows)
 * 
 */
  __pyx_t_1 = __pyx_v_rows;
  __Pyx_INCREF(__pyx_t_1);
  __pyx_v_input_rows = ((PyObject*)__pyx_t_1);
  __pyx_t_1 = 0;

rows and input_rows are both PyObject *. The whole story: when we invoke to_object_array with ([], []), the list indexing helper (__Pyx_GetItemInt_List_Fast) is applied to the tuple and returns a null pointer. We then Py_INCREF it, hence the segmentation fault.

I have revised the patch to remove the unused conversion in to_object_array and restrict its argument type to list. Another possible solution would be to change the type of the rows argument from list to object in both methods, but I think lists are used more commonly than tuples to construct a DataFrame, __Pyx_GetItemInt_List_Fast should be faster than __Pyx_GetItemInt_Generic, and the tuple-to-list conversion on the Python side is fairly cheap.

The backtrace of the segmentation fault:
>>> lib.to_object_array(([], []))

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
55829           Py_INCREF(r);
(gdb) bt
#0  __Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
#1  __pyx_pf_6pandas_5_libs_3lib_102to_object_array (__pyx_self=<optimized out>, __pyx_v_min_width=<optimized out>,
    __pyx_v_rows=([], [])) at pandas/_libs/lib.c:30958
#2  __pyx_pw_6pandas_5_libs_3lib_103to_object_array (__pyx_self=<optimized out>, __pyx_args=<optimized out>,
    __pyx_kwds=<optimized out>) at pandas/_libs/lib.c:30855

Contributor

hmm, IIRC this was put in here pretty specifically, can you see why that was? in any event, I believe you can simply type the input arg to to_object_array_tuples no?

Contributor Author

In to_object_array_tuples, if an element in rows is not a tuple, it will be cast to a tuple.

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L2325-L2337

However, I still think it makes sense to treat a list of lists and a list of tuples differently and dispatch them to different methods, to avoid the overhead of casting lists to tuples. I think a list of lists may be used more frequently. I have added a comment to to_object_array_tuples.

Contributor

@sighingnow I agree with your point. I am not suggesting combining these routines at all, rather that inside each routine the data is first cast to a list.

Contributor Author

Done.

@@ -245,6 +245,7 @@ Reshaping
 - Bug in :func:`merge` when merging by index name would sometimes result in an incorrectly numbered index (:issue:`24212`)
 - :func:`to_records` now accepts dtypes to its `column_dtypes` parameter (:issue:`24895`)
 - Bug in :func:`concat` where order of ``OrderedDict`` (and ``dict`` in Python 3.6+) is not respected, when passed in as ``objs`` argument (:issue:`21510`)
+- Bug in :class:`DataFrame` construct when passing non-empty tuples would cause segmentation fault (:issue:`25691`)
Contributor

Bug in :class:`DataFrame` constructor when passing non-empty tuples would cause a segmentation fault


jreback commented Mar 26, 2019

can you merge master and update

@sighingnow
Contributor Author

Merged master.

tuple row

n = len(rows)
input_rows = list(rows)
Contributor

leave this as rows = list(rows) and remove input_rows

Contributor Author

Done, and removed the input_rows in to_object_array as well.
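
As a rough illustration of where the thread landed, here is a Python approximation of the helper's post-fix behavior (a sketch only, not the actual lib.pyx code; to_object_array_tuples_sketch is a made-up name):

import numpy as np

def to_object_array_tuples_sketch(rows, min_width=0):
    # Listify first, so a tuple of rows (the #25691 case) is handled the
    # same way as a list of rows.
    rows = list(rows)
    n = len(rows)
    k = min_width
    for row in rows:
        k = max(k, len(row))
    result = np.empty((n, k), dtype=object)
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            result[i, j] = val
    return result

# to_object_array_tuples_sketch(((1, 2), (3, 4))) -> a 2x2 object ndarray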

Contributor

@jreback left a comment

lgtm except for the doc-string, which is incorrect.

@jreback added this to the 0.25.0 milestone on Mar 30, 2019
@jreback merged commit 9fbb9e7 into pandas-dev:master on Apr 5, 2019

jreback commented Apr 5, 2019

thanks @sighingnow nice patch!

Labels
Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Segmentation fault using tuple as iterator for DataFrame constructor
2 participants