-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Convert tuple to list before _list_to_arrays
when construct DataFrame.
#25731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: HE, Tao <[email protected]>
_list_to_arrays
when construct DataFrame._list_to_arrays
when construct DataFrame.
@@ -394,7 +394,7 @@ def to_arrays(data, columns, coerce_float=False, dtype=None): | |||
return [[]] * len(columns), columns | |||
return [], [] # columns if columns is not None else [] | |||
if isinstance(data[0], (list, tuple)): | |||
return _list_to_arrays(data, columns, coerce_float=coerce_float, | |||
return _list_to_arrays(list(data), columns, coerce_float=coerce_float, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to put this in _list_to_arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -124,7 +124,7 @@ Bug Fixes | |||
~~~~~~~~~ | |||
- Bug in :func:`to_datetime` which would raise an (incorrect) ``ValueError`` when called with a date far into the future and the ``format`` argument specified instead of raising ``OutOfBoundsDatetime`` (:issue:`23830`) | |||
- Bug in an error message in :meth:`DataFrame.plot`. Improved the error message if non-numerics are passed to :meth:`DataFrame.plot` (:issue:`25481`) | |||
- | |||
- Segmentation fault when construct :class:`DataFrame` from non-empty tuples (:issue:`25691`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to Bug Fixes reshaping section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Codecov Report
@@ Coverage Diff @@
## master #25731 +/- ##
==========================================
+ Coverage 91.24% 91.25% +<.01%
==========================================
Files 172 172
Lines 52973 52973
==========================================
+ Hits 48337 48338 +1
+ Misses 4636 4635 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25731 +/- ##
==========================================
+ Coverage 91.77% 91.81% +0.03%
==========================================
Files 175 175
Lines 52606 52580 -26
==========================================
- Hits 48280 48274 -6
+ Misses 4326 4306 -20
Continue to review full report at Codecov.
|
* master: Fixturize tests/frame/test_operators.py (pandas-dev#25641) Update ValueError message in corr (pandas-dev#25729) # Conflicts: # doc/source/whatsnew/v0.25.0.rst
Signed-off-by: HE, Tao <[email protected]>
@@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None): | |||
|
|||
def _list_to_arrays(data, columns, coerce_float=False, dtype=None): | |||
if len(data) > 0 and isinstance(data[0], tuple): | |||
content = list(lib.to_object_array_tuples(data).T) | |||
content = list(lib.to_object_array_tuples(list(data)).T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this down even futher into to_object_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify: to_object_array_tuples
is used to convert list of tuples, and to_object_array
is used to convert list of lists. How could these two case be unified to to_object_array
?
The tuple (or list) needs to be converted to list before send to to_object_array_tuples
, otherwise we will see incorrect type error for pd.DataFrame(((),()))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what i mean is push the listifying down to these 2 routines, in fact to_object_array
already has this, need to add similar in to_object_array_tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, the segmentation fault happens in #25691 is caused by to_object_array
, rather than to_object_array_tuples
. The input_rows = <list>rows
does nothing than an assignment:
/* "pandas/_libs/lib.pyx":2272
* list row
*
* input_rows = <list>rows # <<<<<<<<<<<<<<
* n = len(input_rows)
*
*/
__pyx_t_1 = __pyx_v_rows;
__Pyx_INCREF(__pyx_t_1);
__pyx_v_input_rows = ((PyObject*)__pyx_t_1);
__pyx_t_1 = 0;
The rows
and input_rows
are both PyObject *
. The whole story is, when we invoke to_object_array
with ([], [])
, we use list's []
(__Pyx_GetItemInt_List_Fast), it results a null pointer. Then we Py_INCREF
it, thus the segmentation fault.
I have revised the patch to remove the unused conversion in to_object_array
and restrict its argument type to list
. Another possible solution should be change the type of argument rows
to object
from list
the both two methods, but I think lists is used more commonly to construct DataFrame than tuples. __Pyx_GetItemInt_List_Fast
should be faster than __Pyx_GetItemInt_Generic
and the conversion from tuple to list in python side is fairly ok.
The backtrace of the segmentation fault
>>> lib.to_object_array(([], []))
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
55829 Py_INCREF(r);
(gdb) bt
#0 __Pyx_GetItemInt_List_Fast (boundscheck=1, wraparound=1, i=0, o=([], [])) at pandas/_libs/lib.c:55829
#1 __pyx_pf_6pandas_5_libs_3lib_102to_object_array (__pyx_self=<optimized out>, __pyx_v_min_width=<optimized out>,
__pyx_v_rows=([], [])) at pandas/_libs/lib.c:30958
#2 __pyx_pw_6pandas_5_libs_3lib_103to_object_array (__pyx_self=<optimized out>, __pyx_args=<optimized out>,
__pyx_kwds=<optimized out>) at pandas/_libs/lib.c:30855
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, IIRC this was put in here pretty specifically, can you see why that was? in any event, I believe you can simply type the input arg to to_object_array_tuples
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In to_object_array_tuples
, if the element in rows
is not a tuple, it will be cast to a tuple.
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L2325-L2337
However I still think it make sense to treat list of lists
and list of tuples
differently and dispatch them to different methods to avoid the overhead caused by casting list
to tuple
. I think list of lists
may be used more frequently. I have add a comment to to_object_array_tuples
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sighingnow I agree with your point, I am not suggesting combing these routines at all, rather inside each routine the data is first cast to a list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None): | |||
|
|||
def _list_to_arrays(data, columns, coerce_float=False, dtype=None): | |||
if len(data) > 0 and isinstance(data[0], tuple): | |||
content = list(lib.to_object_array_tuples(data).T) | |||
content = list(lib.to_object_array_tuples(list(data)).T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what i mean is push the listifying down to these 2 routines, in fact to_object_array
already has this, need to add similar in to_object_array_tuples
@@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None): | |||
|
|||
def _list_to_arrays(data, columns, coerce_float=False, dtype=None): | |||
if len(data) > 0 and isinstance(data[0], tuple): | |||
content = list(lib.to_object_array_tuples(data).T) | |||
content = list(lib.to_object_array_tuples(list(data)).T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, IIRC this was put in here pretty specifically, can you see why that was? in any event, I believe you can simply type the input arg to to_object_array_tuples
no?
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -245,6 +245,7 @@ Reshaping | |||
- Bug in :func:`merge` when merging by index name would sometimes result in an incorrectly numbered index (:issue:`24212`) | |||
- :func:`to_records` now accepts dtypes to its `column_dtypes` parameter (:issue:`24895`) | |||
- Bug in :func:`concat` where order of ``OrderedDict`` (and ``dict`` in Python 3.6+) is not respected, when passed in as ``objs`` argument (:issue:`21510`) | |||
- Bug in :class:`DataFrame` construct when passing non-empty tuples would cause segmentation fault (:issue:`25691`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug in :class:DataFrame
constructor when passing non-empty tuples would cause a segmentation fault
can you merge master and update |
Merged master. |
@@ -422,10 +422,10 @@ def to_arrays(data, columns, coerce_float=False, dtype=None): | |||
|
|||
def _list_to_arrays(data, columns, coerce_float=False, dtype=None): | |||
if len(data) > 0 and isinstance(data[0], tuple): | |||
content = list(lib.to_object_array_tuples(data).T) | |||
content = list(lib.to_object_array_tuples(list(data)).T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sighingnow I agree with your point, I am not suggesting combing these routines at all, rather inside each routine the data is first cast to a list
pandas/_libs/lib.pyx
Outdated
tuple row | ||
|
||
n = len(rows) | ||
input_rows = list(rows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave this as
rows=list(rows)
and remove input_rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and removed the input_rows
in to_object_array
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm ex the doc-string which is incorrect.
thanks @sighingnow nice patch! |
In
_list_to_arrays
,to_object_array_tuples
orto_object_array
is called, both require the argument to be alist
(list of tuples or list of list).git diff upstream/master -u -- "*.py" | flake8 --diff