
BUG: Support to create DataFrame from list subclasses #21238


Merged: 1 commit merged into pandas-dev:master on Dec 15, 2018

Conversation

mitar
Contributor

@mitar mitar commented May 29, 2018

An alternative fix could be that, instead of doing an isinstance check in frame's _to_arrays method:

if isinstance(data[0], (list, tuple)):

We would do:

if type(data[0]) in (list, tuple):

This would then ensure that to_object_array is really called only with exact lists. But currently there is a mismatch: the isinstance check accepts subtypes, while to_object_array does not support them. I think it is better to support them, so this merge request adds support for subtypes/subclasses of list.
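To illustrate the mismatch with a minimal, self-contained sketch (the MyList class is hypothetical, not from the PR):

```python
class MyList(list):
    """A hypothetical list subclass (e.g. carrying extra metadata)."""

data = [MyList([1, 2, 3])]

# The isinstance check in _to_arrays accepts subclasses:
print(isinstance(data[0], (list, tuple)))  # True

# An exact-type check would reject them:
print(type(data[0]) in (list, tuple))  # False
```

So with the isinstance check, MyList rows are routed to to_object_array, which then fails because it requires the exact list type.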

@pep8speaks

pep8speaks commented May 29, 2018

Hello @mitar! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 29, 2018 at 04:52 UTC

@mitar
Contributor Author

mitar commented May 29, 2018

I also ran benchmarks:

     [1c2844ac]       [3a64f848]
+         255±1μs         99.7±1ms   391.76  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>)
+        2.51±0ms           5.68ms     2.26  index_object.Ops.time_multiply('float')
+           1.39s            2.45s     1.76  plotting.Misc.time_plot_andrews_curves
+        31.3±1ms       43.5±0.1ms     1.39  join_merge.MergeAsof.time_on_int
+      39.0±0.8ms         52.9±2ms     1.36  eval.Eval.time_mult('python', 1)
+        669±30μs         849±30μs     1.27  join_merge.JoinNonUnique.time_join_non_unique_equal
+         137±5ms          174±2ms     1.27  eval.Eval.time_chained_cmp('python', 'all')
+      35.3±0.7ms       43.6±0.4ms     1.24  categoricals.ValueCounts.time_value_counts(False)
+      57.1±0.5ms         70.3±1ms     1.23  join_merge.MergeAsof.time_by_int
+     4.39±0.05ms      5.40±0.03ms     1.23  rolling.Quantile.time_quantile('Series', 1000, 'int', 1, 'linear')
+           4.80s            5.79s     1.21  gil.ParallelGroups.time_get_groups(8)
+     64.4±0.09ms      75.0±0.08ms     1.16  groupby.Groups.time_series_groups('int64_small')
+         932±2ns      1.08±0.01μs     1.16  timestamp.TimestampProperties.time_quarter(None, None)
+     1.21±0.01μs      1.40±0.03μs     1.15  timestamp.TimestampProperties.time_dayofyear(None, 'B')
+        76.5±2ms         85.9±1ms     1.12  eval.Eval.time_add('numexpr', 'all')
+           280ms            312ms     1.12  gil.ParallelKth.time_kth_smallest
+     3.72±0.01ms         4.14±0ms     1.11  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
+        3.70±0ms      4.11±0.01ms     1.11  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
+        4.18±0ms      4.61±0.01ms     1.10  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
+        92.7±3ms        102±0.8ms     1.10  strings.Methods.time_pad
-        66.0±1ms       59.9±0.2ms     0.91  strings.Methods.time_strip
-           103ms       91.1±0.2ms     0.88  frame_methods.Count.time_count_level_mixed_dtypes_multi(0)
-     6.81±0.02ms       5.93±0.1ms     0.87  rolling.Quantile.time_quantile('Series', 1000, 'float', 1, 'midpoint')
-          16.7ms       14.5±0.8ms     0.87  frame_methods.Reindex.time_reindex_upcast
-           1.43s            1.23s     0.86  join_merge.I8Merge.time_i8merge('left')
-         242±1ms          209±3ms     0.86  rolling.VariableWindowMethods.time_rolling('Series', '1d', 'float', 'median')
-     4.44±0.02μs      3.81±0.02μs     0.86  period.PeriodProperties.time_property('min', 'minute')
-          42.2ms           36.1ms     0.85  eval.Query.time_query_datetime_index
-          22.9ms           19.4ms     0.85  eval.Query.time_query_datetime_column
-      2.15±0.3ms      1.77±0.04ms     0.83  frame_methods.Isnull.time_isnull
-     12.1±0.06ms       9.97±0.1ms     0.82  rolling.VariableWindowMethods.time_rolling('Series', '1d', 'int', 'std')
-           908ms            730ms     0.80  join_merge.ConcatPanels.time_c_ordered(2, True)
-      99.1±0.7ms       79.2±0.6ms     0.80  rolling.VariableWindowMethods.time_rolling('Series', '50s', 'int', 'median')
-          15.1ms       12.0±0.2ms     0.80  io.hdf.HDFStoreDataFrame.time_query_store_table
-          10.5ms           8.17ms     0.78  frame_methods.Reindex.time_reindex_axis0
-           1.05s            718ms     0.69  join_merge.ConcatPanels.time_c_ordered(2, False)
-        174±30ms         94.8±2ms     0.55  eval.Eval.time_and('numexpr', 'all')
-        17.2±2ms      8.24±0.07ms     0.48  binary_ops.Ops.time_frame_mult(True, 1)

I do not see how any of these changes are related to the list constructor, so I am not sure they are relevant.

@@ -1487,7 +1487,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(list rows, int min_width=0):
def to_object_array(rows, int min_width=0):
Member


Does this have any noticeable performance impact?

Member


Sorry, I noticed you posted ASVs before, but the results should probably come from frame_ctor.py. At first glance I didn't see anything that measures construction just from a list: can you double-check that and, if there is nothing, add a benchmark to ensure no major regression?

Contributor Author


Done.

Contributor Author


No performance impact.

list row

input_rows = <list>rows
Member


Hmm what's the point of this variable if it's never used elsewhere?

Contributor Author


It is just a type check. :-) So this will throw an exception if it is not really a list or a subclass of it. I can also use it later on.
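In pure Python terms, the effect of that cast can be sketched roughly as follows (hypothetical helper name; the real code is Cython, and per the comment above the cast raises if the object is not a list):

```python
class MyList(list):
    """A hypothetical list subclass."""

def checked_cast(rows):
    # Mimics the Cython `input_rows = <list>rows` assignment: accepts
    # list and list subclasses, raises TypeError for anything else.
    if not isinstance(rows, list):
        raise TypeError("expected list, got %s" % type(rows).__name__)
    return rows

checked_cast([1, 2])          # plain list: OK
checked_cast(MyList([1, 2]))  # subclass: OK
```

Passing, say, a tuple would raise TypeError, which is the type safety the cast provides.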

@mitar mitar force-pushed the from-list branch 2 times, most recently from 8db5eaf to 403c23d Compare May 29, 2018 04:52
@codecov

codecov bot commented May 29, 2018

Codecov Report

Merging #21238 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21238   +/-   ##
=======================================
  Coverage   92.22%   92.22%           
=======================================
  Files         162      162           
  Lines       51828    51828           
=======================================
  Hits        47798    47798           
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.62% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️


@jorisvandenbossche
Member

Can you give a timing of the benchmark that you added (I would just do it with %timeit using the same code, that's a bit easier / more reliable) before/after?
I don't expect a difference, but just to be sure.

@mitar
Contributor Author

mitar commented May 29, 2018

I ran the benchmark:

· Running 22 total benchmarks (2 commits * 1 environments * 11 benchmarks)
[  0.00%] · For pandas commit hash 403c23d4:
[  0.00%] ·· Building for virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[  4.55%] ··· Running frame_ctor.FromDicts.time_list_of_dict                                                                                                                         89.1±0.4ms
[  9.09%] ··· Running frame_ctor.FromDicts.time_nested_dict                                                                                                                             101±1ms
[ 13.64%] ··· Running frame_ctor.FromDicts.time_nested_dict_columns                                                                                                                     103±1ms
[ 18.18%] ··· Running frame_ctor.FromDicts.time_nested_dict_index                                                                                                                    70.4±0.6ms
[ 22.73%] ··· Running frame_ctor.FromDicts.time_nested_dict_index_columns                                                                                                            72.5±0.7ms
[ 27.27%] ··· Running frame_ctor.FromDicts.time_nested_dict_int64                                                                                                                     202±0.2ms
[ 31.82%] ··· Running frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets                                                                                          79.6±0.2ms;...
[ 36.36%] ··· Running frame_ctor.FromLists.time_frame_from_lists                                                                                                                     60.3±0.4ms
[ 40.91%] ··· Running frame_ctor.FromNDArray.time_frame_from_ndarray                                                                                                                  173±0.3μs
[ 45.45%] ··· Running frame_ctor.FromRecords.time_frame_from_records_generator                                                                                                      195±2ms;...
[ 50.00%] ··· Running frame_ctor.FromSeries.time_mi_series                                                                                                                            209±0.2μs
[ 50.00%] · For pandas commit hash 1c2844ac:
[ 50.00%] ·· Building for virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 54.55%] ··· Running frame_ctor.FromDicts.time_list_of_dict                                                                                                                             88.6ms
[ 59.09%] ··· Running frame_ctor.FromDicts.time_nested_dict                                                                                                                             101±1ms
[ 63.64%] ··· Running frame_ctor.FromDicts.time_nested_dict_columns                                                                                                                     104±1ms
[ 68.18%] ··· Running frame_ctor.FromDicts.time_nested_dict_index                                                                                                                    70.7±0.6ms
[ 72.73%] ··· Running frame_ctor.FromDicts.time_nested_dict_index_columns                                                                                                            71.2±0.7ms
[ 77.27%] ··· Running frame_ctor.FromDicts.time_nested_dict_int64                                                                                                                     202±0.3ms
[ 81.82%] ··· Running frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets                                                                                          76.8±0.3ms;...
[ 86.36%] ··· Running frame_ctor.FromLists.time_frame_from_lists                                                                                                                     59.2±0.6ms
[ 90.91%] ··· Running frame_ctor.FromNDArray.time_frame_from_ndarray                                                                                                                  176±0.2μs
[ 95.45%] ··· Running frame_ctor.FromRecords.time_frame_from_records_generator                                                                                                      203±1ms;...
[100.00%] ··· Running frame_ctor.FromSeries.time_mi_series                                                                                                                            209±0.2μs
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@mitar
Copy link
Contributor Author

mitar commented May 29, 2018

@jorisvandenbossche I am not sure what you are asking. It looks like one run takes around 60 ms from the results above.

@jorisvandenbossche
Member

What I meant was doing (before and after):

In [7]: N = 1000
   ...: M = 100
   ...: data = [[j for j in range(M)] for i in range(N)]

In [8]: %timeit DataFrame(data)
31.1 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If you want to time a very specific thing, %timeit is both easier and more accurate than ASV in my experience (but ASV is still important to keep track of this in the long term / to benchmark changes that can have wide impact).
But what you posted is fine as well! Performance is not affected.

@mitar
Contributor Author

mitar commented May 29, 2018

Oh, I didn't know about %timeit.

@@ -1508,20 +1508,22 @@ def to_object_array(list rows, int min_width=0):
cdef:
Py_ssize_t i, j, n, k, tmp
ndarray[object, ndim=2] result
list input_rows
Member


Is there not a way to make the type declaration part of the call signature? Just wondering if that wouldn't be cleaner. I am also curious whether doing it in this fashion requires more memory to construct a DataFrame, which could be an issue when dealing with very large lists.

Contributor Author


I am not an expert in Cython, but this approach is what I found using Google on their mailing list. :-) So the idea is to not have the type in the signature and cast it later. (Casting still throws an exception if it is impossible.)

I do not see why this would require any more memory.

Member


It may not - I am just not sure how the casting rules are applied. Looking at the module, there appear to be other instances where the type is in the signature (see to_object_array_tuples). Seems like that would be preferable if it's an option.

Contributor Author


What do you mean? The whole issue is that if you put list in the signature, then Cython requires exactly the list type and not a subclass. So removing the type from the signature and doing the cast fixes this. In to_object_array_tuples the argument type is list, which again means that only the strict list type can be provided.

Contributor Author


I can put the object type there. It seems equivalent. This is what infer_dtype has.

@jreback
Contributor

jreback commented May 30, 2018

why is a List subclass actually useful?

@mitar
Contributor Author

mitar commented May 30, 2018

I mean, there are different places where you can get data which is a list subclass. So if you want to convert such data to a DataFrame, then why not? For example, we are using a list subclass to have additional methods on the list and additional metadata.

In any case, pandas is currently inconsistent and something has to be changed. Currently it checks with isinstance(data, list) whether it can call this method, and then calls it, which fails. So either that check is changed, or this method is changed.

I do not understand why we would not change this method. There is no performance penalty, and it allows easier interoperability with other data. In general pandas tries to be well integrated with the Python ecosystem, so if something is an instance of list in Python, this means that you can always use it as a list. But pandas currently does not support this.

So, my question is: why not? What is problematic in this pull request?
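A minimal sketch of the kind of list subclass described here (hypothetical class, stdlib only):

```python
class AnnotatedList(list):
    """Hypothetical list subclass with extra methods/metadata, as described."""
    def __init__(self, data, metadata=None):
        super().__init__(data)
        self.metadata = metadata or {}

rows = AnnotatedList([[1, 2], [3, 4]], metadata={"source": "sensor"})

# Everywhere Python checks with isinstance, this is still a list:
assert isinstance(rows, list)
assert rows[0] == [1, 2]
# With this PR, pandas.DataFrame(rows) accepts such data as well.
```

Before the fix, the isinstance check in _to_arrays routed such an object to to_object_array, which then rejected it.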

@jorisvandenbossche
Member

I am fine with this patch. It is a minor change and easy to support it.
In any case, the current error message is not really user friendly (due to the inconsistency @mitar notes between the isinstance check and the actual Cython function).

@jreback
Contributor

jreback commented May 30, 2018

The problem is that we have many, many places where we actually check for lists for various reasons, e.g. that something is not a scalar or not a tuple, and we also care whether things are array-like (IOW have a dtype) or not. So sure, this patch looks OK, but it is actually introducing more inconsistencies than it is solving.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label May 30, 2018
@mitar
Contributor Author

mitar commented May 30, 2018

Are you checking elsewhere with type(...) is or with isinstance?

@mitar
Contributor Author

mitar commented Jul 24, 2018

Any update here? As I mentioned, either there is a problem in the code path (it first checks with isinstance but later requires the exact list type), or we should fix it so that the later check also uses isinstance.

@jreback
Contributor

jreback commented Oct 11, 2018

if you want to merge master and update, we can see where this is

@jreback
Contributor

jreback commented Nov 23, 2018

closing as stale. if you'd like to continue, pls ping.

@jreback jreback closed this Nov 23, 2018
@rok
Contributor

rok commented Dec 14, 2018

closing as stale. if you'd like to continue, pls ping.

Hey @jreback, I'd like to help close this. Could you please reopen the PR so I can commit changes to it?

@rok
Contributor

rok commented Dec 14, 2018

Hey all, I think the discussion here has already concluded and we can merge?

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Dec 14, 2018
@jreback jreback added this to the 0.24.0 milestone Dec 14, 2018
Contributor

@jreback jreback left a comment


can you show the result of running those benchmarks (before and after)? also, can you run the DataFrame constructor benchmarks and report.

@@ -2208,7 +2208,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(rows: list, min_width: int=0):
def to_object_array(rows, int min_width=0):
Contributor


can you leave the type annotation

@@ -2208,7 +2208,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(rows: list, min_width: int=0):
def to_object_array(rows, int min_width=0):
Contributor


type rows as object

Contributor


Suggested change
def to_object_array(rows, int min_width=0):
def to_object_array(rows: object, int min_width=0):

Looks good indeed!

Contributor

@rok rok Dec 14, 2018


       before           after         ratio
     [a1cf3f65]       [040f06f7]
+     96.3±0.04ms         126±20ms     1.31  reindex.DropDuplicates.time_frame_drop_dups(True)
+     7.32±0.06μs       9.50±0.1μs     1.30  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'unique_monotonic_inc')
+      21.7±0.4μs       25.4±0.2μs     1.17  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+      62.5±0.1μs       71.6±0.7μs     1.15  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-      9.59±0.3ms       8.45±0.2ms     0.88  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-       126±0.2μs          111±8μs     0.88  index_object.SetOperations.time_operation('datetime', 'union')
-        86.4±4ms         83.4±4μs     0.00  multiindex_object.GetLoc.time_large_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'm going to rerun for the ratio > 1 and report back.

Contributor


I've rerun twice overnight and the reported significant changes don't overlap between runs. I assume this is just the benchmarks being noisy. (Commit hashes changed because I rebased again.)

       before           after         ratio
     [d43ac979]       [8661bbb6]
     <master>         <from-list>
+        50.7±2ms         93.0±3ms     1.84  frame_methods.Iteration.time_iteritems_indexing
+     6.32±0.02μs         7.65±1μs     1.21  timestamp.TimestampOps.time_normalize(tzutc())
+     11.5±0.03μs         13.3±1μs     1.16  offset.OffestDatetimeArithmetic.time_apply(<BusinessMonthEnd>)
+     2.51±0.01μs       2.88±0.2μs     1.15  timeseries.DatetimeIndex.time_get('dst')
+     2.50±0.01μs       2.85±0.2μs     1.14  timeseries.DatetimeIndex.time_get('tz_naive')
+      11.8±0.2μs         13.4±1μs     1.13  timestamp.TimestampConstruction.time_parse_iso8601_tz
+     2.66±0.01μs       3.00±0.2μs     1.13  timeseries.DatetimeIndex.time_get('repeated')
+     20.8±0.07μs         23.1±2μs     1.11  timestamp.TimestampProperties.time_weekday_name(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-      8.40±0.3ms       7.48±0.2ms     0.89  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      9.62±0.3ms       8.52±0.3ms     0.89  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      8.85±0.4ms       7.69±0.1ms     0.87  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      8.68±0.8μs      7.48±0.03μs     0.86  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')
-      1.13±0.2μs          896±6ns     0.79  period.PeriodProperties.time_property('min', 'day')
-      86.8±0.6ms          100±4μs     0.00  multiindex_object.GetLoc.time_large_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

       before           after         ratio
     [d43ac979]       [8661bbb6]
     <master>         <from-list>
+        82.4±1μs          104±1μs     1.26  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+     7.09±0.06μs       8.21±0.8μs     1.16  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'unique_monotonic_inc')
+      18.5±0.3μs         20.6±1μs     1.11  indexing.NonNumericSeriesIndexing.time_get_value('datetime', 'unique_monotonic_inc')
-         803±6ms          714±4ms     0.89  gil.ParallelGroupbyMethods.time_parallel(8, 'mean')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Dec 14, 2018

@WillAyd if you have any comments

@jreback jreback merged commit ca85a41 into pandas-dev:master Dec 15, 2018
@jreback
Contributor

jreback commented Dec 15, 2018

thanks @mitar

@mitar mitar deleted the from-list branch December 15, 2018 21:31
@mitar
Contributor Author

mitar commented Dec 15, 2018

Thanks @rok.

@rok
Contributor

rok commented Dec 15, 2018

Thanks all! :)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Successfully merging this pull request may close these issues.

Cannot create a DataFrame from list subclass