
BUG: Support to create DataFrame from list subclasses #21238


Merged: 1 commit merged into pandas-dev:master on Dec 15, 2018

Conversation

mitar
Contributor

@mitar mitar commented May 29, 2018

An alternative fix could be that, instead of doing an isinstance check in frame's _to_arrays method:

if isinstance(data[0], (list, tuple)):

We would do:

if type(data[0]) in (list, tuple):

This would then ensure that to_object_array is really called only with exact lists. But currently there is a mismatch: the isinstance check accepts subtypes, while to_object_array does not support them. I think it is better to support them, so this merge request adds support for subtypes/subclasses of list.
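To illustrate the mismatch with a minimal, self-contained sketch (the MyList class is hypothetical, not from the PR):

```python
class MyList(list):
    """A hypothetical list subclass (e.g. carrying extra metadata)."""

data = [MyList([1, 2, 3])]

# The isinstance check in _to_arrays accepts subclasses:
print(isinstance(data[0], (list, tuple)))  # True

# An exact-type check would reject them:
print(type(data[0]) in (list, tuple))  # False
```

So with the isinstance check, MyList rows are routed to to_object_array, which then fails because it requires the exact list type.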

@pep8speaks

pep8speaks commented May 29, 2018

Hello @mitar! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 29, 2018 at 04:52 UTC

@mitar
Contributor Author

mitar commented May 29, 2018

I also ran benchmarks:

     [1c2844ac]       [3a64f848]
+         255±1μs         99.7±1ms   391.76  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>)
+        2.51±0ms           5.68ms     2.26  index_object.Ops.time_multiply('float')
+           1.39s            2.45s     1.76  plotting.Misc.time_plot_andrews_curves
+        31.3±1ms       43.5±0.1ms     1.39  join_merge.MergeAsof.time_on_int
+      39.0±0.8ms         52.9±2ms     1.36  eval.Eval.time_mult('python', 1)
+        669±30μs         849±30μs     1.27  join_merge.JoinNonUnique.time_join_non_unique_equal
+         137±5ms          174±2ms     1.27  eval.Eval.time_chained_cmp('python', 'all')
+      35.3±0.7ms       43.6±0.4ms     1.24  categoricals.ValueCounts.time_value_counts(False)
+      57.1±0.5ms         70.3±1ms     1.23  join_merge.MergeAsof.time_by_int
+     4.39±0.05ms      5.40±0.03ms     1.23  rolling.Quantile.time_quantile('Series', 1000, 'int', 1, 'linear')
+           4.80s            5.79s     1.21  gil.ParallelGroups.time_get_groups(8)
+     64.4±0.09ms      75.0±0.08ms     1.16  groupby.Groups.time_series_groups('int64_small')
+         932±2ns      1.08±0.01μs     1.16  timestamp.TimestampProperties.time_quarter(None, None)
+     1.21±0.01μs      1.40±0.03μs     1.15  timestamp.TimestampProperties.time_dayofyear(None, 'B')
+        76.5±2ms         85.9±1ms     1.12  eval.Eval.time_add('numexpr', 'all')
+           280ms            312ms     1.12  gil.ParallelKth.time_kth_smallest
+     3.72±0.01ms         4.14±0ms     1.11  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
+        3.70±0ms      4.11±0.01ms     1.11  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
+        4.18±0ms      4.61±0.01ms     1.10  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
+        92.7±3ms        102±0.8ms     1.10  strings.Methods.time_pad
-        66.0±1ms       59.9±0.2ms     0.91  strings.Methods.time_strip
-           103ms       91.1±0.2ms     0.88  frame_methods.Count.time_count_level_mixed_dtypes_multi(0)
-     6.81±0.02ms       5.93±0.1ms     0.87  rolling.Quantile.time_quantile('Series', 1000, 'float', 1, 'midpoint')
-          16.7ms       14.5±0.8ms     0.87  frame_methods.Reindex.time_reindex_upcast
-           1.43s            1.23s     0.86  join_merge.I8Merge.time_i8merge('left')
-         242±1ms          209±3ms     0.86  rolling.VariableWindowMethods.time_rolling('Series', '1d', 'float', 'median')
-     4.44±0.02μs      3.81±0.02μs     0.86  period.PeriodProperties.time_property('min', 'minute')
-          42.2ms           36.1ms     0.85  eval.Query.time_query_datetime_index
-          22.9ms           19.4ms     0.85  eval.Query.time_query_datetime_column
-      2.15±0.3ms      1.77±0.04ms     0.83  frame_methods.Isnull.time_isnull
-     12.1±0.06ms       9.97±0.1ms     0.82  rolling.VariableWindowMethods.time_rolling('Series', '1d', 'int', 'std')
-           908ms            730ms     0.80  join_merge.ConcatPanels.time_c_ordered(2, True)
-      99.1±0.7ms       79.2±0.6ms     0.80  rolling.VariableWindowMethods.time_rolling('Series', '50s', 'int', 'median')
-          15.1ms       12.0±0.2ms     0.80  io.hdf.HDFStoreDataFrame.time_query_store_table
-          10.5ms           8.17ms     0.78  frame_methods.Reindex.time_reindex_axis0
-           1.05s            718ms     0.69  join_merge.ConcatPanels.time_c_ordered(2, False)
-        174±30ms         94.8±2ms     0.55  eval.Eval.time_and('numexpr', 'all')
-        17.2±2ms      8.24±0.07ms     0.48  binary_ops.Ops.time_frame_mult(True, 1)

I do not see how any of these changes are related to the list constructor, so I am not sure they are relevant.

@@ -1487,7 +1487,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(list rows, int min_width=0):
def to_object_array(rows, int min_width=0):
Member


Does this have any noticeable performance impact?

Member


Sorry, I noticed you posted ASVs before, but the results should probably come from frame_ctor.py. At first glance I didn't see anything that measures construction just from a list: can you double-check that and, if there is nothing, add a benchmark to ensure no major regression?

Contributor Author


Done.

Contributor Author


No performance impact.

list row

input_rows = <list>rows
Member


Hmm what's the point of this variable if it's never used elsewhere?

Contributor Author


It is just a type check. :-) So this will throw an exception if it is not really a list or a subclass of it. I can also use it later on.
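In pure Python terms, the effect of that cast can be sketched roughly as follows (hypothetical helper name; the real code is Cython, and per the comment above the cast raises if the object is not a list):

```python
class MyList(list):
    """A hypothetical list subclass."""

def checked_cast(rows):
    # Mimics the Cython `input_rows = <list>rows` assignment: accepts
    # list and list subclasses, raises TypeError for anything else.
    if not isinstance(rows, list):
        raise TypeError("expected list, got %s" % type(rows).__name__)
    return rows

checked_cast([1, 2])          # plain list: OK
checked_cast(MyList([1, 2]))  # subclass: OK
```

Passing, say, a tuple would raise TypeError, which is the type safety the cast provides.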

@mitar mitar force-pushed the from-list branch 2 times, most recently from 8db5eaf to 403c23d Compare May 29, 2018 04:52
@codecov

codecov bot commented May 29, 2018

Codecov Report

Merging #21238 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21238   +/-   ##
=======================================
  Coverage   92.22%   92.22%           
=======================================
  Files         162      162           
  Lines       51828    51828           
=======================================
  Hits        47798    47798           
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.62% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️


@jorisvandenbossche
Member

Can you give a timing of the benchmark that you added (I would just do it with %timeit using the same code, that's a bit easier / more reliable) before/after?
I don't expect a difference, but just to be sure.

@mitar
Contributor Author

mitar commented May 29, 2018

I ran the benchmark:

· Running 22 total benchmarks (2 commits * 1 environments * 11 benchmarks)
[  0.00%] · For pandas commit hash 403c23d4:
[  0.00%] ·· Building for virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[  4.55%] ··· Running frame_ctor.FromDicts.time_list_of_dict                                                                                                                         89.1±0.4ms
[  9.09%] ··· Running frame_ctor.FromDicts.time_nested_dict                                                                                                                             101±1ms
[ 13.64%] ··· Running frame_ctor.FromDicts.time_nested_dict_columns                                                                                                                     103±1ms
[ 18.18%] ··· Running frame_ctor.FromDicts.time_nested_dict_index                                                                                                                    70.4±0.6ms
[ 22.73%] ··· Running frame_ctor.FromDicts.time_nested_dict_index_columns                                                                                                            72.5±0.7ms
[ 27.27%] ··· Running frame_ctor.FromDicts.time_nested_dict_int64                                                                                                                     202±0.2ms
[ 31.82%] ··· Running frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets                                                                                          79.6±0.2ms;...
[ 36.36%] ··· Running frame_ctor.FromLists.time_frame_from_lists                                                                                                                     60.3±0.4ms
[ 40.91%] ··· Running frame_ctor.FromNDArray.time_frame_from_ndarray                                                                                                                  173±0.3μs
[ 45.45%] ··· Running frame_ctor.FromRecords.time_frame_from_records_generator                                                                                                      195±2ms;...
[ 50.00%] ··· Running frame_ctor.FromSeries.time_mi_series                                                                                                                            209±0.2μs
[ 50.00%] · For pandas commit hash 1c2844ac:
[ 50.00%] ·· Building for virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking virtualenv-py3.5-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 54.55%] ··· Running frame_ctor.FromDicts.time_list_of_dict                                                                                                                             88.6ms
[ 59.09%] ··· Running frame_ctor.FromDicts.time_nested_dict                                                                                                                             101±1ms
[ 63.64%] ··· Running frame_ctor.FromDicts.time_nested_dict_columns                                                                                                                     104±1ms
[ 68.18%] ··· Running frame_ctor.FromDicts.time_nested_dict_index                                                                                                                    70.7±0.6ms
[ 72.73%] ··· Running frame_ctor.FromDicts.time_nested_dict_index_columns                                                                                                            71.2±0.7ms
[ 77.27%] ··· Running frame_ctor.FromDicts.time_nested_dict_int64                                                                                                                     202±0.3ms
[ 81.82%] ··· Running frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets                                                                                          76.8±0.3ms;...
[ 86.36%] ··· Running frame_ctor.FromLists.time_frame_from_lists                                                                                                                     59.2±0.6ms
[ 90.91%] ··· Running frame_ctor.FromNDArray.time_frame_from_ndarray                                                                                                                  176±0.2μs
[ 95.45%] ··· Running frame_ctor.FromRecords.time_frame_from_records_generator                                                                                                      203±1ms;...
[100.00%] ··· Running frame_ctor.FromSeries.time_mi_series                                                                                                                            209±0.2μs
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@mitar
Copy link
Contributor Author

mitar commented May 29, 2018

@jorisvandenbossche I am not sure what you are asking. It looks like one run takes around 60 ms from the results above.

@jorisvandenbossche
Member

What I meant was doing (before and after):

In [7]: N = 1000
   ...: M = 100
   ...: data = [[j for j in range(M)] for i in range(N)]

In [8]: %timeit DataFrame(data)
31.1 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If you want to time a very specific thing, %timeit is both easier and more accurate than ASV in my experience (but ASV is still important to keep track of this in the long term / to benchmark changes that can have wide impact).
But what you posted is fine as well! Performance is not affected.

@mitar
Contributor Author

mitar commented May 29, 2018

Oh, I didn't know about %timeit.

@@ -1508,20 +1508,22 @@ def to_object_array(list rows, int min_width=0):
cdef:
Py_ssize_t i, j, n, k, tmp
ndarray[object, ndim=2] result
list input_rows
Member


Is there not a way to make the type declaration part of the call signature? Just wondering if that wouldn't be cleaner. I am also curious whether doing it in this fashion requires more memory to construct a DataFrame, which could be an issue when dealing with very large lists.

Contributor Author


I am not an expert in Cython, but this approach is what I found using Google on their mailing list. :-) So the idea is to not have the type in the signature and cast it later. (Casting still throws an exception if it is impossible.)

I do not see why this would require any more memory.

Member


It may not - I am just not sure how the casting rules are applied. Looking at the module, there appear to be other instances where the type is in the signature (see to_object_array_tuples). Seems like that would be preferable if it's an option.

Contributor Author


What do you mean? The whole issue is that if you put list in the signature, then Cython requires exactly the list type and not a subclass. So removing the type from the signature and doing the cast fixes this. In to_object_array_tuples the argument type is list, which again means that only the strict list type can be provided.

Contributor Author


I can put the object type there. It seems equivalent. This is what infer_dtype has.

@jreback
Contributor

jreback commented May 30, 2018

why is a List subclass actually useful?

@mitar
Contributor Author

mitar commented May 30, 2018

I mean, there are different places where you can get data which is a list subclass. So if you want to convert such data to a DataFrame, then why not? For example, we are using a list subclass to have additional methods on the list and additional metadata.

In any case, pandas is currently inconsistent and something has to be changed. Currently it checks with isinstance(data, list) whether it can call this method, and then calls it, which fails. So either that check is changed, or this method is changed.

I do not understand why we would not change this method. There is no performance penalty, and it allows easier interoperability with other data. In general pandas tries to be well integrated with the Python ecosystem, so if something is an instance of list in Python, this means that you can always use it as a list. But pandas currently does not support this.

So, my question is: why not? What is problematic in this pull request?
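A minimal sketch of the kind of list subclass described here (hypothetical class, stdlib only):

```python
class AnnotatedList(list):
    """Hypothetical list subclass with extra methods/metadata, as described."""
    def __init__(self, data, metadata=None):
        super().__init__(data)
        self.metadata = metadata or {}

rows = AnnotatedList([[1, 2], [3, 4]], metadata={"source": "sensor"})

# Everywhere Python checks with isinstance, this is still a list:
assert isinstance(rows, list)
assert rows[0] == [1, 2]
# With this PR, pandas.DataFrame(rows) accepts such data as well.
```

Before the fix, the isinstance check in _to_arrays routed such an object to to_object_array, which then rejected it.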

@jorisvandenbossche
Member

I am fine with this patch. It is a minor change and easy to support it.
In any case, the current error message is not really user friendly (due to the inconsistency @mitar notes between the isinstance check and the actual Cython function).

@jreback
Contributor

jreback commented May 30, 2018

The problem is that we have many, many places where we actually check for lists for various reasons, e.g. that something is not a scalar or not a tuple, and we also care whether things are array-like (IOW have a dtype) or not. So sure, this patch looks OK, but it is actually introducing more inconsistencies than it is solving.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label May 30, 2018
@mitar
Contributor Author

mitar commented May 30, 2018

Are you checking elsewhere with type(...) is or with isinstance?

@mitar
Contributor Author

mitar commented Jul 24, 2018

Any update here? As I mentioned, either there is a problem in the code path (it first checks with isinstance but later requires the exact list type), or we should fix it so that the later check also uses isinstance.

@jreback
Contributor

jreback commented Oct 11, 2018

if you want to merge master and update, we can see where this is

@jreback
Contributor

jreback commented Nov 23, 2018

closing as stale. if you'd like to continue, pls ping.

@jreback jreback closed this Nov 23, 2018
@rok
Contributor

rok commented Dec 14, 2018

closing as stale. if you'd like to continue, pls ping.

Hey @jreback, I'd like to help close this. Could you please reopen the PR so I can commit changes to it?

@rok
Contributor

rok commented Dec 14, 2018

Hey all, I think the discussion here has already concluded and we can merge?

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Dec 14, 2018
@jreback jreback added this to the 0.24.0 milestone Dec 14, 2018
Contributor

@jreback jreback left a comment


can you show the result of running those benchmarks (before and after)? also, can you run the DataFrame constructor benchmarks and report.

@@ -2208,7 +2208,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(rows: list, min_width: int=0):
def to_object_array(rows, int min_width=0):
Contributor


can you leave the type annotation

@@ -2208,7 +2208,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
return result


def to_object_array(rows: list, min_width: int=0):
def to_object_array(rows, int min_width=0):
Contributor


type rows as object

Contributor


Suggested change
def to_object_array(rows, int min_width=0):
def to_object_array(rows: object, int min_width=0):

Looks good indeed!

Contributor

@rok rok Dec 14, 2018


       before           after         ratio
     [a1cf3f65]       [040f06f7]
+     96.3±0.04ms         126±20ms     1.31  reindex.DropDuplicates.time_frame_drop_dups(True)
+     7.32±0.06μs       9.50±0.1μs     1.30  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'unique_monotonic_inc')
+      21.7±0.4μs       25.4±0.2μs     1.17  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+      62.5±0.1μs       71.6±0.7μs     1.15  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
-      9.59±0.3ms       8.45±0.2ms     0.88  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-       126±0.2μs          111±8μs     0.88  index_object.SetOperations.time_operation('datetime', 'union')
-        86.4±4ms         83.4±4μs     0.00  multiindex_object.GetLoc.time_large_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'm going to rerun for the ratio > 1 and report back.

Contributor


I've rerun twice overnight and the reported significant changes don't overlap between runs. I assume this is just the benchmarks being noisy. (Commit hashes changed because I rebased again.)

       before           after         ratio
     [d43ac979]       [8661bbb6]
     <master>         <from-list>
+        50.7±2ms         93.0±3ms     1.84  frame_methods.Iteration.time_iteritems_indexing
+     6.32±0.02μs         7.65±1μs     1.21  timestamp.TimestampOps.time_normalize(tzutc())
+     11.5±0.03μs         13.3±1μs     1.16  offset.OffestDatetimeArithmetic.time_apply(<BusinessMonthEnd>)
+     2.51±0.01μs       2.88±0.2μs     1.15  timeseries.DatetimeIndex.time_get('dst')
+     2.50±0.01μs       2.85±0.2μs     1.14  timeseries.DatetimeIndex.time_get('tz_naive')
+      11.8±0.2μs         13.4±1μs     1.13  timestamp.TimestampConstruction.time_parse_iso8601_tz
+     2.66±0.01μs       3.00±0.2μs     1.13  timeseries.DatetimeIndex.time_get('repeated')
+     20.8±0.07μs         23.1±2μs     1.11  timestamp.TimestampProperties.time_weekday_name(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-      8.40±0.3ms       7.48±0.2ms     0.89  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      9.62±0.3ms       8.52±0.3ms     0.89  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      8.85±0.4ms       7.69±0.1ms     0.87  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-      8.68±0.8μs      7.48±0.03μs     0.86  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')
-      1.13±0.2μs          896±6ns     0.79  period.PeriodProperties.time_property('min', 'day')
-      86.8±0.6ms          100±4μs     0.00  multiindex_object.GetLoc.time_large_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

       before           after         ratio
     [d43ac979]       [8661bbb6]
     <master>         <from-list>
+        82.4±1μs          104±1μs     1.26  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+     7.09±0.06μs       8.21±0.8μs     1.16  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'unique_monotonic_inc')
+      18.5±0.3μs         20.6±1μs     1.11  indexing.NonNumericSeriesIndexing.time_get_value('datetime', 'unique_monotonic_inc')
-         803±6ms          714±4ms     0.89  gil.ParallelGroupbyMethods.time_parallel(8, 'mean')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Dec 14, 2018

@WillAyd if you have any comments

@jreback jreback merged commit ca85a41 into pandas-dev:master Dec 15, 2018
@jreback
Contributor

jreback commented Dec 15, 2018

thanks @mitar

@mitar mitar deleted the from-list branch December 15, 2018 21:31
@mitar
Contributor Author

mitar commented Dec 15, 2018

Thanks @rok.

@rok
Contributor

rok commented Dec 15, 2018

Thanks all! :)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Successfully merging this pull request may close these issues.

Cannot create a DataFrame from list subclass