BUG: Support to create DataFrame from list subclasses #21238
Conversation
Hello @mitar! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 29, 2018 at 04:52 UTC
I also ran benchmarks:
I do not see how any of these changes are related to the list constructor, so I am not sure they are relevant?
pandas/_libs/src/inference.pyx (Outdated)
@@ -1487,7 +1487,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
         return result


-def to_object_array(list rows, int min_width=0):
+def to_object_array(rows, int min_width=0):
Does this have any noticeable performance impact?
Sorry, I noticed you posted ASVs before, but the results should probably come from frame_ctor.py. On initial glance I didn't see anything that measures construction just from a list - can you double check that and, if there isn't one, add a benchmark to ensure there is no major regression?
Done.
No performance impact.
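For context, the benchmark added in this PR is not reproduced in the thread. A hypothetical sketch of a frame_ctor.py-style ASV benchmark covering construction from a plain list and from a list subclass might look like the following (the class and attribute names are illustrative, not the actual benchmark added here):

import pandas as pd


class MyList(list):
    """Minimal list subclass, used only to exercise the subclass code path."""


class FromListOfLists:
    # asv calls setup() before timing each time_* method.
    def setup(self):
        self.data = [[1, 2, 3, 4]] * 10000
        self.subclass_data = MyList(self.data)

    def time_frame_from_list(self):
        pd.DataFrame(self.data)

    def time_frame_from_list_subclass(self):
        pd.DataFrame(self.subclass_data)

Running it on master and on this branch on the same machine would confirm there is no regression.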
pandas/_libs/src/inference.pyx (Outdated)
        list row

    input_rows = <list>rows
Hmm what's the point of this variable if it's never used elsewhere?
It is just for the type check. :-) The cast throws an exception if the argument is not really a list or a list subclass. I can also use it later on.
Codecov Report

@@           Coverage Diff            @@
##           master   #21238   +/-   ##
=======================================
  Coverage   92.22%   92.22%
=======================================
  Files         162      162
  Lines       51828    51828
=======================================
  Hits        47798    47798
  Misses       4030     4030
Can you give a timing of the benchmark that you added? (I would just do it with …)
I ran the benchmark:
@jorisvandenbossche I am not sure what you are asking. It looks like one run takes around 60 ms, from the results above.
What I meant was doing (before and after): … as if you want to time a very specific thing, …
Oh, I didn't know about …
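The specific command referenced in this exchange was lost in extraction. Purely as an illustration, timing just the DataFrame-from-list constructor with Python's standard timeit module (sizes and names here are arbitrary) could look like this:

import timeit

import pandas as pd

data = [[1, 2, 3]] * 10000

# Time only DataFrame construction from a list of lists; run this on master
# and on the branch and compare the per-call numbers.
per_call = timeit.timeit(lambda: pd.DataFrame(data), number=100) / 100
print(round(per_call * 1000, 2), "ms per construction")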
pandas/_libs/src/inference.pyx (Outdated)
@@ -1508,20 +1508,22 @@ def to_object_array(list rows, int min_width=0):
    cdef:
        Py_ssize_t i, j, n, k, tmp
        ndarray[object, ndim=2] result
        list input_rows
Is there not a way to make the type declaration part of the call signature? Just wondering if that wouldn't be cleaner. Also curious whether doing it in this fashion requires more memory to construct a DataFrame, which may be an issue when dealing with very large lists.
I am not an expert in Cython, but this approach is what I found using Google on their mailing list. :-) So: not having it in the signature, and casting it later. (Casting still throws an exception if it is impossible.) I do not see why this would require any more memory?
May not - just not sure how the casting rules are applied. Looking at the module, there appear to be other instances where the type is in the signature (see to_object_array_tuples). Seems like that would be preferable if it's an option.
What do you mean? The whole issue is that if you put list in the signature, then Cython requires exactly the list type and not a subclass. So removing it from the signature and doing the cast fixes this. In to_object_array_tuples the argument type is list, which again means that only the strict list type can be provided.
I can put the object type there. It seems equivalent. This is what infer_dtype has.
Why is a list subclass actually useful?
I mean, there are different places where you can get data which is a list subclass. So if you want to convert such data to a DataFrame, then why not? For example, we are using a list subclass to have additional methods on the list and additional metadata. In any case, Pandas is currently inconsistent and something has to be changed: currently it checks with isinstance, but later on requires exactly the list type. I do not understand why we would not change this method? There is no performance penalty, and it allows easier interoperability with other data. In general Pandas tries to be well integrated with the Python ecosystem, so if something is an instance of list in Python, this means that you can always use it as a list. But Pandas currently does not support this. So, my question is, why not? What is problematic in this pull request?
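As a minimal illustration of the inconsistency described above (MyList is a hypothetical subclass, not anything in pandas): isinstance sends the data down the list code path, but before this change the Cython-level to_object_array rejected anything that was not exactly a list.

import pandas as pd


class MyList(list):
    """A list subclass that could carry extra methods or metadata."""


rows = MyList([[1, 2], [3, 4]])

# isinstance(rows, list) is True, so pandas treats this as list-of-lists data.
# Before this change, to_object_array then raised a TypeError because rows is
# not exactly a list; with the relaxed signature this behaves like a plain list.
df = pd.DataFrame(rows, columns=["a", "b"])
print(df)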
I am fine with this patch. It is a minor change and it is easy to support.
The problem is that we have many, many places where we actually check for lists for various reasons, e.g. that something is not a scalar and not a tuple, and we also care whether things are array-like (IOW have a dtype) or not. So sure, this patch looks ok, but it is actually introducing more inconsistencies than it is trying to solve.
Are you checking elsewhere with …?
Any update here? As I mentioned, I think either there is a problem in the code path, in that it first checks with isinstance and later on requires the fixed list type, or we should fix it so that later on it also checks with isinstance.
If you want to merge master and update, we can see where this is.
Closing as stale. If you'd like to continue, pls ping.
Hey @jreback, I'd like to help close this. Could you please reopen the PR so I can commit changes to it?
Hey all, I think the discussion here was already completed and we can merge?
Can you show the result of running that benchmark (before and after)? Also, can you run the DataFrame constructor benchmarks and report?
pandas/_libs/lib.pyx (Outdated)
@@ -2208,7 +2208,7 @@ def map_infer(ndarray arr, object f, bint convert=1):
         return result


-def to_object_array(rows: list, min_width: int=0):
+def to_object_array(rows, int min_width=0):
Can you leave the type annotation?
pandas/_libs/lib.pyx (Outdated, same hunk as above)
type rows as object
-def to_object_array(rows, int min_width=0):
+def to_object_array(rows: object, int min_width=0):
Looks good indeed!
before after ratio
[a1cf3f65] [040f06f7]
+ 96.3±0.04ms 126±20ms 1.31 reindex.DropDuplicates.time_frame_drop_dups(True)
+ 7.32±0.06μs 9.50±0.1μs 1.30 indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'unique_monotonic_inc')
+ 21.7±0.4μs 25.4±0.2μs 1.17 indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+ 62.5±0.1μs 71.6±0.7μs 1.15 indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
- 9.59±0.3ms 8.45±0.2ms 0.88 indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
- 126±0.2μs 111±8μs 0.88 index_object.SetOperations.time_operation('datetime', 'union')
- 86.4±4ms 83.4±4μs 0.00 multiindex_object.GetLoc.time_large_get_loc
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
I'm going to rerun the benchmarks with ratio > 1 and report back.
I've rerun twice overnight and the reported significant changes don't overlap between runs. I assume this is just the benchmarks being noisy. (Commit hashes changed because I rebased again.)
before after ratio
[d43ac979] [8661bbb6]
<master> <from-list>
+ 50.7±2ms 93.0±3ms 1.84 frame_methods.Iteration.time_iteritems_indexing
+ 6.32±0.02μs 7.65±1μs 1.21 timestamp.TimestampOps.time_normalize(tzutc())
+ 11.5±0.03μs 13.3±1μs 1.16 offset.OffestDatetimeArithmetic.time_apply(<BusinessMonthEnd>)
+ 2.51±0.01μs 2.88±0.2μs 1.15 timeseries.DatetimeIndex.time_get('dst')
+ 2.50±0.01μs 2.85±0.2μs 1.14 timeseries.DatetimeIndex.time_get('tz_naive')
+ 11.8±0.2μs 13.4±1μs 1.13 timestamp.TimestampConstruction.time_parse_iso8601_tz
+ 2.66±0.01μs 3.00±0.2μs 1.13 timeseries.DatetimeIndex.time_get('repeated')
+ 20.8±0.07μs 23.1±2μs 1.11 timestamp.TimestampProperties.time_weekday_name(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
- 8.40±0.3ms 7.48±0.2ms 0.89 indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
- 9.62±0.3ms 8.52±0.3ms 0.89 indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
- 8.85±0.4ms 7.69±0.1ms 0.87 indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
- 8.68±0.8μs 7.48±0.03μs 0.86 indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')
- 1.13±0.2μs 896±6ns 0.79 period.PeriodProperties.time_property('min', 'day')
- 86.8±0.6ms 100±4μs 0.00 multiindex_object.GetLoc.time_large_get_loc
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
before after ratio
[d43ac979] [8661bbb6]
<master> <from-list>
+ 82.4±1μs 104±1μs 1.26 indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+ 7.09±0.06μs 8.21±0.8μs 1.16 indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'unique_monotonic_inc')
+ 18.5±0.3μs 20.6±1μs 1.11 indexing.NonNumericSeriesIndexing.time_get_value('datetime', 'unique_monotonic_inc')
- 803±6ms 714±4ms 0.89 gil.ParallelGroupbyMethods.time_parallel(8, 'mean')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
@WillAyd if you have any comments.
thanks @mitar
Thanks @rok.
Thanks all! :)
git diff upstream/master -u -- "*.py" | flake8 --diff
An alternative fix could be that, instead of doing the isinstance check in the frame's _to_arrays method (…), we would do (…). This would then assure that to_object_array is called really just with exact lists. But currently there is a mismatch: the if check accepts subtypes, while to_object_array does not support them. I think it is better to support them, so this merge request adds support for subtypes/subclasses of list.
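For reference, a small sketch of the two kinds of check being contrasted here (MyList is just an illustrative subclass):

class MyList(list):
    pass


values = MyList([[1, 2], [3, 4]])

print(isinstance(values, list))  # True  -> the current check accepts subclasses
print(type(values) is list)      # False -> an exact-type check would reject them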