
BUG: fix construction of Series from dict with nested lists #18626


Closed

Conversation

toobaz
Member

@toobaz toobaz commented Dec 4, 2017

@@ -3208,7 +3208,11 @@ def _try_cast(arr, take_fast_path):
return subarr

elif isinstance(data, (list, tuple)) and len(data) > 0:
    if dtype is not None:
    if all(is_list_like(item) for item in data):
Contributor

not sure this is the appropriate place for this.

Member Author

(let me try to persuade you that other things were out of place!)

Member

For performance, can't we first try the _try_cast and only if that fails / does not give an appropriate sized subarr do this check of all lists?

Member Author

Not sure this would be simple to do... but anyway, notice that

  • if data is "sane" (e.g. first element is not a list), then that line has a very small performance cost, since it is lazily evaluated
  • while if data is a list of lists, then going through _try_cast could have a performance cost (but most importantly from my point of view, behavior would be more complex to analyze)
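The laziness point is easy to verify: `all()` over a generator stops at the first non-list-like element, so for "sane" data the check costs a single `is_list_like` call. A standalone sketch (not code from the PR):

```python
from pandas.api.types import is_list_like

# a large list whose first element is a scalar: the check bails out
# immediately, regardless of how long the list is
data = [1] + [[0, 0]] * 1_000_000
result = all(is_list_like(item) for item in data)
print(result)  # False, after inspecting only the first element
```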

Member

Just tested:

In [79]: %timeit np.array(data)
6.17 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [80]: %timeit all(pd.api.types.is_list_like(item) for item in data)
2.2 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and the time for such checking is actually negligible compared to the time that the actual conversion to array takes. So indeed not needed for performance reasons.

(if that had not been the case, I think keeping the 'normal' list -> series case fast would be a higher priority than the possible slowdown in converting lists of lists)
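The comparison can be reproduced with the stdlib `timeit` module; the sizes and repeat counts below are illustrative, not the ones used in the measurement above:

```python
import timeit

import numpy as np
from pandas.api.types import is_list_like

data = [list(range(10))] * 50_000

# time the actual conversion to an object ndarray ...
t_convert = timeit.timeit(lambda: np.array(data, dtype=object), number=5)
# ... versus the "is every item list-like?" scan
t_check = timeit.timeit(lambda: all(is_list_like(x) for x in data), number=5)

# the scan is far cheaper than the conversion it guards
print(t_check < t_convert)
```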

Member Author

(if that had not been the case, I think keeping the 'normal' list -> series case fast would be a higher priority than the possible slowdown in converting lists of lists)

I totally agree

Member Author

By the way: if/when np.array grows an ndmax argument, it will allow us to speed up the array creation further.
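Note that `np.array` has no `ndmax` today (the comment is about a hypothetical future argument); the existing `ndmin` parameter is the one-sided counterpart, shown here for context:

```python
import numpy as np

# ndmin pads leading dimensions; a hypothetical ndmax would instead cap
# how deeply np.array recurses into nested sequences
a = np.array([1, 2, 3], ndmin=2)
print(a.shape)  # (1, 3)
```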

Contributor

I would like to move this logic to _init_listlike or something similar.

@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from a45ce6c to c5015c1 on December 4, 2017 13:52
toobaz added a commit to toobaz/pandas that referenced this pull request Dec 4, 2017
@codecov

codecov bot commented Dec 4, 2017

Codecov Report

Merging #18626 into master will decrease coverage by 0.17%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18626      +/-   ##
==========================================
- Coverage   91.59%   91.42%   -0.18%     
==========================================
  Files         155      157       +2     
  Lines       51255    51457     +202     
==========================================
+ Hits        46949    47046      +97     
- Misses       4306     4411     +105
Flag Coverage Δ
#multiple 89.29% <100%> (-0.15%) ⬇️
#single 40.6% <75%> (-0.18%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 94.82% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/excel.py 80.39% <0%> (-9.74%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.82%) ⬇️
pandas/compat/__init__.py 58.18% <0%> (-0.59%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/core/sparse/api.py 100% <0%> (ø) ⬆️
pandas/compat/openpyxl_compat.py 90.9% <0%> (ø)
pandas/core/sparse/list.py 97.1% <0%> (ø)
pandas/core/tools/datetimes.py 84.48% <0%> (+0.05%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c903d5...c5015c1. Read the comment docs.

@codecov

codecov bot commented Dec 4, 2017

Codecov Report

Merging #18626 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18626      +/-   ##
==========================================
+ Coverage   91.82%   91.85%   +0.03%     
==========================================
  Files         152      152              
  Lines       49255    49252       -3     
==========================================
+ Hits        45226    45239      +13     
+ Misses       4029     4013      -16
Flag Coverage Δ
#multiple 90.23% <100%> (+0.03%) ⬆️
#single 41.91% <90%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 94.07% <100%> (+0.17%) ⬆️
pandas/util/testing.py 84.73% <0%> (+0.2%) ⬆️
pandas/plotting/_converter.py 66.81% <0%> (+1.73%) ⬆️

Powered by Codecov. Last update 4efb39f...14f0443. Read the comment docs.

@toobaz
Member Author

toobaz commented Dec 4, 2017

The new version scans nested types once rather than twice, but there is still room for improvement, which I will push tomorrow.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 5, 2017
@@ -3177,7 +3177,8 @@ def _try_cast(arr, take_fast_path):

# GH #846
if isinstance(data, (np.ndarray, Index, Series)):

    if data.ndim > 1:
Contributor

this should actually error on data.ndim != 1 (e.g. I think a numpy scalar of ndim==0 may hit this path)

Contributor

can you add tests to cover this case

Member

Shouldn't we allow numpy scalars? (just like normal scalars work, eg like Series(1))

Member Author

Sure we do! And this is widely tested (as opposed to initialization with length-1 list-likes, which worked by accident, was accidentally used in the tests, and which I have now disabled).
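A minimal check of the numpy-scalar path (current pandas behavior, matching what `Series(1, index=...)` does for a plain Python scalar):

```python
import numpy as np
import pandas as pd

# a numpy scalar broadcasts over the index just like a Python scalar
s = pd.Series(np.float64(1.5), index=range(3))
print(s.tolist())  # [1.5, 1.5, 1.5]
```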

@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from c5015c1 to d81ce3a on December 5, 2017 10:55
toobaz added a commit to toobaz/pandas that referenced this pull request Dec 5, 2017
@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from d81ce3a to 16e21fd on December 5, 2017 10:59
toobaz added a commit to toobaz/pandas that referenced this pull request Dec 5, 2017
@@ -462,7 +462,7 @@ def test_preserve_categorical_dtype(self):
cidx = pd.CategoricalIndex(list("xyz"), ordered=ordered)
midx = pd.MultiIndex(levels=[['a'], cidx],
                     labels=[[0, 0], [0, 1]])
df = DataFrame([[10, 11]], index=midx)
df = DataFrame([[10, 11]] * 2, index=midx, columns=[0, 1])
Member Author

@toobaz toobaz Dec 5, 2017

This columns=[0,1] is a typo. Waiting for any comment on the rest, then I will remove it.

@@ -3208,7 +3208,11 @@ def _try_cast(arr, take_fast_path):
return subarr

elif isinstance(data, (list, tuple)) and len(data) > 0:
    if dtype is not None:
    if all(is_list_like(item) for item in data):
Contributor

I would like to move this logic to _init_listlike or something similar.

value = data
if subarr.ndim == 0 or is_scalar(data):
    if index is None:
        return subarr.item()
Contributor

this is very odd, are you actually returning a scalar on construction?

Member Author

I have no idea what could possibly hit this (i.e. when index is None), and it does look broken (e.g. BlockManager won't accept it), so I removed it; but if you want to stay on the safe side, just discard the last commit I pushed.


if subarr.ndim == 1:
# a scalar upcasted to 1-dimensional by maybe_cast_to_datetime()
Contributor

so would much rather fix this case inside maybe_cast_to_datetime

Member Author

My (limited) understanding is that it is impossible, basically because

def broken_datetimeindex_roundtrip(dtix):
    scalar = dtix.values[0, ...]
    return pd.DatetimeIndex(np.array([scalar], dtype=scalar.dtype))

loses tz and freq.
This in turn is probably not trivial at all to fix (or, more honestly: I have no idea).
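The loss is easy to demonstrate; a standalone sketch of the roundtrip described above (illustrative dates and tz):

```python
import numpy as np
import pandas as pd

dtix = pd.date_range('2017-12-04', periods=3, freq='D', tz='Europe/Brussels')

# a plain numpy datetime64 scalar carries neither tz nor freq
scalar = dtix.values[0]
roundtrip = pd.DatetimeIndex(np.array([scalar], dtype=scalar.dtype))

print(roundtrip.tz, roundtrip.freq)  # None None: both attributes were lost
```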

Member Author

loses tz and freq.

Uhm... it probably makes perfect sense that it loses freq (but not tz).

@@ -3066,8 +3066,7 @@ def test_nan_stays_float(self):
assert pd.isna(idx0.get_level_values(1)).all()
# the following failed in 0.14.1
assert pd.isna(idxm.get_level_values(1)[:-1]).all()

df0 = pd.DataFrame([[1, 2]], index=idx0)
Contributor

what is the issue here?

Member Author

the (outer) list has a different length than the index

@@ -651,6 +656,17 @@ def test_constructor_dict(self):
expected.iloc[1] = 1
assert_series_equal(result, expected)

@pytest.mark.parametrize('input_class', [list, tuple])
Contributor

I believe we also accept a duck-typed list class that is not a generator here? (IOW an iterable).

@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from 16e21fd to 707abf2 on December 6, 2017 10:12
toobaz added a commit to toobaz/pandas that referenced this pull request Dec 6, 2017
@@ -381,25 +401,17 @@ def _asarray_tuplesafe(values, dtype=None):
return values.values

if isinstance(values, list) and dtype in [np.object_, object]:
    return lib.list_to_object_array(values)
    return _as_1d_array(values)
Member Author

Notice this is more efficient than the cython version.

@jreback
Contributor

jreback commented Dec 6, 2017

can u run the series construction asv for this
we may not have enough coverage of the various cases as well

@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from 707abf2 to b5225ee on December 6, 2017 13:59
toobaz added a commit to toobaz/pandas that referenced this pull request Dec 6, 2017
@toobaz
Member Author

toobaz commented Dec 6, 2017

can u run the series construction asv for this

       before           after         ratio
     [fdba1333]       [b5225ee5]
-         263±4ms        235±0.6ms     0.89  series_methods.series_constructor_no_data_datetime_index.time_series_constructor_no_data_datetime_index
-     1.67±0.02ms      1.41±0.02ms     0.84  series_methods.series_value_counts.time_value_counts_int64

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

(but they are borderline, they don't appear on every run)

@jreback
Contributor

jreback commented Dec 6, 2017

can u show the series constructions ones
i don’t think we cover more than the basic cases

@toobaz
Member Author

toobaz commented Dec 6, 2017

can u show the series constructions ones

Sorry, I had missed ctors.py, here is the result:

     [fdba1333]       [b5225ee5]
-     13.8±0.07μs      12.4±0.08μs     0.90  ctors.Constructors.time_series_from_ndarray
-     11.0±0.03μs      9.92±0.05μs     0.90  ctors.Constructors.time_dtindex_from_series
-        25.2±3μs      22.3±0.05μs     0.88  ctors.Constructors.time_dtindex_from_index_with_series

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

These are from frame_ctor.py instead:

       before           after         ratio
     [fdba1333]       [b5225ee5]
-     3.98±0.02ms      3.61±0.01ms     0.91  frame_ctor.FromDicts.time_series_ctor_from_dict
-      76.0±0.4ms       68.3±0.2ms     0.90  frame_ctor.FromDicts.time_frame_ctor_nested_dict_int64
-      33.6±0.2ms       30.1±0.1ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('QuarterBegin', 1)
-      30.5±0.4ms       27.3±0.2ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CDay', 2)
-      30.4±0.2ms      27.1±0.08ms     0.89  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessDay', 2)
-      30.8±0.1ms       27.4±0.1ms     0.89  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BusinessDay', 1)
-      30.6±0.3ms      27.1±0.07ms     0.89  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CDay', 1)
-      30.7±0.2ms      27.0±0.07ms     0.88  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
-      31.3±0.2ms       27.5±0.3ms     0.88  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 2)
-      32.0±0.4ms       28.0±0.5ms     0.87  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CBMonthEnd', 1)
-      32.2±0.5ms       27.8±0.3ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Dec 7, 2017

those benchmarks are mostly not on point. e.g. do we have ones which test list-of-list, dict-of-list, dict-of-array, dict-of-series etc. for frame

list, tuple, dict, array, Index, Series for series

@@ -372,6 +372,26 @@ def intersection(*seqs):
return type(seqs[0])(list(result))


def _as_1d_array(values, dtype='object'):
Contributor

move to pandas/core/dtypes/cast.py (at the bottom); rename in that style as well.

can move _asarray_tuplesafe as well (here or future PR)

Returns
-------
1-dimensional numpy array
"""
# Making a 1D array that safely contains list-likes is a bit tricky
Contributor

explain why if you are leaving a comment like this


result = np.asarray(values, dtype=dtype)

if issubclass(result.dtype.type, compat.string_types):
    result = np.asarray(values, dtype=object)

if result.ndim == 2:
    if isinstance(values, list):
        return lib.list_to_object_array(values)
Contributor

so is seems like list basically replaces lib.list_to_object_array if that is so, then we should simply do that (it can live in pandas.core.dtypes.cast as well).

if hasattr(values, '__array__'):
    values = [tuple(x) for x in values]
result = _as_1d_array(values)

Member Author

Yes, _as_1d_array can entirely replace lib.list_to_object_array. My only doubt is that the latter is used in some cython files... I have no idea if in those cases using a cython function has advantages.

Member Author

Sorry, I had missed the second part of your comment. And I don't understand it (isn't that what I'm doing?)

@@ -3212,7 +3214,9 @@ def _try_cast(arr, take_fast_path):
return subarr

elif isinstance(data, (list, tuple)) and len(data) > 0:
    if dtype is not None:
    if all(is_list_like(item) for item in data):
        subarr = _as_1d_array(data)
Contributor

see comment above

if subarr.ndim == 0 or is_scalar(data):
    if subarr.ndim == 1:
        # a scalar upcasted to 1-dimensional by maybe_cast_to_datetime()
        value = subarr[0]
Contributor

how is this branch ever hit?

Member Author

pd.Series(pd.Timestamp('2015-10-10'), index=range(3))
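For reference, a sketch of what that construction produces in current pandas (the Timestamp scalar is cast to datetime64 and broadcast over the index):

```python
import pandas as pd

# a Timestamp scalar plus an explicit index: broadcast to a datetime Series
s = pd.Series(pd.Timestamp('2015-10-10'), index=range(3))
print(len(s), s.dtype)  # 3 datetime64[ns]
```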

@toobaz
Member Author

toobaz commented Dec 7, 2017

those benchmarks are mostly not on point. e.g. do we have ones which test

For frames there's the entire frame_ctor.py, but for Series all I could find was series_constructor_dict_data_datetime_index in series_methods.py and time_series_from_ndarray in ctors.py. They just didn't exhibit any significant change.

@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from b5225ee to f56134b on December 7, 2017 07:23
@jorisvandenbossche
Member

For the change to length-1 lists (previously broadcast, now an error), I would do this as a deprecation instead of a hard breaking change. Thoughts?
(it is quite easy to do for us I think, and if even Tom writes code using that .. :-))

@toobaz
Member Author

toobaz commented Dec 7, 2017

(it is quite easy to do for us I think, and if even Tom writes code using that .. :-))

You mean the tests? They look to me like accidental uses of this "feature" - after all, just passing the scalar is simpler.

Anyway, I'm a bit against deprecation in this case because the docs are quite clear, but sure, it can be done.

@jorisvandenbossche
Member

They look to me like accidental uses of this "feature" - after all, just passing the scalar is simpler.

Yes, accidental, but many people may have done it "accidentally", and since it worked perfectly there was no incentive to fix their code.

@jreback
Contributor

jreback commented Dec 8, 2017

For frames there's the entire frame_ctor.py, but for Series all I could find was series_constructor_dict_data_datetime_index in series_methods.py and time_series_from_ndarray in ctors.py. They just didn't exhibit any significant change.

ok, as indicated above, pls add some.
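A sketch of the kind of asv constructor benchmarks being requested here — class and method names are hypothetical, following asv's timing-benchmark conventions, and not code from the PR:

```python
import numpy as np
import pandas as pd


class SeriesConstructors:
    """Hypothetical asv benchmarks covering several Series input types."""

    def setup(self):
        n = 10_000
        self.data_list = list(range(n))
        self.data_list_of_lists = [[i, i] for i in range(n)]
        self.data_dict = dict(zip(range(n), range(n)))
        self.data_arr = np.arange(n)

    def time_from_list(self):
        pd.Series(self.data_list)

    def time_from_list_of_lists(self):
        pd.Series(self.data_list_of_lists)

    def time_from_dict(self):
        pd.Series(self.data_dict)

    def time_from_ndarray(self):
        pd.Series(self.data_arr)
```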

@@ -1162,3 +1162,23 @@ def construct_1d_arraylike_from_scalar(value, length, dtype):
    subarr.fill(value)

    return subarr


def cast_to_1d_array(values, dtype='object'):
Contributor

rename like the above, e.g.

construct_1d_array_from_listlike


Returns
-------
1-dimensional numpy array
Contributor

of object type

1-dimensional numpy array
"""
# numpy will try to interpret nested lists as further dimensions, hence
# making a 1D array that contains list-likes is a bit tricky:
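The gotcha that comment refers to, as a plain-numpy illustration (the element-wise fill loop is one standard workaround, not necessarily the PR's implementation):

```python
import numpy as np

data = [[1, 2], [3, 4], [5, 6]]

# naive conversion: numpy interprets the nesting as a second dimension
print(np.asarray(data, dtype=object).shape)  # (3, 2)

# workaround: pre-allocate a 1-D object array and fill it element-wise,
# so each cell holds a list instead of being expanded into a row
arr = np.empty(len(data), dtype=object)
for i, item in enumerate(data):
    arr[i] = item
print(arr.shape)  # (3,)
```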
Contributor

can you add some tests for this, e.g. cases would be iterables, including a list of strings with one or more np.nan elements, and throw in some Nones

@jreback
Contributor

jreback commented Jan 21, 2018

rebase and will have a look

@jorisvandenbossche
Member

@toobaz status of this?

@toobaz
Member Author

toobaz commented Feb 9, 2018

@toobaz status of this?

I'll rebase on #18600 when merged

@toobaz toobaz force-pushed the series_of_lists_of_lists_18625 branch from f56134b to 14f0443 on April 2, 2018 10:02
@toobaz
Member Author

toobaz commented Apr 2, 2018

Rebased.

We need to decide what to do about the "length 1 = scalar" thing. I thought we could just break it, since it was never documented, is conceptually wrong, and supporting it complicates our code. Joris suggests a deprecation cycle instead. Could a compromise make sense, in which we remove support but also insert (for a couple of releases) an informative warning saying "broadcasting [...] was never officially supported, but it worked accidentally; support was dropped in version 0.23, replace EXAMPLE_CODE with FIXED_EXAMPLE"?

@jreback
Contributor

jreback commented Apr 2, 2018

we already changed the length 1 = scalar iirc

are you suggesting we should revert and deprecate instead?

@toobaz
Member Author

toobaz commented Apr 2, 2018

we already changed the length 1 = scalar iirc

Sorry, I got confused. True, we have disabled pd.Series([1], index=range(3)). What I should have referred to is pd.DataFrame([[1]], index=range(3)), which still works and which this PR disables.

However, the thing is slightly more complicated than I thought. My idea was that (in analogy with the Series case) pd.DataFrame([[1]], index=range(3)) should have been replaced with pd.DataFrame([1], index=range(3)) (as in "broadcast if you have a lower dimensional value"). However, that won't work, because [1] gets aligned vertically. And by the way, this "feature" is also broken, since pd.DataFrame([1, 2, 3], index=range(3)) works (but could simply be written as pd.DataFrame([[1], [2], [3]], index=range(3))), while pd.DataFrame([1, 2, 3], columns=20) (which would be the useful case) doesn't.

So, ideal world: if a lower dim object obj is passed as data, it is interpreted as [obj] * len(index), both in Series and in DataFrame.

Acceptable world: Series remain as they are, but a 1d list obj passed as data for a DataFrame is interpreted as [[i]*len(columns) for i in obj].

If the ideal world is too disruptive (backward incompatible), we can at least aim for the acceptable one. In this case, I'm no more sure we want to drop support for pd.DataFrame([[1]], index=range(3)).
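For comparison, the "interpreted as [obj] * len(index)" rule already holds for true scalars in Series, a sketch of the equivalence the comment appeals to:

```python
import pandas as pd

# a true scalar is broadcast over the index today...
s = pd.Series(1, index=range(3))
# ...equivalently to passing the explicitly repeated data
s_explicit = pd.Series([1] * 3, index=range(3))
print(s.equals(s_explicit))  # True
```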

@toobaz
Member Author

toobaz commented Apr 27, 2018

However, the thing is slightly more complicated than I thought.

Answered myself in #20837. Will update this PR accordingly (not in time for 0.23.0).

@jreback
Contributor

jreback commented Sep 25, 2018

can you rebase & move note to 0.24

@jreback
Contributor

jreback commented Nov 23, 2018

can you rebase

@jreback
Contributor

jreback commented Dec 14, 2018

closing as stale, but pls reopen if can continue

@jreback jreback closed this Dec 14, 2018
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Successfully merging this pull request may close these issues.

Initializing Series with dict, list of lists values and explicit dtype raises ValueError
3 participants