BUG: fix construction of Series from dict with nested lists #18626

Closed
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -761,6 +761,7 @@ Datetimelike API Changes
 - ``pandas.tseries.frequencies.get_freq_group()`` and ``pandas.tseries.frequencies.DAYS`` are removed from the public API (:issue:`18034`)
 - :func:`Series.truncate` and :func:`DataFrame.truncate` will raise a ``ValueError`` if the index is not sorted instead of an unhelpful ``KeyError`` (:issue:`17935`)
 - Restricted ``DateOffset`` keyword arguments. Previously, ``DateOffset`` subclasses allowed arbitrary keyword arguments which could lead to unexpected behavior. Now, only valid arguments will be accepted. (:issue:`17176`, :issue:`18226`).
+- Construction of :class:`Series` from a list of length 1 paired with an index of length > 1, which used to interpret the list as a scalar, now raises a ``ValueError`` (:issue:`18626`).
 - Subtracting ``NaT`` from a :class:`Series` with ``dtype='datetime64[ns]'`` returns a ``Series`` with ``dtype='timedelta64[ns]'`` instead of ``dtype='datetime64[ns]'`` (:issue:`18808`)
 - Operations between a :class:`Series` with ``dtype='datetime64[ns]'`` and a :class:`PeriodIndex` will correctly raise ``TypeError`` (:issue:`18850`)
 - Subtraction of :class:`Series` with timezone-aware ``dtype='datetime64[ns]'`` with mismatched timezones will raise ``TypeError`` instead of ``ValueError`` (:issue:`18817`)
@@ -1135,6 +1136,7 @@ Reshaping
 ^^^^^^^^^
 
 - Bug in :func:`DataFrame.stack` which fails when trying to sort mixed-type levels under Python 3 (:issue:`18310`)
+- Fixed construction of a :class:`Series` from a ``dict`` containing nested lists as values (:issue:`18625`)
 - Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
 - Bug in :func:`DataFrame.unstack` which raises an error if ``index`` is a ``MultiIndex`` with unused labels on the unstacked level (:issue:`18562`)
 - Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
39 changes: 12 additions & 27 deletions pandas/core/series.py
@@ -4048,7 +4048,8 @@ def _try_cast(arr, take_fast_path):
 
     # GH #846
     if isinstance(data, (np.ndarray, Index, Series)):
-
+        if data.ndim > 1:
Contributor: actually should error on data.ndim != 1 (e.g. a numpy scalar of ndim==0 I think hits this path, maybe)

Contributor: can you add tests to cover this case?

Member: Shouldn't we allow numpy scalars? (just like normal scalars work, e.g. Series(1))

Member Author: Sure we do! And this is widely tested (as opposed to initialization with length-1 list-likes, which worked by accident, was accidentally relied on in the tests, and which I have now disabled).
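To make the distinction concrete — an editorial sketch of the semantics under discussion, not code from the PR (the exact error text may vary by version):

import numpy as np
import pandas as pd

# Plain scalars and numpy scalars are both broadcast over the index.
s1 = pd.Series(1, index=range(3))            # -> 1, 1, 1
s2 = pd.Series(np.int64(1), index=range(3))  # numpy scalar: same result
assert s1.equals(s2)

# A length-1 list is list-like, not a scalar: paired with a longer
# index it now raises instead of being silently treated as a scalar.
try:
    pd.Series([1], index=range(3))
except ValueError as exc:
    print(exc)  # length of values does not match length of index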

+            raise ValueError('Data must be 1-dimensional')
         if dtype is not None:
             subarr = np.array(data, copy=False)

@@ -4085,7 +4086,9 @@ def _try_cast(arr, take_fast_path):
         return subarr
 
     elif isinstance(data, (list, tuple)) and len(data) > 0:
-        if dtype is not None:
+        if all(is_list_like(item) for item in data):
Contributor: not sure this is the appropriate place for this.

Member Author: (let me try to persuade you that other things were out of place!)

Member: For performance, can't we first try the _try_cast and, only if that fails or does not give an appropriately sized subarr, do this check of all lists?

Member Author: Not sure this would be simple to do... but anyway, notice that

  • if data is "sane" (e.g. the first element is not a list), then that line has a very small performance cost, since it is lazily evaluated
  • while if data is a list of lists, then going through _try_cast could have a performance cost (but most importantly, from my point of view, the behavior would be more complex to analyze)
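The laziness is easy to verify — a sketch using the public pandas.api.types helper (assumed equivalent to the internal is_list_like used in the diff):

from pandas.api.types import is_list_like

data = [0] + [[1, 2]] * 10**6  # first element is not list-like

# all(...) over a generator short-circuits at the first non-list-like
# element, so the check is O(1) for typical flat data.
assert not all(is_list_like(item) for item in data)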

Member: Just tested:

In [79]: %timeit np.array(data)
6.17 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [80]: %timeit all(pd.api.types.is_list_like(item) for item in data)
2.2 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and the time for such a check is actually negligible compared to the time the actual conversion to an array takes. So it is indeed not needed for performance reasons.

(had that not been the case, I think keeping 'normal' list -> Series fast would be a higher priority than a possible slowdown in converting lists of lists)

Member Author:

> (had that not been the case, I think keeping 'normal' list -> Series fast would be a higher priority than a possible slowdown in converting lists of lists)

I totally agree

Member Author: By the way: if/when np.array gains an ndmax argument, it will allow us to speed up the array creation further.

Contributor: I would like to move this logic to _init_listlike or something similar.

+            subarr = construct_1d_object_array_from_listlike(data)
+        elif dtype is not None:
             try:
                 subarr = _try_cast(data, False)
             except Exception:
@@ -4107,11 +4110,12 @@ def _try_cast(arr, take_fast_path):
     else:
         subarr = _try_cast(data, False)
 
-    # scalar like, GH
-    if getattr(subarr, 'ndim', 0) == 0:
-        if isinstance(data, list):  # pragma: no cover
-            subarr = np.array(data, dtype=object)
-        elif index is not None:
+    if subarr.ndim == 0 or is_scalar(data):
+        if subarr.ndim == 1:
+            # a scalar upcasted to 1-dimensional by maybe_cast_to_datetime()
+            value = subarr[0]
Contributor: how is this branch ever hit?

Member Author: pd.Series(pd.Timestamp('2015-10-10'), index=range(3))
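A sketch of that case (editorial illustration): per the inline comment, the Timestamp is upcast by maybe_cast_to_datetime() to a length-1 datetime64 array, so subarr.ndim == 1, and the single value is then broadcast to the index length:

import pandas as pd

s = pd.Series(pd.Timestamp('2015-10-10'), index=range(3))
print(s.dtype)  # datetime64[ns]
print(len(s))   # 3 -- the scalar was broadcast over the index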

+            dtype = subarr.dtype
+        else:
             value = data
 
             # figure out the dtype from the value (upcast if necessary)
@@ -4121,26 +4125,7 @@ def _try_cast(arr, take_fast_path):
                 # need to possibly convert the value here
                 value = maybe_cast_to_datetime(value, dtype)
 
-            subarr = construct_1d_arraylike_from_scalar(
-                value, len(index), dtype)
-
-        else:
-            return subarr.item()
-
-    # the result that we want
-    elif subarr.ndim == 1:
-        if index is not None:
-
-            # a 1-element ndarray
-            if len(subarr) != len(index) and len(subarr) == 1:
-                subarr = construct_1d_arraylike_from_scalar(
-                    subarr[0], len(index), subarr.dtype)
-
-    elif subarr.ndim > 1:
-        if isinstance(data, np.ndarray):
-            raise Exception('Data must be 1-dimensional')
-        else:
-            subarr = com._asarray_tuplesafe(data, dtype=dtype)
+        subarr = construct_1d_arraylike_from_scalar(value, len(index), dtype)
 
     # This is to prevent mixed-type Series getting all casted to
     # NumPy string type, e.g. NaN --> '-1#IND'.
3 changes: 2 additions & 1 deletion pandas/tests/frame/test_apply.py
@@ -135,7 +135,8 @@ def test_apply_broadcast(self):

         # scalars
         result = self.frame.apply(np.mean, result_type='broadcast')
-        expected = DataFrame([self.frame.mean()], index=self.frame.index)
+        expected = DataFrame([self.frame.mean()] * len(self.frame.index),
+                             index=self.frame.index)
         tm.assert_frame_equal(result, expected)
 
         result = self.frame.apply(np.mean, axis=1, result_type='broadcast')
3 changes: 1 addition & 2 deletions pandas/tests/indexes/test_multi.py
@@ -3215,8 +3215,7 @@ def test_nan_stays_float(self):
         assert pd.isna(idx0.get_level_values(1)).all()
         # the following failed in 0.14.1
         assert pd.isna(idxm.get_level_values(1)[:-1]).all()
-
-        df0 = pd.DataFrame([[1, 2]], index=idx0)
Contributor: what is the issue here?

Member Author: the (outer) list has a different length than the index
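That is, the outer list supplies one row per index label — a sketch with a simple stand-in MultiIndex (editorial illustration; the test's actual indexes contain NaN levels):

import pandas as pd

idx = pd.MultiIndex.from_arrays([[1, 2], ['a', 'b']])  # 2 labels

df_ok = pd.DataFrame([[1, 2]] * 2, index=idx)  # 2 rows for 2 labels

try:
    pd.DataFrame([[1, 2]], index=idx)  # 1 row for a 2-label index
except ValueError as exc:
    print(exc)  # shape of passed values does not match indices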

+        df0 = pd.DataFrame([[1, 2]] * 2, index=idx0)
         df1 = pd.DataFrame([[3, 4]], index=idx1)
         dfm = df0 - df1
         assert pd.isna(df0.index.get_level_values(1)).all()
2 changes: 1 addition & 1 deletion pandas/tests/reshape/test_reshape.py
@@ -496,7 +496,7 @@ def test_preserve_categorical_dtype(self):
         cidx = pd.CategoricalIndex(list("xyz"), ordered=ordered)
         midx = pd.MultiIndex(levels=[['a'], cidx],
                              labels=[[0, 0], [0, 1]])
-        df = DataFrame([[10, 11]], index=midx)
+        df = DataFrame([[10, 11]] * 2, index=midx)
 
         expected = DataFrame([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                              index=midx, columns=cidx)
16 changes: 16 additions & 0 deletions pandas/tests/series/test_constructors.py
@@ -50,6 +50,11 @@ def test_scalar_conversion(self):
         assert int(Series([1.])) == 1
         assert long(Series([1.])) == 1
 
+    @pytest.mark.parametrize('scalar', [1, 'abc', {2, 3}])
+    @pytest.mark.parametrize('index', [range(2), ['a', 'b']])
+    def test_invalid_1_dimensional(self, scalar, index):
+        pytest.raises(ValueError, Series, [scalar], index=index)
+
     def test_constructor(self):
         assert self.ts.index.is_all_dates

@@ -828,6 +833,17 @@ def test_constructor_dict_order(self):
         expected = Series([0, 1, 2], index=list('abc'))
         tm.assert_series_equal(result, expected)
 
+    @pytest.mark.parametrize('input_class', [list, tuple, iter])
+    @pytest.mark.parametrize('dtype', ['object', None])
+    def test_constructor_dict_nested_lists(self, input_class, dtype):
+        # GH 18625
+        d = {'a': input_class([input_class([1, 2, 3]),
+                               input_class([4, 5, 6])]),
+             'b': input_class([input_class([7, 8, 9])])}
+        result = Series(d, index=['a', 'b'], dtype=dtype)
+        expected = Series([d['a'], d['b']], index=['a', 'b'])
+        assert_series_equal(result, expected)
+
     @pytest.mark.parametrize("value", [2, np.nan, None, float('nan')])
     def test_constructor_dict_nan_key(self, value):
         # GH 18480
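Finally, an end-to-end sketch of the headline fix, mirroring the new test above (editorial illustration of the post-fix behavior):

import pandas as pd

# Dict values that are lists of lists are now stored intact as objects.
d = {'a': [[1, 2, 3], [4, 5, 6]], 'b': [[7, 8, 9]]}
s = pd.Series(d)
print(s['a'])   # [[1, 2, 3], [4, 5, 6]]
print(s.dtype)  # object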