BUG: fix construction of Series from dict with nested lists #18626
Conversation
    @@ -3208,7 +3208,11 @@ def _try_cast(arr, take_fast_path):
                return subarr

        elif isinstance(data, (list, tuple)) and len(data) > 0:
            if dtype is not None:
                if all(is_list_like(item) for item in data):
not sure this is the appropriate place for this.
(let me try to persuade you that other things were out of place!)
For performance, can't we first try the `_try_cast` and only if that fails / does not give an appropriately sized subarr do this check of all lists?
Not sure this would be simple to do... but anyway, notice that

- if `data` is "sane" (e.g. first element is not a list), then that line has a very small performance cost, since it is lazily evaluated
- while if `data` is a list of lists, then going through `_try_cast` could have a performance cost (but most importantly from my point of view, behavior would be more complex to analyze)
Just tested:

    In [79]: %timeit np.array(data)
    6.17 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [80]: %timeit all(pd.api.types.is_list_like(item) for item in data)
    2.2 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and the time for such checking is actually negligible compared to the time that the actual conversion to array takes. So indeed not needed for performance reasons.
(if it would not have been the case, I think keeping 'normal' list -> series fast would be higher in priority than the possible slowdown in converting list of lists)
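The trade-off discussed above can be illustrated with a minimal, self-contained sketch (the variable names here are illustrative stand-ins for whatever is passed to the constructor):

```python
import numpy as np
from pandas.api.types import is_list_like

# A "sane" flat list: the generator short-circuits on the first element,
# so the all() check costs a single is_list_like call.
flat = [1, 2, 3]
assert not all(is_list_like(item) for item in flat)

# A list of lists: every element is list-like, so the check scans it all,
# but this is still far cheaper than the actual conversion to an array.
nested = [[1, 2], [3, 4]]
assert all(is_list_like(item) for item in nested)

# This is the behavior the check guards against: numpy interprets nested
# lists as extra dimensions instead of producing a 1-D array of lists.
assert np.array(nested).ndim == 2
```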
> (if it would not have been the case, I think keeping 'normal' list -> series fast would be higher in priority than the possible slowdown in converting list of lists)
I totally agree
By the way: if/when there is an `ndmax` argument to `np.array`, it will allow us to speed up the array creation further.
I would like to move this logic to `_init_listlike` or something similar.
Force-pushed from a45ce6c to c5015c1.
Codecov Report

    @@ Coverage Diff @@
    ## master   #18626   +/- ##
    ==========================================
    - Coverage   91.59%   91.42%   -0.18%
    ==========================================
      Files        155      157       +2
      Lines      51255    51457     +202
    ==========================================
    + Hits       46949    47046      +97
    - Misses      4306     4411     +105

Continue to review full report at Codecov.
Codecov Report

    @@ Coverage Diff @@
    ## master   #18626   +/- ##
    ==========================================
    + Coverage   91.82%   91.85%   +0.03%
    ==========================================
      Files        152      152
      Lines      49255    49252       -3
    ==========================================
    + Hits       45226    45239      +13
    + Misses      4029     4013      -16

Continue to review full report at Codecov.
The new version scans nested types once rather than twice, but there is room for improvement, which I will push tomorrow.
    @@ -3177,7 +3177,8 @@ def _try_cast(arr, take_fast_path):

        # GH #846
        if isinstance(data, (np.ndarray, Index, Series)):

            if data.ndim > 1:
actually should error on data.ndim != 1 (e.g. a numpy scalar of ndim==0 I think hits this path maybe)
can you add tests to cover this case
Shouldn't we allow numpy scalars? (just like normal scalars work, e.g. `Series(1)`)
Sure we do! And this is widely tested (as opposed to the initialization with length 1 list-likes, which was working by accident, accidentally used in the tests, and I now disabled).
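To make the point concrete, a small sketch of how a 0-dimensional numpy scalar behaves like a plain Python scalar in the constructor (which is why a `data.ndim != 1` check needs care):

```python
import numpy as np
import pandas as pd

# A numpy scalar has ndim == 0, so a naive "data.ndim != 1 -> error"
# check would reject it, even though it should broadcast over the
# index exactly like the plain-Python Series(1) case.
assert np.int64(1).ndim == 0

s = pd.Series(np.int64(1), index=range(3))
assert len(s) == 3
assert (s == 1).all()
```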
Force-pushed from c5015c1 to d81ce3a.
Force-pushed from d81ce3a to 16e21fd.
pandas/tests/reshape/test_reshape.py (Outdated)

    @@ -462,7 +462,7 @@ def test_preserve_categorical_dtype(self):
        cidx = pd.CategoricalIndex(list("xyz"), ordered=ordered)
        midx = pd.MultiIndex(levels=[['a'], cidx],
                             labels=[[0, 0], [0, 1]])
    -   df = DataFrame([[10, 11]], index=midx)
    +   df = DataFrame([[10, 11]] * 2, index=midx, columns=[0, 1])
This `columns=[0, 1]` is a typo. Waiting for any comment on the rest, then I will remove it.
pandas/core/series.py (Outdated)

    value = data
    if subarr.ndim == 0 or is_scalar(data):
        if index is None:
            return subarr.item()
this is very odd, are you actually returning a scalar on construction?
I have no idea of what could possibly hit this (i.e. when `index is None`), and it does look broken (e.g. `BlockManager` won't accept it), so I removed it, but if you want to stay on the safe side just discard the last commit I just pushed.
pandas/core/series.py (Outdated)

    if subarr.ndim == 1:
        # a scalar upcasted to 1-dimensional by maybe_cast_to_datetime()
so would much rather fix this case inside maybe_cast_to_datetime
My (limited) understanding is that it is impossible, basically because

    def broken_datetimeindex_roundtrip(dtix):
        scalar = dtix.values[0, ...]
        return pd.DatetimeIndex(np.array([scalar], dtype=scalar.dtype))

loses `tz` and `freq`.

This in turn is probably not trivial at all to fix (or more honestly: I have no idea).
> loses tz and freq.

Uhm... it probably makes perfect sense that it loses `freq` (but not `tz`).
    @@ -3066,8 +3066,7 @@ def test_nan_stays_float(self):
        assert pd.isna(idx0.get_level_values(1)).all()
        # the following failed in 0.14.1
        assert pd.isna(idxm.get_level_values(1)[:-1]).all()

        df0 = pd.DataFrame([[1, 2]], index=idx0)
what is the issue here?
the (outer) list has a different length than the index
    @@ -651,6 +656,17 @@ def test_constructor_dict(self):
        expected.iloc[1] = 1
        assert_series_equal(result, expected)

    @pytest.mark.parametrize('input_class', [list, tuple])
I believe we also accept a duck-typed list class that is not a generator here? (IOW an iterable).
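As a small illustration of the inputs mentioned here (the two parametrized cases plus a duck-typed iterable that is not a generator):

```python
import pandas as pd

# list and tuple are the parametrized cases in the test; range is a
# duck-typed non-generator iterable that the constructor also accepts.
assert pd.Series([1, 2, 3]).tolist() == [1, 2, 3]
assert pd.Series((1, 2, 3)).tolist() == [1, 2, 3]
assert pd.Series(range(3)).tolist() == [0, 1, 2]
```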
Force-pushed from 16e21fd to 707abf2.
pandas/core/common.py (Outdated)

    @@ -381,25 +401,17 @@ def _asarray_tuplesafe(values, dtype=None):
            return values.values

    -   if isinstance(values, list) and dtype in [np.object_, object]:
    -       return lib.list_to_object_array(values)
    +   return _as_1d_array(values)
Notice this is more efficient than the cython version.

can you run the series construction asv for this?
Force-pushed from 707abf2 to b5225ee.
(but they are borderline, they don't appear on every run)

can you show the series construction ones

Sorry, I had missed [...]

These are from [...]

Those benchmarks are mostly not on point, e.g. do we have ones which test list-of-list, dict-of-list, dict-of-array, dict-of-series etc. for frames, and list, tuple, dict, array, Index, Series for series?
pandas/core/common.py (Outdated)

    @@ -372,6 +372,26 @@ def intersection(*seqs):
        return type(seqs[0])(list(result))


    def _as_1d_array(values, dtype='object'):
move to pandas/core/dtypes/cast.py (at the bottom); rename in that style as well.
can move _asarray_tuplesafe as well (here or future PR)
pandas/core/common.py (Outdated)

        -------
        1-dimensional numpy array
        """
        # Making a 1D array that safely contains list-likes is a bit tricky
explain why if you are leaving a comment like this
pandas/core/common.py (Outdated)

        result = np.asarray(values, dtype=dtype)

        if issubclass(result.dtype.type, compat.string_types):
            result = np.asarray(values, dtype=object)

        if result.ndim == 2:
            if isinstance(values, list):
                return lib.list_to_object_array(values)
So it seems like this basically replaces `lib.list_to_object_array`; if that is so, then we should simply do that (it can live in pandas.core.dtypes.cast as well).

    if hasattr(values, '__array__'):
        values = [tuple(x) for x in values]
    result = _as_1d_array(values)
Yes, `_as_1d_array` can entirely replace `lib.list_to_object_array`. My only doubt is that the latter is used in some cython files... I have no idea if in those cases using a cython function has advantages.
Sorry, I had missed the second part of your comment. And I don't understand it (isn't it what I'm doing?)
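A minimal sketch of the trick being discussed (the helper name below is hypothetical; the actual `_as_1d_array` in the PR may differ): pre-allocate a 1-D object array and fill it element by element, so numpy never gets a chance to interpret nested lists as extra dimensions.

```python
import numpy as np

def as_1d_object_array(values):
    # Hypothetical sketch: np.array(values) would produce a 2-D array
    # for equal-length nested lists, so build the 1-D object array by hand.
    result = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        result[i] = value
    return result

# The naive conversion yields 2 dimensions; the helper keeps 1.
assert np.array([[1, 2], [3, 4]]).ndim == 2
arr = as_1d_object_array([[1, 2], [3, 4]])
assert arr.ndim == 1 and arr.shape == (2,)
assert arr[0] == [1, 2]
```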
pandas/core/series.py (Outdated)

    @@ -3212,7 +3214,9 @@ def _try_cast(arr, take_fast_path):
                return subarr

        elif isinstance(data, (list, tuple)) and len(data) > 0:
            if dtype is not None:
                if all(is_list_like(item) for item in data):
                    subarr = _as_1d_array(data)
see comment above
    if subarr.ndim == 0 or is_scalar(data):
        if subarr.ndim == 1:
            # a scalar upcasted to 1-dimensional by maybe_cast_to_datetime()
            value = subarr[0]
how is this branch ever hit?
    pd.Series(pd.Timestamp('2015-10-10'), index=range(3))
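For context, a hedged sketch of that code path: the scalar is broadcast to the length of the index, and internally a datetime scalar can be upcast to a 1-dimensional array on the way.

```python
import pandas as pd

# A Timestamp scalar with an explicit index is broadcast to len(index);
# internally pandas may upcast the scalar to a 1-dimensional datetime64
# array first, which is the branch under discussion.
s = pd.Series(pd.Timestamp('2015-10-10'), index=range(3))
assert len(s) == 3
assert (s == pd.Timestamp('2015-10-10')).all()
```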
For frames there's the entire [...]
Force-pushed from b5225ee to f56134b.
For the change of list of one element (previously broadcast, now an error), I would do this as a deprecation instead of a hard breaking change. Thoughts?

You mean the tests? They look to me like accidental uses of this "feature" - after all, just passing the scalar is simpler. Anyway, I'm a bit against deprecation in this case because the docs are quite clear, but sure, it can be done.

Yes, accidental, but many people may have done it "accidentally", and since it worked perfectly there was no incentive to fix your code.

ok, as indicated above, pls add some.
pandas/core/dtypes/cast.py (Outdated)

    @@ -1162,3 +1162,23 @@ def construct_1d_arraylike_from_scalar(value, length, dtype):
        subarr.fill(value)

        return subarr


    def cast_to_1d_array(values, dtype='object'):
rename like the above, e.g. `construct_1d_array_from_listlike`
pandas/core/dtypes/cast.py (Outdated)

        Returns
        -------
        1-dimensional numpy array
of object type
pandas/core/dtypes/cast.py (Outdated)

        1-dimensional numpy array
        """
        # numpy will try to interpret nested lists as further dimensions, hence
        # making a 1D array that contains list-likes is a bit tricky:
can you add some tests for this, e.g. cases would be iterables, including a list of strings with one or more np.nan elements, and throw in some Nones
rebase and will have a look

@toobaz status of this?
Force-pushed from f56134b to 14f0443.
Rebased. We need to decide what to do about the "length 1 = scalar" thing. I thought we could just break it, since it was never documented, is conceptually wrong, and supporting it complicates our code. Joris suggests a deprecation cycle instead. Could a compromise make sense, in which we remove support but also insert (for a couple of releases) an informative warning saying "broadcasting [...] was never officially supported, but it worked accidentally; support was dropped in version 0.23, replace EXAMPLE_CODE with FIXED_EXAMPLE"?
We already changed the length 1 = scalar, iirc. Are you suggesting we should revert and deprecate instead?
Sorry, I got confused. True, we have disabled [...]. However, the thing is slightly more complicated than I thought. My idea was that (in analogy with the Series case) [...]

So, ideal world: if a lower dim object [...]

Acceptable world: [...]

If the ideal world is too disruptive (backward incompatible), we can at least aim for the acceptable one. In this case, I'm no more sure we want to drop support for [...]
Answered myself in #20837. Will update this PR accordingly (not in time for 0.23.0).
can you rebase & move note to 0.24

can you rebase

closing as stale, but pls reopen if you can continue
    git diff upstream/master -u -- "*.py" | flake8 --diff