Skip to content

REGR: fix regression in scalar setitem with setting a length-1 array-like #48057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 12, 2022

While testing the various related cases, I noticed that we don't have very consistent behaviour always, see the TODO notes in the tests.
In pandas 1.3, we always unpacked an array-like when setting, also for object dtype. This means that for object dtype, it unpacked it for len-1, and raised an error for longer arrays:

In [31]: df = pd.DataFrame({"A": [1, 2, 3], "B": np.array([1, 'a', 'b'], dtype=object)})

In [32]: df.loc[0, "B"] = np.array([0])

In [33]: df
Out[33]: 
   A  B
0  1  0
1  2  a
2  3  b

In [34]: df.loc[0, "B"] = np.array([0, 1])
...
ValueError: Must have equal len keys and value when setting with an iterable

With the change in #42780, in pandas 1.4 we started to preserve the nested object in case of len-1, but still raising for longer arrays (with object dtype column):

In [24]: df = pd.DataFrame({"A": [1, 2, 3], "B": np.array([1, 'a', 'b'], dtype=object)})

In [25]: df.loc[0, "B"] = np.array([0])

In [26]: df
Out[26]: 
   A    B
0  1  [0]   # <-----
1  2    a
2  3    b

In [27]: df.loc[1, "B"] = np.array([0, 1])
...
ValueError: Must have equal len keys and value when setting with an iterable

This is a bit inconsistent I would say, and I think we should probably long term allow setting nested data of any length in case of object dtype column (the first TODO note).

Because of the above change (no longer unpacking for len-1 array), this started to error for numeric columns (i.e. the actual regression reported in #46268).

What I did in this PR is limiting the change of #42780 to only object dtype, so we keep the new behaviour of preserving the nested object when setting in case object dtype, while fixing the regression for numeric data.
(strictly speaking, we could also say that it is a regression for the object dtype case, but there I think it was actually an improvement in behaviour, and since this was already changed in 1.3.5, and nobody reported it, it seems better to keep that now)

In addition, while testing the same for Series, I noticed that for Series setitem, we only do this unpacking of len-1 array-like values for actual numpy arrays, and not for setting array-likes in general (the second TODO note).

TODO: open new issues to keep track of those TODOs and update those comments

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). labels Aug 12, 2022
@jorisvandenbossche jorisvandenbossche added this to the 1.4.4 milestone Aug 12, 2022
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments


# TODO For object dtype this happens as well, but should we rather preserve
# the nested data and set as such?
df = DataFrame({"A": [1, 2, 3], "B": np.array([1, "a", "b"], dtype=object)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sound good

tm.assert_frame_equal(df, expected)

# but for object dtype we preserve the nested data
df = DataFrame({"A": [1, 2, 3], "B": np.array([1, "a", "b"], dtype=object)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do the above, this is correct I think

def test_scalar_setitem_series_with_nested_value_length1(value, indexer_sli):
# For numeric data, assigning length-1 array to scalar position gets unpacked
# TODO this only happens in case of ndarray, should we make this consistent
# for all list-likes? (as happens for DataFrame.(i)loc, see test above)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this should unpack for everything. The DataFrame case already does

@jorisvandenbossche
Copy link
Member Author

@phofl just to be clear: those TODO comments I added are interesting points (and I should open new issues for those), but fixing those are out of scope of this focused regression fix for 1.4.4 (your feedback on those questions was very welcome! just to set expectations that I don't plan to "fix" those todos here ;)). See also the longer explanation in the top post.

@phofl
Copy link
Member

phofl commented Aug 12, 2022

Yep understood this, no worries. Just wanted to coment in code to avoid misunderstandings with the references

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to merge

@mroeschke mroeschke merged commit 73e0b5e into pandas-dev:main Aug 19, 2022
@lumberbot-app

This comment was marked as resolved.

@mroeschke
Copy link
Member

Thanks @jorisvandenbossche (failures unrelated)

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Aug 19, 2022
phofl pushed a commit that referenced this pull request Aug 19, 2022
…alar setitem with setting a length-1 array-like) (#48161)

Backport PR #48057: REGR: fix regression in scalar setitem with setting a length-1 array-like

Co-authored-by: Joris Van den Bossche <[email protected]>
@jorisvandenbossche jorisvandenbossche deleted the regr-setitem-scalar-with-nested branch August 20, 2022 06:51
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: ValueError: setting an array element with a sequence. when assigning an array.array using .loc
4 participants