BUG: Inconsistency for iloc between np.array and pd.array #40933

phofl · 2021-04-13T21:44:48Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

iloc behaves inconsistently when setting values from a numpy array and a pandas array

import pandas as pd

df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [3, 4, 5, 6],
    'col3': [6, 7, 8, 9],
})
rhs = pd.array([1, 2, 3])  # pandas array constructor
df.iloc[[1, 2, 3]] = rhs   # assign to rows 1-3, all columns

Problem description

This returns

   col1  col2  col3
0     1     3     6
1     1     1     1
2     2     2     2
3     3     3     3

When switching the rhs to a numpy array, this returns

rhs = np.array([1, 2, 3])

   col1  col2  col3
0     1     3     6
1     1     2     3
2     1     2     3
3     1     2     3

Same happens when using [[1, 2, 3], :] as indexer
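One way to read the two results above is through NumPy broadcasting. The sketch below uses plain NumPy only and is an illustration, not a description of pandas internals; `block` is a hypothetical stand-in for the three selected rows of the frame.

```python
import numpy as np

rhs = np.array([1, 2, 3])

# Hypothetical stand-in for the three selected rows (rows 1-3) of the 4x3 frame.
block = np.zeros((3, 3), dtype=np.int64)

# A shape (3,) array is aligned with the last axis, so every row receives [1, 2, 3].
# This matches the output reported for the numpy rhs.
block[:] = rhs
row_wise = block.copy()

# Reshaped to (3, 1), the same values broadcast along the other axis:
# row i is filled with rhs[i]. This matches the output reported for pd.array.
block[:] = rhs.reshape(-1, 1)
column_wise = block.copy()

print(row_wise)
print(column_wise)
```

Under this reading, the pd.array result corresponds to the right-hand side being treated as a column rather than as a row.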

Expected Output

We should be consistent here, if we want to change iloc for the numpy case I think we need to deprecate?

Is this already known?
cc @jbrockmendel @jorisvandenbossche

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : c8e077f
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-48-generic
Version : #54~20.04.1-Ubuntu SMP Sat Mar 20 13:40:25 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0.dev0+1320.gc8e077fa87.dirty
numpy : 1.20.0
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.2
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.3
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 1.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.0
sqlalchemy : 1.3.22
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0
None


jbrockmendel · 2021-04-14T18:01:30Z

and a pandas array

just to be clear, rhs = pd.array([1, 2, 3]) gives an IntegerArray, not a PandasArray, right?

We should be consistent here, if we want to change iloc for the numpy case I think we need to deprecate?

I think the behavior with ndarray is "more correct" here. Do you disagree?

phofl · 2021-04-14T18:25:25Z

just to be clear, rhs = pd.array([1, 2, 3]) gives an IntegerArray, not a PandasArray, right?

Sorry, I meant the pandas array constructor. Yes this returns an IntegerArray.

I think the behavior with ndarray is "more correct" here. Do you disagree?

Yeah I think so too

parkyo · 2021-04-14T18:56:47Z

take


parkyo · 2021-04-19T01:22:40Z

So it looks like there is no issue. The outputs differ because you modified df in place with the pandas array and then applied iloc with a numpy array to the already modified DataFrame. Try declaring separate DataFrames for the numpy array and the pandas array, like below.

phofl · 2021-04-19T07:31:00Z

Your test is wrong.
Copying the DataFrame before each assignment makes the comparison raise as expected:

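import numpy as np
import pandas as pd
import pandas._testing as tm
from pandas import DataFrame

df = DataFrame(
    data={
        "col1": [1, 2, 3, 4],
        "col2": [3, 4, 5, 6],
        "col3": [6, 7, 8, 9],
    }
)
# Independent copies, so neither assignment sees the other's changes
df_np = df.copy()
df_pd = df.copy()
np_arr = np.array([1, 2, 3])
pd_arr = pd.array([1, 2, 3])
df_np.iloc[[1, 2, 3]] = np_arr
df_pd.iloc[[1, 2, 3]] = pd_arr
tm.assert_frame_equal(df_np, df_pd)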

phofl · 2021-04-23T16:45:39Z

@parkyo are you still working on this?

parkyo · 2021-04-23T16:53:43Z

@phofl Yes! I just made the pull request
#41028

phofl · 2021-05-01T22:01:28Z

@jbrockmendel got a follow up question here:

In 2d08672 you've added the testcase test_iloc_setitem_ea_inplace which basically comes down to:

df = DataFrame([1, 2, 3, 4])
rhs = pd.array([3, 4])
df.iloc[:2] = rhs

Which returns

This contradicts the non-ea case, which returns:

rhs = np.array([1, 2, 3, 4])
obj = DataFrame([1, 2, 3, 4])
obj.iloc[:2] = rhs

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2021.1/scratches/scratch_4.py", line 553, in <module>
    obj.iloc[:2] = arr
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 717, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1713, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1949, in _setitem_single_block
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 369, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 338, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 941, in setitem
    check_setitem_lengths(indexer, value, values)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexers.py", line 184, in check_setitem_lengths
    raise ValueError(
ValueError: cannot set using a slice indexer with a different length than the value

This tries to set row-wise.
If we go with the non-ea case, we have to revert that and change the test. Would you agree?
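The length check at the bottom of the traceback can be pictured roughly as follows. This is a simplified sketch in plain Python: `check_slice_length` is a hypothetical helper written for illustration and is not pandas' actual `check_setitem_lengths`, which handles more cases. Only the error message is taken from the traceback.

```python
def check_slice_length(indexer: slice, value, target_len: int) -> None:
    """Hypothetical, simplified stand-in for the length check in the traceback."""
    # Number of positions the slice selects in a target of length target_len.
    n_selected = len(range(*indexer.indices(target_len)))
    if len(value) != n_selected:
        raise ValueError(
            "cannot set using a slice indexer with a different length than the value"
        )


# Two values for the two rows selected by [:2]: the check passes.
check_slice_length(slice(None, 2), [3, 4], 4)

# Four values for two rows: the check raises, as in the non-ea example.
try:
    check_slice_length(slice(None, 2), [1, 2, 3, 4], 4)
except ValueError as err:
    message = str(err)
    print(message)
```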

jbrockmendel · 2021-05-01T22:29:21Z

Yah, I think it would make more sense for the obj.iloc[:2] = rhs to be obj.iloc[:2, 0] = rhs. Would need to be sure that still goes through the affected code path.
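A minimal sketch of the column-targeted form suggested in this comment, assuming a pandas version in which this assignment is supported. Whether it still goes through the affected code path is not checked here.

```python
import pandas as pd

obj = pd.DataFrame([1, 2, 3, 4])
rhs = pd.array([3, 4])

# Target the first two rows of column 0 explicitly, so the two values
# line up with the two selected rows and no orientation has to be guessed.
obj.iloc[:2, 0] = rhs

result = obj[0].tolist()
print(result)
```

Naming the column makes the indexer select a one-dimensional slot, which a length-2 array fills unambiguously.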

github-actions bot assigned parkyo Apr 14, 2021

parkyo pushed a commit to parkyo/pandas that referenced this issue Apr 19, 2021

added test case for iloc function if it returns the same output for b…

f57718e

…oth numpy and pandas arrays to fix the issue pandas-dev#40933

parkyo mentioned this issue Apr 19, 2021

added test case for iloc function if it returns the same output for b… #41028

Closed

4 tasks

This was referenced May 2, 2021

Bug in setitem raising ValueError with row-slice indexer on df with list-like on rhs #41268

Merged

Bug in iloc.setitem orienting IntegerArray into the wrong direction #41288

Merged

jreback added this to the 1.3 milestone May 4, 2021

jreback closed this as completed in #41288 May 5, 2021
