Skip to content

BUG: Inconsistency for iloc between np.array and pd.array #40933

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
phofl opened this issue Apr 13, 2021 · 9 comments · Fixed by #41288
Closed
3 tasks done

BUG: Inconsistency for iloc between np.array and pd.array #40933

phofl opened this issue Apr 13, 2021 · 9 comments · Fixed by #41288
Assignees
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@phofl
Copy link
Member

phofl commented Apr 13, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

We got some inconsistencies with iloc between setting a numpy array and a pandas array

df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [3, 4, 5, 6],
    'col3': [6, 7, 8, 9],
})
rhs = pd.array([1, 2, 3])
df.iloc[[1, 2, 3]] = rhs

Problem description

This returns

   col1  col2  col3
0     1     3     6
1     1     1     1
2     2     2     2
3     3     3     3

When switching the rhs to a numpy array, this returns

rhs = np.array([1, 2, 3])

   col1  col2  col3
0     1     3     6
1     1     2     3
2     1     2     3
3     1     2     3

Same happens when using [[1, 2, 3], :] as indexer

Expected Output

We should be consistent here, if we want to change iloc for the numpy case I think we need to deprecate?

Is this already known?
cc @jbrockmendel @jorisvandenbossche

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : c8e077f python : 3.8.6.final.0 python-bits : 64 OS : Linux OS-release : 5.8.0-48-generic Version : #54~20.04.1-Ubuntu SMP Sat Mar 20 13:40:25 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.3.0.dev0+1320.gc8e077fa87.dirty
numpy : 1.20.0
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.2
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.3
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 1.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.0
sqlalchemy : 1.3.22
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0
None

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 13, 2021
@jbrockmendel
Copy link
Member

and a pandas array

just to be clear, rhs = pd.array([1, 2, 3]) gives an IntegerArray, not a PandasArray, right?

We should be consistent here, if we want to change iloc for the numpy case I think we need to deprecate?

I think the behavior with ndarray is "more correct" here. Do you disagree?

@phofl
Copy link
Member Author

phofl commented Apr 14, 2021

just to be clear, rhs = pd.array([1, 2, 3]) gives an IntegerArray, not a PandasArray, right?

Sorry, I meant the pandas array constructor. Yes this returns an IntegerArray.

I think the behavior with ndarray is "more correct" here. Do you disagree?

Yeah I think so too

@parkyo
Copy link

parkyo commented Apr 14, 2021

take

parkyo pushed a commit to parkyo/pandas that referenced this issue Apr 19, 2021
@parkyo
Copy link

parkyo commented Apr 19, 2021

So it looks like there is no issue. The reason why you got different outputs was because you modified df directly once with pandas array and then applied iloc with a numpy array to the modified dataframe. Try it by declaring two dataframes for numpy array and pandas array like below.
Screen Shot 2021-04-18 at 9 22 16 PM

@phofl
Copy link
Member Author

phofl commented Apr 19, 2021

Your test is wrong.
Adding copy to the DataFrame assignment raises as expceted:

df = DataFrame(
    data={
        "col1": [1, 2, 3, 4],
        "col2": [3, 4, 5, 6],
        "col3": [6, 7, 8, 9],
    }
)
df_np = df.copy()
df_pd = df.copy()
np_arr = np.array([1, 2, 3])
pd_arr = pd.array([1, 2, 3])
df_np.iloc[[1, 2, 3]] = np_arr
df_pd.iloc[[1, 2, 3]] = pd_arr
tm.assert_frame_equal(df_np, df_pd)

@phofl
Copy link
Member Author

phofl commented Apr 23, 2021

@parkyo are you still working on this?

@parkyo
Copy link

parkyo commented Apr 23, 2021

@phofl Yes! I just made the pull request
#41028

@phofl
Copy link
Member Author

phofl commented May 1, 2021

@jbrockmendel got a follow up question here:

In 2d08672 you've added the testcase test_iloc_setitem_ea_inplace which basically comes down to:

df = DataFrame([1, 2, 3, 4])
rhs = pd.array([3, 4])
df.iloc[:2] = rhs

Which returns

   0
0  3
1  4
2  3
3  4

This contradicts the non-ea case, which returns:

rhs = np.array([1, 2, 3, 4])
obj = DataFrame([1, 2, 3, 4])
obj.iloc[:2] = rhs

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2021.1/scratches/scratch_4.py", line 553, in <module>
    obj.iloc[:2] = arr
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 717, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1713, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1949, in _setitem_single_block
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 369, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 338, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 941, in setitem
    check_setitem_lengths(indexer, value, values)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexers.py", line 184, in check_setitem_lengths
    raise ValueError(
ValueError: cannot set using a slice indexer with a different length than the value

This tries to set row-wise.
If we go with the non-ea case we have to turn that back and change the test. Would you agree?

@jbrockmendel
Copy link
Member

Yah, I think it would make more sense for the obj.iloc[:2] = rhs to be obj.iloc[:2, 0] = rhs. Would need to be sure that still goes through the affected code path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Discussion Requires discussion from core team before further action
Projects
None yet
4 participants