Skip to content

BUG: DafaFrame.insert doesn't raise an exception when inserting another DataFrame #42403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
YarShev opened this issue Jul 6, 2021 · 6 comments · Fixed by #42831
Closed
2 of 3 tasks

BUG: DafaFrame.insert doesn't raise an exception when inserting another DataFrame #42403

YarShev opened this issue Jul 6, 2021 · 6 comments · Fixed by #42831
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite
Milestone

Comments

@YarShev
Copy link
Contributor

YarShev commented Jul 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas
df = pandas.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df.insert(1, "newcol", df)
df
   col1 newcol  col2
0     1   col1     3
1     2   col2     4

Problem description

Dataframe should take int, Series, or array-like as a value and throw an exception in case DataFrame is passed.

Expected Output

Behavior in 1.2.5

import pandas
df = pandas.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df.insert(1, "newcol", df)
ValueError: Wrong number of items passed 2, placement implies 1

Output of pd.show_versions()

pandas : 1.3.0
numpy : 1.20.3
pytz : 2020.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : 0.15.0
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.4.18
tables : 3.6.1
tabulate : None
xarray : 0.18.2
xlrd : 2.0.1
xlwt : None
numba : None

@YarShev YarShev added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2021
@rhshadrach
Copy link
Member

Is a DataFrame considered "array-like"? To me, for an array-like object obj, the values used would be equivalent to list(obj), which is what DataFrame.insert is doing here.

@YarShev
Copy link
Contributor Author

YarShev commented Jul 7, 2021

I am not sure on passing df as a value in insert that entails an insertion of columns list is clear enough from documentation (link). The behavior in 1.2.5 looks more appropriate, at least to me.

@dchigarev
Copy link

dchigarev commented Jul 7, 2021

Is a DataFrame considered "array-like"?

It is since #38459, this PR removed special casing for DataFrame in _sanitize_column (called by DataFrame.insert). Since this commit, DataFrame is handled as any other array-like object:

  1. First insert checks that the lengths of value and the source frame are equal len(value) == len(df)
  2. Then it converts value into a list value = list(value)
  3. Insert converted value without length checks

This behavior looks ok, however for values, which len(value) != len(list(value)) (DataFrame is one of these objects) we may end up in inconsistent frame:

Example
import pandas

df = pandas.DataFrame({"col1": [1, 2, 3]})

# This frame passes `len(df) == len(value)` check, however `list(value)` generates
# a list which length is not equal to df (`len(value) != len(list(value)) != len(df)`)
value = pandas.DataFrame({"col2": [4, 5, 6], "col3": [7, 8, 9]})

# `df` became an inconsistent frame
df.insert(loc=0, column="value", value=value)
print(df)
#   value  col1
# 0  col2     1
# 1  col3     2
# 2           3  <-- missing value at this row?

print(df.iloc[2]) # IndexError: index 2 is out of bounds for axis 1 with size 2
print(df.T) # ValueError: could not broadcast input array from shape (1,2) into shape (1,3)

I'm not sure whether it's insert problem (it doesn't double-check the length of the value) or in the meaning of DataFrame iterator len(df) != len(list(df)), probably both?

@rhshadrach
Copy link
Member

I'm not sure whether it's insert problem (it doesn't double-check the length of the value) or in the meaning of DataFrame iterator len(df) != len(list(df)), probably both?

Intuitively I think of len(df) as the number of rows rather than columns, but I see how this doesn't play well with the perspective that a DataFrame is a mapping of labels to Series. Interestingly, the docstring for DataFrame.__len__ is:

Returns length of info axis, but here we use the index.

where the info axis for DataFrame is "columns". Even if __len__ reflecting the number of columns is deemed "better", unless this poses serious issues, it seems to me changing would involve more trouble than it's worth.

@YarShev
Copy link
Contributor Author

YarShev commented Jul 29, 2021

Feel free to close the issue because I see expected behavior of the operation to me since 1.3.1.

@jreback
Copy link
Contributor

jreback commented Jul 29, 2021

we should have an explicit test for this

@jreback jreback added Testing pandas testing functions or related to the test suite good first issue Indexing Related to indexing on series/frames, not to indexes themselves and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 29, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 3, 2021
@jreback jreback added this to the 1.4 milestone Aug 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants