BUG: iloc fills multiple columns, if columns have duplicate names #12991

Open
henhuy opened this issue Apr 26, 2016 · 10 comments
Labels: Bug, Indexing (related to indexing on series/frames, not to indexes themselves)

Comments

@henhuy

henhuy commented Apr 26, 2016

When a DataFrame has two columns with the same name, assigning data through a chained indexer changes both columns:

import pandas as pd

a = pd.DataFrame(index=['a', 'b', 'c'], columns=['d', 'e', 'd']).fillna(0)
a.iloc[:, 0]['a'] = 3

Gives:

   d  e  d
a  3  0  3
b  0  0  0
c  0  0  0

Instead, it should only edit the first column:

   d  e  d
a  3  0  0
b  0  0  0
c  0  0  0

INSTALLED VERSIONS

commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-68-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8

pandas: 0.17.1
nose: 1.3.6
pip: 1.5.4
setuptools: 3.3
Cython: None
numpy: 1.10.1
scipy: 0.13.3
statsmodels: None
IPython: 3.1.0
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.3
xlwt: 0.8.0
xlsxwriter: 0.7.7
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None

@jreback
Contributor

jreback commented Apr 26, 2016

This is a tricky bug, though you are using chained indexing here, so there are no guarantees.

Instead do this.

In [7]: a.iloc[a.index.get_loc('a'), 0] = 3

In [8]: a
Out[8]: 
   d  e  d
a  3  0  0
b  0  0  0
c  0  0  0
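
The suggestion above can be written as a standalone script. This is a minimal sketch, assuming a reasonably recent pandas; the frame is built directly with zeros rather than via `fillna`, and the names `row`, `first_d`, and `second_d` are introduced here only for illustration.

```python
import pandas as pd

# Same shape as the report: two columns share the name "d".
a = pd.DataFrame(0, index=["a", "b", "c"], columns=["d", "e", "d"])

# Chained form from the report, shown only as a comment:
#   a.iloc[:, 0]["a"] = 3
# It is two operations (select a column, then assign into that result),
# so whether `a` itself is modified is not guaranteed.

# Single-step form: translate the row label to a position, then assign once.
row = a.index.get_loc("a")  # label -> integer position (the index is unique here)
a.iloc[row, 0] = 3

# Read both "d" columns back by position, not by name.
first_d = a.iloc[:, 0].tolist()
second_d = a.iloc[:, 2].tolist()
print(a)
```

Reading the columns back with `iloc` rather than `a["d"]` matters here, because selecting by the duplicated label returns both columns at once.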

jreback added the Bug, Indexing, and Difficulty Advanced labels on Apr 26, 2016
jreback added this to the Next Major Release milestone on Apr 26, 2016
@henhuy
Author

henhuy commented Apr 26, 2016

That's what I did right now - thanks!

@toobaz
Member

toobaz commented Mar 14, 2017

A "more licit" example of the bug in action:

In [4]: df1 = DataFrame([{'A': None, 'B': 1}, {'A': 2, 'B': 2}])
   ...: df2 = DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}])
   ...: df = concat([df1, df2], axis=1)
   ...: df.iloc[0, 0] = 15
   ...: df
   ...: 
Out[4]: 
      A  B     A  B
0  15.0  1  15.0  3
1   2.0  2   4.0  4
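
For anyone who wants to re-run this check outside IPython, the session above can be restated as a standalone script. This is only a sketch with explicit imports; the names `before` and `after` are introduced here for illustration, and no particular result is claimed, since the behavior depends on the installed pandas version.

```python
import pandas as pd

# Two frames with the same column names, one of them with a missing value.
df1 = pd.DataFrame([{"A": None, "B": 1}, {"A": 2, "B": 2}])
df2 = pd.DataFrame([{"A": 3, "B": 3}, {"A": 4, "B": 4}])

# Side-by-side concatenation yields the duplicated labels A, B, A, B.
df = pd.concat([df1, df2], axis=1)

# Record the cell in the second "A" column before the positional write.
before = df.iloc[0, 2]
df.iloc[0, 0] = 15
after = df.iloc[0, 2]

print(df)
print("second 'A' column changed:", before != after)
```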

@toobaz
Member

toobaz commented Aug 9, 2017

> A "more licit" example of the bug in action:

For reference: it is actually not the same bug, since it is fixed by #17163, while the original example by the submitter is not.

I still think, however, that there is some way to expose the (original) bug without chained assigning.

@MohakGangwani

Wouldn't it be better if pandas didn't allow us to have duplicate column names? Having columns with the same name would rather create confusion for us.

@toobaz
Member

toobaz commented May 31, 2020

> Having columns with the same name would rather create confusion for us.

Indeed, not having duplicate column names is a good idea.

Enforcing this in pandas (and doing it consistently - that is, also on rows) would break a lot of code.

@mitar
Contributor

mitar commented May 31, 2020

> would break a lot of code.

But breaking it loudly and consistently is maybe better than breaking some of it silently.

@toobaz
Member

toobaz commented May 31, 2020

> But breaking it loudly and consistently is maybe better than breaking some of it silently.

Indeed, the safest way to break buggy code loudly and consistently is, always, to just break all code.

@mitar
Contributor

mitar commented May 31, 2020

Not having duplicate column names would also help with sklearn, which is planning to introduce features that operate on pandas column names but require those names to be unique. With such a limit in pandas DataFrames, you would not be surprised when you pass one to sklearn and it complains.

@toobaz
Member

toobaz commented May 31, 2020

> sklearn, which is planning to introduce features operating on pandas and column names, but require column names to be unique to work

For sklearn devs, checking if an index is unique is as simple as checking .is_unique.
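
As a rough illustration of that check, the sketch below applies `.is_unique` to the column index of a frame shaped like the one in the original report. The helper `require_unique_columns` is hypothetical, written here only to show how a caller could fail loudly; it is not part of pandas or sklearn.

```python
import pandas as pd

df = pd.DataFrame(0, index=["a", "b", "c"], columns=["d", "e", "d"])

# Index.is_unique is False as soon as any label is repeated.
columns_unique = df.columns.is_unique

# Index.duplicated() marks every repeat after the first occurrence.
repeated = df.columns[df.columns.duplicated()].unique().tolist()


def require_unique_columns(frame):
    """Hypothetical guard: raise if any column label is repeated."""
    if not frame.columns.is_unique:
        dupes = frame.columns[frame.columns.duplicated()].unique().tolist()
        raise ValueError(f"duplicate column names: {dupes}")
    return frame


print(columns_unique, repeated)
```

A library that needs unique names could call such a guard at its own boundary, which keeps the restriction local to that library.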

For sklearn users, without context I don't know if what you describe is a legitimate need or just an implementation limit, but in any case, it will just have to be documented in sklearn and shouldn't affect the many non-sklearn users.

In any case, I think this is vastly off topic here, as this is a very specific bug caused by a code path which should be rewritten anyway.
