BUG: iloc fills multiple columns, if columns have duplicate names #12991
Comments
This is a tricky bug, though you are using chained indexing here, so there are no guarantees. Instead do this.
|
That's what I did right now - thanks! |
A "more licit" example of the bug in action:

    In [4]: df1 = DataFrame([{'A': None, 'B': 1}, {'A': 2, 'B': 2}])
       ...: df2 = DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}])
       ...: df = concat([df1, df2], axis=1)
       ...: df.iloc[0, 0] = 15
       ...: df
       ...:
    Out[4]:
          A  B     A  B
    0  15.0  1  15.0  3
    1   2.0  2   4.0  4
|
For reference: it is actually not the same bug, since it is fixed by #17163, while the original example by the submitter is not. I still think, however, that there is some way to expose the (original) bug without chained assignment. |
Wouldn't it be better if pandas didn't allow us to have duplicate column names? Having columns with the same name only creates confusion for us. |
Indeed, not having duplicate column names is a good idea. Enforcing this in pandas (and doing it consistently - that is, also on rows) would break a lot of code. |
But breaking it loudly and consistently is maybe better than breaking some of it silently. |
Indeed, the safest way to break buggy code loudly and consistently is, always, to just break all code. |
Not having duplicate column names would also help with sklearn, which is planning to introduce features that operate on pandas column names but require those names to be unique. With such a limit in pandas DataFrames, you would not be surprised when you pass one to sklearn and it complains. |
For sklearn devs, checking if an index is unique is as simple as checking
For sklearn users, without context I don't know if what you describe is a legitimate need or just an implementation limit, but in any case, it will just have to be documented in sklearn and shouldn't affect the many non-sklearn users.
In any case, I think this is vastly off topic here, as this is a very specific bug caused by a code path which should be rewritten anyway. |
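The uniqueness check mentioned above can be sketched as follows. This is a minimal illustration, assuming a recent pandas; the frame and its labels are made up for the example, and `Index.is_unique` and `Index.duplicated` are shown as one way to perform the check, not necessarily the snippet the commenter had in mind.

```python
import pandas as pd

# Hypothetical frame with a duplicated column label, mirroring the report.
df = pd.DataFrame(0, index=["a", "b", "c"], columns=["d", "e", "d"])

# True only when every column label occurs exactly once.
columns_unique = df.columns.is_unique

# Boolean mask marking the second and later occurrences of each label.
dup_mask = df.columns.duplicated()

# One possible workaround: keep only the first occurrence of each label.
deduped = df.loc[:, ~dup_mask]

print(columns_unique, list(deduped.columns))
```

A library that needs unique labels could run such a check up front and raise a clear error, rather than relying on pandas to forbid duplicates.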
When a DataFrame has two columns with the same name, changing data via an indexer changes both columns:
Gives:
       d  e  d
    a  3  0  3
    b  0  0  0
    c  0  0  0
Instead, it should only edit the first column:

       d  e  d
    a  3  0  0
    b  0  0  0
    c  0  0  0
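As a point of comparison, the sketch below sets a single cell with one positional `iloc` call instead of chained indexing. The frame is a hypothetical reconstruction based on the output shown above (the reporter's original snippet is not reproduced here), and the result is what recent pandas versions are expected to give, not a statement about 0.17.1.

```python
import pandas as pd

# Hypothetical reconstruction: three rows, columns "d", "e", "d", all zeros.
df = pd.DataFrame(0, index=["a", "b", "c"], columns=["d", "e", "d"])

# Single positional assignment: row 0, column 0. No chained indexing involved.
df.iloc[0, 0] = 3

# Read the first row back positionally so duplicate labels are not ambiguous.
first_row = df.iloc[0].tolist()
print(first_row)
```

Addressing the cell by position in one call avoids both the ambiguity of the duplicated label `d` and the intermediate object that chained indexing creates.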
INSTALLED VERSIONS
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-68-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
pandas: 0.17.1
nose: 1.3.6
pip: 1.5.4
setuptools: 3.3
Cython: None
numpy: 1.10.1
scipy: 0.13.3
statsmodels: None
IPython: 3.1.0
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.3
xlwt: 0.8.0
xlsxwriter: 0.7.7
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None