BUG: iloc can create columns #6766

Closed
bergtholdt opened this issue Apr 2, 2014 · 13 comments · Fixed by #7006
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves)
@bergtholdt

After concatenating two DataFrames with the same columns, I wanted to consolidate some data by replacing NaNs in some columns with values from other columns. I ended up with a DataFrame that magically had additional columns.

This is the minimum example that I can give to reproduce the faulty behaviour using current master (70de129):

import numpy as np
import pandas as pd
df1 = pd.DataFrame([{'A':None, 'B':1},{'A':2, 'B':2}])
df2 = pd.DataFrame([{'A':3, 'B':3},{'A':4, 'B':4}])
df = pd.concat([df1, df2], axis=1)
>>> df1
    A  B
0 NaN  1
1   2  2

[2 rows x 2 columns]

>>> df2
   A  B
0  3  3
1  4  4

[2 rows x 2 columns]

>>> df
    A  B  A  B
0 NaN  1  3  3
1   2  2  4  4

[2 rows x 4 columns]

Now, replacing the NaNs in column 0 with the corresponding values in column 2 (the second 'A'), I expected to simply write a 3 into the NaN (which it did), but it also added a column '0' at the end of the DataFrame, even though iloc is not supposed to enlarge the dataset. Clearly a bug.

inds = np.isnan(df.iloc[:, 0])
df.iloc[:, 0][inds] = df.iloc[:, 2][inds]
>>> df
   A  B  A  B  0
0  3  1  3  3  3
1  2  2  4  4  2

[2 rows x 5 columns]
@jreback
Contributor

jreback commented Apr 2, 2014

it's a bug, but not for the reason you suggest.

doing ANYTHING like

df.iloc[:,0][inds] IS a chained assignment and should ALWAYS be avoided, so I wouldn't expect this to work in any event.

see here: http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-view-versus-copy

Further, using duplicate columns is very tricky and should generally be avoided.

This is a bug because this should work:

In [39]: mask = inds[inds].index

In [40]: df.iloc[mask,0] = df.iloc[mask,2]
AssertionError: Cannot create BlockManager._ref_locs because block [FloatBlock: [A], 1 x 2, dtype: float64] with duplicate items [Index([u'A', u'A'], dtype='object')] does not have _ref_locs set
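
For comparison, the single-step form of the assignment can be sketched as follows. This is only a sketch and assumes a recent pandas release in which this issue is fixed. It passes a plain boolean ndarray to iloc instead of the label-based `inds[inds].index`, and it uses `np.nan` where the report used `None`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the report; np.nan is used instead of None
# so the dtype of the first 'A' column is unambiguous.
df1 = pd.DataFrame([{'A': np.nan, 'B': 1}, {'A': 2, 'B': 2}])
df2 = pd.DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}])
df = pd.concat([df1, df2], axis=1)   # columns: A, B, A, B

# Plain boolean ndarray: no index attached, so there is nothing to align.
mask = df.iloc[:, 0].isna().to_numpy()

# One indexing call on the left-hand side (no chained assignment).
# .to_numpy() on the right strips the labels, so values are assigned by position.
df.iloc[mask, 0] = df.iloc[mask, 2].to_numpy()
```

With a single `iloc[...] = ...` call there is no intermediate object, so the question of whether that intermediate is a view or a copy never comes up.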

@jreback jreback added this to the 0.14.0 milestone Apr 2, 2014
@bergtholdt
Author

Hi,

thanks for the quick reply. I actually tested your suggested solution first since it would be the intuitive way to do it. Actually I first tried

df.iloc[inds, 0] = ...

which raised a NotImplementedError:

NotImplementedError: iLocation based boolean indexing on an integer type is not available

Then I tried this, similar to yours:

indexes = inds.nonzero()[0]
df.iloc[indexes, 0] = ...

and got the same error as you did (note that this also happens with scalars on the right-hand side).

With trial and error I got the version at the top running in an older version of pandas, but current master then started creating these extra columns (though it also wrote the values at the proper location).
Also note that I run with

pd.set_option('mode.chained_assignment', 'raise')

And did not get an error for the version at the top.
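
The `inds.nonzero()[0]` step above relies on `Series.nonzero`, which later pandas releases removed. A hedged sketch of the same mask-to-positions conversion using NumPy's `np.flatnonzero` follows. The Series and the variable names are made up for illustration and are not taken from the report.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

# Boolean mask as a plain ndarray, then the integer positions where it is True.
mask = s.isna().to_numpy()
positions = np.flatnonzero(mask)

# Both forms are purely positional, so iloc accepts either one.
by_positions = s.iloc[positions]
by_mask = s.iloc[mask]
```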

@jreback
Contributor

jreback commented Apr 2, 2014

we have to 'guess' if something is chained, as python syntax does not allow it to be detected. so it's not an error that SettingWithCopy is not raised; it's just hard to figure out.

iloc specifically does NOT take a boolean indexer, only an integer one (on purpose).

ix is the solution here, but it breaks for the same reason

df.ix[inds,2] = df.ix[inds,0] should work as well.

NEVER do chained assignment; it is just not a good idea (if this had been a single dtype it WOULD have worked; in a multi-dtype case it will only SOMETIMES work).
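
To illustrate why chaining cannot be detected from syntax alone: Python evaluates `obj[a][b] = v` as a `__getitem__` call followed by a separate `__setitem__` call on whatever the first call returned. The toy `Recorder` class below is hypothetical and is not pandas code; it only records which calls Python makes.

```python
class Recorder:
    """Toy container that logs every indexing call made on it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # The result is a brand-new object; the parent never sees what
        # happens to it afterwards.
        return Recorder(self.name + "[...]", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
obj = Recorder("obj", log)

obj["a"]["b"] = 1    # chained: getitem on obj, then setitem on the temporary
obj["a", "b"] = 1    # single call: setitem on obj with a tuple key
```

In the chained form the parent object only ever sees a read, so whether the write reaches the original data depends on whether the temporary shares memory with it.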

@bergtholdt
Author

Just another comment that might be related. After creating the DataFrame with the duplicate columns shown at the top, I also get a ValueError when doing simple indexing like:

>>> df.iloc[0,0]
>>> df.iloc[0,:]

Both raise

ValueError: Wrong number of items passed 8, index implies 4
in C:\x64\Python27\lib\site-packages\pandas-0.13.1_550_g9039338-py2.7-win-amd64.egg\pandas\core\internals.pyc:64

Whereas this works

>>> df.iloc[:,0]

@jreback
Contributor

jreback commented Apr 4, 2014

ok... these getitem issues with iloc (namely df.iloc[0,0] and df.iloc[0,:]) when the frame is created via a concat are fixed

the setting is a bit more complicated

@immerrr
Contributor

immerrr commented Apr 15, 2014

"iloc specifically does NOT take a boolean indexer, only an integer one (on purpose)."

Weird, I've always thought of iloc as a "numpy-like" rather than a "strictly-integer" indexer, and I'd expect it to work like the np.ndarray get-/setitem methods. Performance- or implementation-complexity-wise, is there a reason to force users to route boolean indexers via loc?

@jreback
Contributor

jreback commented Apr 15, 2014

the reason this was deliberately not done is that a boolean indexer normally requires alignment, which is a label-based operation

for a positional indexer, alignment is not really possible in a logical sense

for example, say you want to align a timeseries index vs an integer index: it doesn't make sense
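
A small sketch of the alignment point, assuming a recent pandas release. The Series, labels, and names are invented for illustration. A boolean Series passed to loc is matched to the data by label, while the same values as a bare ndarray can only be applied by position.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Boolean Series whose labels are in a different order than s.
mask = pd.Series([True, False, False], index=['c', 'a', 'b'])

# loc aligns the mask on labels first: the True belongs to label 'c'.
by_label = s.loc[mask]

# Stripped of its index, the mask is positional: the True is at position 0.
by_position = s.iloc[mask.to_numpy()]
```

The two selections differ, which is the ambiguity a purely positional indexer would have to resolve if it accepted a labelled mask.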

@immerrr
Contributor

immerrr commented Apr 15, 2014

"a boolean indexer normally requires alignment which is a label based operation"

That is the case if the indexer is a Series; what if it is an ndarray?

@jreback
Contributor

jreback commented Apr 15, 2014

that should work, but iirc that was taken out to make it less confusing

@hayd do you remember this?

@immerrr you might look back at the original iloc issue; it was in 0.11 (and was pretty long)

@immerrr
Contributor

immerrr commented Apr 15, 2014

I've checked on current master: iloc[ndarray] works, loc[Series] works, and so does loc[ndarray]; only iloc[Series] doesn't, which indeed makes sense. And speaking of the issue at hand, this worked for me:

In [21]: inds = np.isnan(df.iloc[:, 0])

In [22]: inds
Out[22]: 
0     True
1    False
Name: a, dtype: bool

In [23]: inds.values
Out[23]: array([ True, False], dtype=bool)

In [24]: df.iloc[inds.values, 0]
Out[24]: 
0   NaN
Name: a, dtype: float64

@jreback
Contributor

jreback commented Apr 15, 2014

#6799 fixed the last

Setitem in a duplicate frame with iloc is still not working

@hayd
Contributor

hayd commented Apr 15, 2014

IIRC I was pro iloc working with masks. I think if the dtype is bool this is not ambiguous (currently this is the only reason I use ix!). I'm not sure I understand the argument re: alignment.

@jreback
Contributor

jreback commented Apr 30, 2014

@immerrr your refactor seemed to have fixed this, thanks!
