Confusing behavior with (multi-)assignment and _LocIndexer/_IXIndexer #12947

Closed
DavidEscott opened this issue Apr 21, 2016 · 4 comments
Labels: Docs, Indexing (related to indexing on series/frames, not to indexes themselves), Usage Question

@DavidEscott

DavidEscott commented Apr 21, 2016

import pandas

df = pandas.DataFrame([[1, 2, 3, 4, 5]], columns=["A", "B", "C", "D", "E"])
# Suppose you want to set df.B to df.C when df.A == 1
# then the following both work:
df.loc[df.A== 1, "B"] = df.loc[df.A == 1, "C"]
df.ix[df.A == 1, "B"] = df.ix[df.A == 1, "C"]
# you can even mix and match them with ix on one side and loc on the other

# but maybe you have two or more columns you want to set... It's natural to think that these would work:
df.loc[df.A== 1, ["B", "C"]] = df.loc[df.A == 1, ["D", "E"]]
df.ix[df.A == 1, ["B", "C"]] = df.ix[df.A == 1, ["D", "E"]]
# but they actually just NaN out df.B and df.C (it isn't an issue of a silent copy losing updates)
# in fact the application of NaN even happens if you have singletons
df.loc[df.A== 1, ["D"]] = df.loc[df.A == 1, ["E"]]

# presumably because
type(df.ix[df.A == 1, "B"])
# is pandas.core.series.Series but 
type(df.ix[df.A == 1, ["B"]])
# is pandas.core.frame.DataFrame
# but when printed they look really similar... 
#0    3
# Name: B, dtype: int64
# versus
#    B
#0  3
# so it is easy to get confused

# If this can't be made to work in the natural fashion it would be a lot nicer if it could just throw an error
# the same way the following does:
df.ix[df.A == 1, ["B", "C"]] = df[df.A == 1]["D", "E"] 
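
The Series-versus-DataFrame distinction noted in the comments above can be checked directly. What follows is a minimal sketch, assuming a current pandas release in which `.ix` is no longer available, so only `.loc` is used. The name `mask` is introduced here for illustration and does not appear in the original report.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["A", "B", "C", "D", "E"])
mask = df["A"] == 1  # illustrative helper name, not from the original report

as_series = df.loc[mask, "B"]    # scalar column label -> Series
as_frame = df.loc[mask, ["B"]]   # list of labels -> DataFrame, even with one label

print(type(as_series).__name__)
print(type(as_frame).__name__)

# A Series carries only a row index; a DataFrame also carries column labels,
# which gives assignment a second axis to align on.
print(as_series.index.tolist())
print(as_frame.columns.tolist())
```

The extra column axis on the DataFrame result is what separates the single-column case that works from the list-of-columns case that produces NaN.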

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.4
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.3.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: None
boto: None

@jreback
Contributor

jreback commented Apr 21, 2016

You are missing the point here: when you use multiple columns, pandas will align for you, so you need to give it a raw array/list if you are doing this.

In [29]: df.loc[df.A== 1, ["B", "C"]] = df.loc[df.A == 1, ["D", "E"]].values

In [30]: df
Out[30]: 
   A  B  C  D  E
0  1  4  5  4  5
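
The alignment described above can be made visible by reindexing the right-hand side onto the target column labels. This is a sketch under assumptions: `reindex` is used only to illustrate the effect of label alignment and is not claimed to be the exact internal code path, and `.to_numpy()` is the newer spelling of `.values` in recent pandas versions. The names `mask`, `rhs`, and `aligned` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["A", "B", "C", "D", "E"])
mask = df["A"] == 1
rhs = df.loc[mask, ["D", "E"]]

# Illustration of label alignment: rhs has no columns named B or C,
# so lining it up against those labels leaves only missing values.
aligned = rhs.reindex(columns=["B", "C"])
print(aligned)

# Dropping the labels turns the right-hand side into a plain array,
# so the values are placed by position instead of by label.
df.loc[mask, ["B", "C"]] = rhs.to_numpy()
print(df)
```

Passing a raw array gives up the safety of label matching, so the shape and column order of the right-hand side have to be correct by construction.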

@jreback
Contributor

jreback commented Apr 21, 2016

I suppose you could add a warning section to the docs. Interested in that?

jreback added the Indexing, Docs, and Difficulty Novice labels on Apr 21, 2016
jreback added this to the 0.18.2 milestone on Apr 21, 2016
@DavidEscott
Author

DavidEscott commented Apr 21, 2016

I don't follow at all. Here is a little more strangeness:

In [42]: df = pandas.DataFrame([[1,2,3,4,5]], columns=["A", "B", "C", "D", "E"])

In [43]: type(df[["B","C"]])
Out[43]: pandas.core.frame.DataFrame

In [44]: type(df.loc[df.A==1, ["B","C"]])
Out[44]: pandas.core.frame.DataFrame

In [45]: df[["B", "C"]] = df[["D", "E"]]

In [46]: df
Out[46]:
   A  B  C  D  E
0  1  4  5  4  5

So I can assign a DataFrame to another DataFrame of compatible dimension just fine,
UNLESS one is a .loc or .ix of the other (and then stuff gets nulled out).

I don't understand the NaNs at all. LHS=RHS shouldn't result in LHS being None when RHS is not None. That doesn't sound like correct behavior at all.

Another weird thing that happens:

In [93]: df = pandas.DataFrame([[1,2,3,4,5]], columns=["A", "B", "C", "D", "E"])

In [94]: df2 = df.loc[:,["B","C"]]

In [95]: df3 = df.loc[:,["D","E"]]

In [96]: df2.loc[:,:] is df2
Out[96]: True

In [97]: df2.loc[:,:] = df3

In [98]: df2
Out[98]:
    B   C
0 NaN NaN

In [99]: df
Out[99]:
   A  B  C  D  E
0  1  2  3  4  5

But since df2.loc[:,:] is df2, this should be equivalent to df.loc[:,["B","C"]] = df3, which of course we have seen is not the case.

Therefore with Pandas X.foo().bar() is not the same thing as _ = X.foo(); _.bar(). That is something I find super scary.

@jreback
Contributor

jreback commented Apr 21, 2016

You are doing 2 different things. In [45] you are saying take these columns and assign them to these; this ignores alignment because it's a column assignment.

Whereas above in my [29] you are assigning part of a frame. This is a conceptual difference, and the behavior is as expected.
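
The two kinds of assignment contrasted above can be placed side by side. This is a minimal sketch, assuming a recent pandas version. The `rename` step is one possible way to make the labels match before a `.loc` assignment; it is an illustration added here, not something proposed in the thread, and the names `df1`, `df2`, `mask`, and `renamed` are likewise illustrative.

```python
import pandas as pd

cols = ["A", "B", "C", "D", "E"]

# Column assignment: the listed target columns are paired with the
# source frame's columns in order, so the differing names do not matter.
df1 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
df1[["B", "C"]] = df1[["D", "E"]]
print(df1)

# Partial-frame assignment through .loc matches on labels, so the
# source columns are renamed to the target labels first.
df2 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
mask = df2["A"] == 1
renamed = df2.loc[mask, ["D", "E"]].rename(columns={"D": "B", "E": "C"})
df2.loc[mask, ["B", "C"]] = renamed
print(df2)
```

Renaming keeps the label check in place, whereas passing a raw array (as in [29]) bypasses it; which is preferable depends on whether the labels are meaningful for the assignment.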
