BUG: NaNs generated when mutating subset of column MultiIndex via loc #45751

mouna-apperson · 2022-02-01T06:57:10Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> x = pd.DataFrame({("a","x"): [1,2,3], ("b","x"): [4,5,6], ("b", "y"): [7,8,9]})
>>> x.loc[:, 'b'] /= 2
>>> x
   a   b    
   x   x   y
0  1 NaN NaN
1  2 NaN NaN
2  3 NaN NaN

Issue Description

It appears that loc does not work correctly with augmented assignment operators (such as /=) when a scalar key matches several columns of a column MultiIndex. I would guess that there is some "fast path" code in loc which assumes a Series output if a scalar is passed for the second argument. Likely the programmer didn't consider that it may actually match multiple columns.

Expected Behavior

I would expect the above to work the same as the following:

>>> x = pd.DataFrame({("a","x"): [1,2,3], ("b","x"): [4,5,6], ("b", "y"): [7,8,9]})
>>> x.loc[:, ['b']] /= 2
>>> x
   a    b     
   x    x    y
0  1  2.0  3.5
1  2  2.5  4.0
2  3  3.0  4.5

Or, to work as the following works:

>>> x = pd.DataFrame({("a","x"): [1,2,3], ("b","x"): [4,5,6], ("b", "y"): [7,8,9]})
>>> x.loc[:, 'b'] = 2
>>> x
   a  b   
   x  x  y
0  1  2  2
1  2  2  2
2  3  2  2

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-94-generic
Version : #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.22.0
pytz : 2021.3
dateutil : 2.8.2
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


phofl · 2022-02-02T23:03:19Z

I think this is kind of expected, but certainly weird.

If you slice a MultiIndex only on one level, this level gets removed, hence your rhs has a regular index. We try aligning this with the original df, which fails, hence the nans.

Investigations welcome
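A minimal sketch of the getitem half of that explanation, using the frame from the report. It only inspects the column labels returned by each selector and does not reproduce the assignment itself; the names flat and kept are illustrative.

```python
import pandas as pd

x = pd.DataFrame({("a", "x"): [1, 2, 3], ("b", "x"): [4, 5, 6], ("b", "y"): [7, 8, 9]})

flat = x.loc[:, "b"]     # scalar key: the outer level "b" is dropped
kept = x.loc[:, ["b"]]   # list key: both column levels are kept

print(flat.columns.tolist())   # ['x', 'y']
print(kept.columns.tolist())   # [('b', 'x'), ('b', 'y')]
```

The labels of flat no longer match any label in x.columns, which is consistent with the alignment finding nothing to match and filling with NaN.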

mouna-apperson · 2022-02-02T23:36:27Z

@phofl Thanks for the reply. I'm still a bit confused why it works with the first example in the expected behavior (passing a list of only one element) and not in the case I marked as a bug. I did my best to figure out if it was behaving as designed and I'm not trying to waste anyone's time. I'm sorry if I am.

Obviously there's some reason in the code, but it is odd that ['b'] on the rhs is able to align but 'b' is not. If I'm understanding you correctly, it seems you're saying that it generates a copy with /= and modifies it, but instead of preserving the indices it sliced, it creates a new index for the copy that it then later tries to align. I guess I'm not sure I understand why it wouldn't either:

Perform the operation in place; or,
Keep a reference to the original index. It seems wasteful (performance-wise) to create a new index for a modify-assign operation.

I haven't looked too deep into pandas code and work has me terribly backlogged at the moment, but if I can get caught up on work, maybe I'll look a little deeper into it. Thanks for your reply. I'm still learning pandas.

phofl · 2022-02-02T23:40:08Z

Inplace modification is not how pandas works. We apply a getitem onto df, divide it by 2 and set it back onto the original df via setitem.

the getitem reduces the MultiIndex to a regular index, because the level is dropped.

When you assign a scalar, no alignment is done. I think if you replace 'b' with ('b', slice(None)) this should work.
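The three steps described above can be written out explicitly. This is a sketch, not the pandas internals: it uses the tuple key with slice(None) and float data (an assumption made here so that writing the halved values back needs no dtype change). The names key, rhs and labels_match are illustrative.

```python
import pandas as pd

# Float data (an assumption for this sketch), so the write-back keeps the dtype.
x = pd.DataFrame(
    {
        ("a", "x"): [1.0, 2.0, 3.0],
        ("b", "x"): [4.0, 5.0, 6.0],
        ("b", "y"): [7.0, 8.0, 9.0],
    }
)

key = ("b", slice(None))  # tuple key: selects every column under "b"

rhs = x.loc[:, key]       # 1. getitem: both column levels are kept
rhs = rhs / 2             # 2. the division runs on that intermediate frame
labels_match = rhs.columns.equals(x.columns[1:])
x.loc[:, key] = rhs       # 3. setitem: rhs is aligned against x by label

print(labels_match)
print(x)
```

Because rhs still carries the ("b", "x") and ("b", "y") labels, the alignment in step 3 has matching labels to work with, unlike the scalar 'b' case where the outer level is gone.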

sappersapper · 2022-05-25T09:03:23Z

It seems these ways can get the expected result:
x.loc[:, ['b']] /= 2
x.loc[:, ('b', slice(None))] /= 2
x['b'] /= 2

While x.loc[:, 'b'] /= 2 fails.

quite weird.

jbrockmendel · 2023-09-14T23:05:19Z

Another example from my notes:

import pandas as pd

mi = pd.MultiIndex.from_product([(0, 1), (2, 3)])
ser = pd.Series([True]*4, index=mi)

ser.loc[0,:] = ser.loc[0,:]

ATM this warns that it will raise in the future (PDEP6)

mouna-apperson added the Bug and Needs Triage labels Feb 1, 2022

phofl added the Indexing and MultiIndex labels Feb 2, 2022

mroeschke removed the Needs Triage label Feb 10, 2022

simonjayhawkins mentioned this issue Jul 20, 2022

BUG: multi-index df fillna with value not working after 1.4.3 #47649

Closed

xr-chen mentioned this issue Aug 3, 2022

BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

Merged

mroeschke mentioned this issue Sep 1, 2022

REF: avoid FutureWarning about using deprecates loc.__setitem__ non-inplace usage #48254

Merged

mesvam mentioned this issue Nov 21, 2024

BUG: MultiIndex block assignment introduces NaNs in data #40186

Open
