Skip to content

Unexpected Behavior when Setting partial row in MultiIndex-columned-Dataframe with Series #15310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
joseortiz3 opened this issue Feb 5, 2017 · 5 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@joseortiz3
Copy link
Contributor

joseortiz3 commented Feb 5, 2017

Here I demonstrate that the expected behavior of creating a pandas Series object, and using it to safely set values in a Multi-Index-Columned Dataframe does not work with the most ideal syntax, and that currently the only way to do this is needlessly cumbersome. The motivation behind using a series to set values in the dataframe is that this method uses the 'key:value' structure of the Series to set values in the Dataframe, which allows one to ignore the order of the 'keys', or even if 'keys' fully spans the second level of the Dataframe's column's multiIndex.

Edit: In other words, I want to use a dictionary-like object (such as Series, or anything with a 'key:value' structure) to safely set some values in a row of df, without worrying about the order of the values or how many of them there are.

Code Sample

>>> import pandas as pd
>>> multi_index = pd.MultiIndex.from_product([['Bob','Jon'],['hours','sales']])
>>> df = pd.DataFrame(index = pd.Index([0,1,2],name='day'), columns = multi_index)
>>> # Our dataframe's index and column start out sorted, so don't worry about that.
>>> df
      Bob         Jon      
    hours sales hours sales
day                        
0     NaN   NaN   NaN   NaN
1     NaN   NaN   NaN   NaN
2     NaN   NaN   NaN   NaN
>>> # We make a row...
>>> row = pd.Series({'sales':635.54,'hours':8})
>>>  #...that we want to insert here.
>>> df.loc[0,('Bob',list(row.index))]
Bob  hours    NaN
     sales    NaN
Name: 0, dtype: object
>>> # We try inserting that row in the specified position...
>>> df.loc[0,('Bob',list(row.index))] = row
>>> #...But nothing happens.
>>> df
      Bob         Jon      
    hours sales hours sales
day                        
0     NaN   NaN   NaN   NaN
1     NaN   NaN   NaN   NaN
2     NaN   NaN   NaN   NaN
>>> #If we insert just the values of the row, it works.
>>> df.loc[0,('Bob',list(row.index))] = row.values
>>> df
      Bob           Jon      
    hours   sales hours sales
day                          
0       8  635.54   NaN   NaN
1     NaN     NaN   NaN   NaN
2     NaN     NaN   NaN   NaN
>>> # Why? We see that the index of the row in the dataframe...
>>> df.loc[0,('Bob',list(row.index))].index
MultiIndex(levels=[[u'Bob', u'Jon'], [u'hours', u'sales']],
           labels=[[0, 0], [0, 1]])
>>> #...and the index of the Series object do not match
>>> row.index.
Index([u'hours', u'sales'], dtype='object')
>>> #If we make them match...
>>> row.index = df.loc[0,('Bob',list(row.index))].index
>>> #...Now it works flawlessly (with a different syntax)
>>> df.loc[0,row.index] = row
>>> df
      Bob           Jon      
    hours   sales hours sales
day                          
0       8  635.54   NaN   NaN
1     NaN     NaN   NaN   NaN
2     NaN     NaN   NaN   NaN

Problem description

Setting a few values in a sub-row of a Multi-Index-Columned Dataframe with a Series object should work even if the indices of the Dataframe sub-row and the Series object do not match, when there is no ambiguous way the assignment could be interpreted, like in the example.

In general, the only sorts of objects that currently can be assigned with any method to a sub-row of a MultiIndexed Dataframe are objects which either have no indices (like tuples or arrays), or have MultiIndex indices.

Expected Output

>>> #We see that the index of the row in the dataframe...
>>> df.loc[0,('Bob',list(row.index))].index
MultiIndex(levels=[[u'Bob', u'Jon'], [u'hours', u'sales']],
           labels=[[0, 0], [0, 1]])
>>> #...and the index of the Series object do not match...
>>> row.index
Index([u'hours', u'sales'], dtype='object')
>>> # But pandas has enough information to execute the assignment
>>> df.loc[0,('Bob',list(row.index))] = row #Works
>>> Even more Ideally with the simpler syntax
>>> df.loc[0,('Bob',row.index)] = row

If there is another method that works better than this, please also tell me. But it seems to me this should just work.

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.7.0
Cython: 0.24
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.5
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.1

@jreback
Copy link
Contributor

jreback commented Feb 5, 2017

well you can do this

In [8]: df.loc[0, 'Bob'] = [1, 2]

In [9]: df
Out[9]: 
      Bob         Jon      
    hours sales hours sales
day                        
0       1     2   NaN   NaN
1     NaN   NaN   NaN   NaN
2     NaN   NaN   NaN   NaN

In [11]: df.loc[0, [('Bob', 'hours'), ('Bob', 'sales')]] = [1, 2]

In [12]: df
Out[12]: 
      Bob         Jon      
    hours sales hours sales
day                        
0       1     2   NaN   NaN
1     NaN   NaN   NaN   NaN
2     NaN   NaN   NaN   NaN

Your question are confusing, but I think you want to do this, yes?

In [14]: df.loc[0, 'Bob'] = Series([1, 2], ['hours', 'sales'])

In [15]: df
Out[15]: 
      Bob         Jon      
    hours sales hours sales
day                        
0     NaN   NaN   NaN   NaN
1     NaN   NaN   NaN   NaN
2     NaN   NaN   NaN   NaN

@joseortiz3
Copy link
Contributor Author

joseortiz3 commented Feb 5, 2017

Yes, that's right. I want to use a dictionary-like object (such as Series, or anything with a 'key:value' structure) to safely set some values in a row of df, without worrying about the order of the values or how many of them there are.

Sorry it's so confusing.

I think the main issue, or a big clue, is that your series has to have a MultiIndex for it to work.

>>> good_series_1.index # Ready to set for 'Bob'
MultiIndex(levels=[[u'Bob', u'Jon'], [u'hours', u'sales']],
           labels=[[0, 0], [0, 1]])
>>> good_series_2.index
MultiIndex(levels=[[u'Bob'], [u'hours', u'sales']],
           labels=[[0, 0], [1, 0]])

Otherwise, any way of setting values will result in NaN. So the bug is perhaps the inability to match Index and MultiIndex. Right now, the best working syntax I found is

>>> s = pd.Series([101.34,2],index=pd.MultiIndex.from_product([['Bob'],['sales','hours']]))
>>> df[1] = s

But weirdly if the index is a DateTimeIndex and I want to insert it at the end, sometimes the above syntax destroys all the other data, and I have to make sure I do the super-redundant

df[last_time_in_index,'Bob'] = s #Even though s, above, already has "Bob"

The setting functionality of multiindex dataframes is clearly buggy.

@jreback
Copy link
Contributor

jreback commented Feb 5, 2017

#10440

I recall an issue about this, but sometimes hard to find things. Ok In theory this can work, but the code is a bit messy ATM. You are welcome to dive in and have a look. IIRC this has been this way for a long time.

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Feb 5, 2017
@jreback jreback added this to the 0.20.0 milestone Feb 5, 2017
@jorisvandenbossche
Copy link
Member

You also get this strange effect where assigning itself introduces NaNs:

In [39]: multi_index = pd.MultiIndex.from_product([['Bob','Jon'],['hours','sales']])
    ...: df = pd.DataFrame(0, index = pd.Index([0,1,2],name='day'), columns = multi_index)
    ...: 

In [40]: df
Out[40]: 
      Bob         Jon      
    hours sales hours sales
day                        
0       0     0     0     0
1       0     0     0     0
2       0     0     0     0

In [41]: df.loc[0, 'Bob']
Out[41]: 
hours    0
sales    0
Name: 0, dtype: int64

In [42]: df.loc[0, 'Bob'] = df.loc[0, 'Bob'] + 1  ## or even without the + 1

In [43]: df
Out[43]: 
      Bob         Jon      
    hours sales hours sales
day                        
0     NaN   NaN     0     0
1     0.0   0.0     0     0
2     0.0   0.0     0     0

But, I think the original example of @joseortiz3 should not work? (so not to be considered as a bug)
Because I think we should expect the index of the assigned value to match with the left-hand-sides index. In the example above, however, the index matches.

@joseortiz3
Copy link
Contributor Author

joseortiz3 commented Feb 5, 2017

@jorisvandenbossche It might not be considered a bug, but just to clarify I think it should work, because there is no ambiguous way that instruction could be interpreted.

Note that in your example, the assignment only works if the thing on the Right Hand Side has a MultiIndex (even if it doesn't match the index of thing on the left)

>>> df.loc[0,'Bob'] = df.loc[0,pd.IndexSlice['Bob',:]]+1
>>> df
    Bob         Jon      
  hours sales hours sales
0     1     1     0     0
1     0     0     0     0
2     0     0     0     0
>>> df.loc[0,pd.IndexSlice['Bob',:]].index     # RHS index
MultiIndex(levels=[[u'Bob', u'Jon'], [u'hours', u'sales']],
           labels=[[0, 0], [0, 1]])
>>> df.loc[0,'Bob'].index                       # LHS index
Index([u'hours', u'sales'], dtype='object')

for assigning to Dataframes with a MultiIndex axis
Working:

  • Index <- Multi Index
  • MultiIndex <- MultiIndex

Not Working:

  • Index <- Index
  • MultiIndex <- Index

@jreback jreback modified the milestones: 0.21.0, 0.20.0 Mar 23, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

No branches or pull requests

5 participants