BUG: set_index with passing key of first level of MI produces invalid result #24683

jorisvandenbossche · 2019-01-09T12:27:40Z

I haven't yet found a small reproducible example, but with the actual (also small) data, I see the following problem:

In [47]: subjects_url = 'https://physionet.org/pn4/sleep-edfx/ST-subjects.xls'   
    ...: data = pd.read_excel(subjects_url, header=[0, 1])

In [48]: data.head()                                            
Out[48]: 
  Subject - age - sex           Placebo night            Temazepam night           
                   Nr Age M1/F2      night nr lights off        night nr lights off
0                   1  60     1             1   23:01:00               2   23:48:00
1                   2  35     2             2   23:27:00               1   00:00:00
2                   4  18     2             1   23:53:00               2   22:37:00
3                   5  32     2             2   23:23:00               1   23:34:00
4                   6  35     2             1   23:28:00               2   23:26:00

When calling set_index with a key from the first level of the column MultiIndex (which I think is not supported), it actually returns a result, but an invalid one, as shown by the repr raising an error:

In [49]: res = data.set_index('Subject - age - sex')                       

In [50]: res                                         
Out[50]: ---------------------------------------------------------------------------
...
TypeError: unsupported format string passed to numpy.ndarray.__format__

The invalid part is that res.index seems to be an Int64Index, but is backed by a 2D array:

In [51]: res.index                                                     
Out[51]: 
Int64Index([ 1, 60,  1,  2, 35,  2,  4, 18,  2,  5, 32,  2,  6, 35,  2,  7, 51,
             2,  8, 66,  2,  9, 47,  1, 10, 20,  2, 11, 21,  2, 12, 21,  1, 13,
            22,  1, 14, 20,  1, 15, 66,  2, 16, 79,  2, 17, 48,  2, 18, 53,  2,
            19, 28,  2, 20, 24,  1, 21, 34,  2, 22, 56,  1, 24, 48,  2],
           dtype='int64', name='Subject - age - sex')

In [52]: res.index.values                                                  
Out[52]: 
array([[ 1, 60,  1],
       [ 2, 35,  2],
       [ 4, 18,  2],
       [ 5, 32,  2],
       [ 6, 35,  2],
       [ 7, 51,  2],
       [ 8, 66,  2],
       [ 9, 47,  1],
       [10, 20,  2],
       [11, 21,  2],
       [12, 21,  1],
       [13, 22,  1],
       [14, 20,  1],
       [15, 66,  2],
       [16, 79,  2],
       [17, 48,  2],
       [18, 53,  2],
       [19, 28,  2],
       [20, 24,  1],
       [21, 34,  2],
       [22, 56,  1],
       [24, 48,  2]])

Done with up to date master (0.24.dev)

jorisvandenbossche · 2019-01-09T12:32:26Z

And here is a smaller example (with floats, same problem):

In [60]: df = pd.DataFrame(np.random.randn(5, 4), columns=pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']]))                                                                                                     

In [61]: df                                                                                                                                                                                                         
Out[61]: 
          A                   B          
          a         b         a         b
0  2.240440  1.307720 -0.372765 -0.337577
1 -0.629807 -2.324882  0.563864  0.927735
2  0.130902 -1.504765  0.527028 -1.363527
3  0.091080  0.385927 -0.700174  0.197924
4 -0.810808 -0.334973 -3.077700 -0.739245

In [62]: res = df.set_index('A')                                                                                                                                                                                    

In [63]: res.index                                                                                                                                                                                                  
Out[63]: 
Float64Index([  2.240440060901442,   1.307720399690183, -0.6298065185919764,
              -2.3248818238121283,  0.1309021564092663,  -1.504764607116495,
              0.09107969093031175, 0.38592735287951835, -0.8108081356662055,
              -0.3349725297153279],
             dtype='float64', name='A')

In [64]: res.index.values                                                                                                                                                                                           
Out[64]: 
array([[ 2.24044006,  1.3077204 ],
       [-0.62980652, -2.32488182],
       [ 0.13090216, -1.50476461],
       [ 0.09107969,  0.38592735],
       [-0.81080814, -0.33497253]])
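
A likely reason for the 2D backing array (an illustration, not something checked against the set_index source here): selecting a first-level key from MultiIndex columns returns a DataFrame holding every sub-column, not a 1D Series. The sketch below uses deterministic data in place of the random values above, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same column layout as the example above, but with deterministic values.
columns = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]])
df = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=columns)

# A first-level key selects all sub-columns under it, so the result is 2D.
sub = df["A"]
print(type(sub).__name__)    # DataFrame
print(sub.shape)             # (5, 2)
print(list(sub.columns))     # ['a', 'b']
print(sub.to_numpy().ndim)   # 2
```

If set_index hands that selection to the Index constructor unchanged, a 2D array behind a flat index is what one would expect to see.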

arw2019 · 2020-09-24T04:28:28Z

On 1.2 master, the OP's example raises at set_index:

In [16]: df = pd.DataFrame(np.random.randn(5, 4), columns=pd.MultiIndex.from_product([['A', 'B'], ['a',
    ...:  'b']]))                                                                                      

In [17]: df                                                                                            
Out[17]: 
          A                   B          
          a         b         a         b
0  0.029458  0.639062 -0.405116  1.329762
1 -0.029833  0.670068  0.279081  0.259562
2 -0.003328 -0.585462  2.433622  1.408814
3 -0.620299 -0.255258  0.099439 -0.289729
4  0.691509 -0.801464  0.506687 -0.297512

In [18]: res = df.set_index('A')                                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-c4a76a0c5158> in <module>
----> 1 res = df.set_index('A')

/workspaces/pandas-arw2019/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4635                 )
   4636 
-> 4637         index = ensure_index_from_sequences(arrays, names)
   4638 
   4639         if verify_integrity and not index.is_unique:

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in ensure_index_from_sequences(sequences, names)
   5595         if names is not None:
   5596             names = names[0]
-> 5597         return Index(sequences[0], name=names)
   5598     else:
   5599         return MultiIndex.from_arrays(sequences, names=names)

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    393                 return UInt64Index(data, copy=copy, dtype=dtype, name=name)
    394             elif is_float_dtype(data.dtype):
--> 395                 return Float64Index(data, copy=copy, dtype=dtype, name=name)
    396             elif issubclass(data.dtype.type, bool) or is_bool_dtype(data):
    397                 subarr = data.astype("object")

/workspaces/pandas-arw2019/pandas/core/indexes/numeric.py in __new__(cls, data, dtype, copy, name)
     70         if subarr.ndim > 1:
     71             # GH#13601, GH#20285, GH#27125
---> 72             raise ValueError("Index data must be 1-dimensional")
     73 
     74         subarr = np.asarray(subarr)

ValueError: Index data must be 1-dimensional

This seems like the right behavior? xref #25567 for the fix
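
For anyone landing here who actually wants the sub-columns under 'A' as index levels, one possible workaround is to pass the full column tuples instead of the first-level key. This is a minimal sketch under that assumption, not a recommendation from the maintainers; the data is deterministic and the final rename to 'a'/'b' is an optional, illustrative step.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]])
df = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=columns)

# Each tuple is a complete column label, so each selects a 1D Series.
res = df.set_index([("A", "a"), ("A", "b")])

# Optional: the level names default to the tuples; give them short names.
res.index = res.index.set_names(["a", "b"])

print(res.index.nlevels)   # 2
print(list(res.columns))   # [('B', 'a'), ('B', 'b')]
```

The resulting index is a regular two-level MultiIndex, so the repr works and only the 'B' columns remain in the frame.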

Re: tests I think this is covered here:

pandas/pandas/tests/indexing/test_indexing.py, lines 54 to 90 in f34a56b:

    @pytest.mark.parametrize(
        "obj",
        [
            lambda i: Series(np.arange(len(i)), index=i),
            lambda i: DataFrame(np.random.randn(len(i), len(i)), index=i, columns=i),
        ],
        ids=["Series", "DataFrame"],
    )
    @pytest.mark.parametrize(
        "idxr, idxr_id",
        [
            (lambda x: x, "getitem"),
            (lambda x: x.loc, "loc"),
            (lambda x: x.iloc, "iloc"),
        ],
    )
    def test_getitem_ndarray_3d(self, index, obj, idxr, idxr_id):
        # GH 25567
        obj = obj(index)
        idxr = idxr(obj)
        nd3 = np.random.randint(5, size=(2, 2, 2))
        msg = "|".join(
            [
                r"Buffer has wrong number of dimensions \(expected 1, got 3\)",
                "Cannot index with multidimensional key",
                r"Wrong number of dimensions. values.ndim != ndim \[3 != 1\]",
                "Index data must be 1-dimensional",
                "positional indexers are out-of-bounds",
                "Indexing a MultiIndex with a multidimensional key is not implemented",
            ]
        )
        potential_errors = (IndexError, ValueError, NotImplementedError)
        with pytest.raises(potential_errors, match=msg):
            with tm.assert_produces_warning(DeprecationWarning, check_stacklevel=False):
                idxr[nd3]

so potentially this issue can be closed.
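
The test quoted above exercises indexing with a 3D key rather than set_index itself. A check aimed directly at this issue might look like the sketch below. The function name is hypothetical, it is written with plain try/except rather than pytest so it runs standalone, and it assumes a pandas version (1.2 or later, per the traceback above) where set_index raises for this input.

```python
import numpy as np
import pandas as pd


def check_set_index_first_level_key_raises():
    """Return the error message raised by set_index with a first-level key.

    Hypothetical helper for illustration; not part of the pandas test suite.
    """
    columns = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]])
    df = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=columns)
    try:
        # 'A' selects two sub-columns, so no valid flat index can be built.
        df.set_index("A")
    except ValueError as exc:
        return str(exc)
    raise AssertionError("set_index('A') did not raise ValueError")


message = check_set_index_first_level_key_raises()
print(message)
```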

jorisvandenbossche added Bug MultiIndex labels Jan 9, 2019

jorisvandenbossche added this to the Contributions Welcome milestone Jan 9, 2019

mroeschke closed this as completed Jun 25, 2021
