BUG: set_index with passing key of first level of MI produces invalid result #24683

Closed
jorisvandenbossche opened this issue Jan 9, 2019 · 2 comments

jorisvandenbossche commented Jan 9, 2019

I haven't found a small reproducible example yet, but with the actual (also small) data, I see the following problem:

In [47]: subjects_url = 'https://physionet.org/pn4/sleep-edfx/ST-subjects.xls'   
    ...: data = pd.read_excel(subjects_url, header=[0, 1])

In [48]: data.head()                                            
Out[48]: 
  Subject - age - sex           Placebo night            Temazepam night           
                   Nr Age M1/F2      night nr lights off        night nr lights off
0                   1  60     1             1   23:01:00               2   23:48:00
1                   2  35     2             2   23:27:00               1   00:00:00
2                   4  18     2             1   23:53:00               2   22:37:00
3                   5  32     2             2   23:23:00               1   23:34:00
4                   6  35     2             1   23:28:00               2   23:26:00

When calling set_index with a key from the first level of the column MultiIndex (which I think is not supported), it does return a result, but an invalid one, as shown by the repr raising an error:

In [49]: res = data.set_index('Subject - age - sex')                       

In [50]: res                                         
Out[50]: ---------------------------------------------------------------------------
...
TypeError: unsupported format string passed to numpy.ndarray.__format__

The invalid part is that res.index seems to be an Int64Index, but is backed by a 2D array:

In [51]: res.index                                                     
Out[51]: 
Int64Index([ 1, 60,  1,  2, 35,  2,  4, 18,  2,  5, 32,  2,  6, 35,  2,  7, 51,
             2,  8, 66,  2,  9, 47,  1, 10, 20,  2, 11, 21,  2, 12, 21,  1, 13,
            22,  1, 14, 20,  1, 15, 66,  2, 16, 79,  2, 17, 48,  2, 18, 53,  2,
            19, 28,  2, 20, 24,  1, 21, 34,  2, 22, 56,  1, 24, 48,  2],
           dtype='int64', name='Subject - age - sex')

In [52]: res.index.values                                                  
Out[52]: 
array([[ 1, 60,  1],
       [ 2, 35,  2],
       [ 4, 18,  2],
       [ 5, 32,  2],
       [ 6, 35,  2],
       [ 7, 51,  2],
       [ 8, 66,  2],
       [ 9, 47,  1],
       [10, 20,  2],
       [11, 21,  2],
       [12, 21,  1],
       [13, 22,  1],
       [14, 20,  1],
       [15, 66,  2],
       [16, 79,  2],
       [17, 48,  2],
       [18, 53,  2],
       [19, 28,  2],
       [20, 24,  1],
       [21, 34,  2],
       [22, 56,  1],
       [24, 48,  2]])

This was run on up-to-date master (0.24.dev).

@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Jan 9, 2019
@jorisvandenbossche
Member Author

And here is a smaller example (with floats, same problem):

In [60]: df = pd.DataFrame(np.random.randn(5, 4), columns=pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']]))                                                                                                     

In [61]: df                                                                                                                                                                                                         
Out[61]: 
          A                   B          
          a         b         a         b
0  2.240440  1.307720 -0.372765 -0.337577
1 -0.629807 -2.324882  0.563864  0.927735
2  0.130902 -1.504765  0.527028 -1.363527
3  0.091080  0.385927 -0.700174  0.197924
4 -0.810808 -0.334973 -3.077700 -0.739245

In [62]: res = df.set_index('A')                                                                                                                                                                                    

In [63]: res.index                                                                                                                                                                                                  
Out[63]: 
Float64Index([  2.240440060901442,   1.307720399690183, -0.6298065185919764,
              -2.3248818238121283,  0.1309021564092663,  -1.504764607116495,
              0.09107969093031175, 0.38592735287951835, -0.8108081356662055,
              -0.3349725297153279],
             dtype='float64', name='A')

In [64]: res.index.values                                                                                                                                                                                           
Out[64]: 
array([[ 2.24044006,  1.3077204 ],
       [-0.62980652, -2.32488182],
       [ 0.13090216, -1.50476461],
       [ 0.09107969,  0.38592735],
       [-0.81080814, -0.33497253]])
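
A likely explanation for the 2D backing array, offered as a sketch rather than a confirmed diagnosis: selecting a first-level key from MultiIndex columns returns a sub-DataFrame, not a Series, so the values that set_index receives for 'A' are already two-dimensional. The frame below mirrors the example above, but uses deterministic values instead of np.random.randn (an assumption made only for reproducibility).

```python
import numpy as np
import pandas as pd

# Same column layout as the example above; values are deterministic here.
columns = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]])
df = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=columns)

# A first-level key matches two columns, so the selection is a DataFrame.
sub = df["A"]
print(type(sub).__name__)   # DataFrame
print(sub.shape)            # (5, 2)
print(sub.to_numpy().ndim)  # 2
```

The (5, 2) shape matches the 2D res.index.values shown above, which has one row per original row and one column per second-level label under 'A'.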

arw2019 commented Sep 24, 2020

On 1.2 master, the OP's example raises at set_index:

In [16]: df = pd.DataFrame(np.random.randn(5, 4), columns=pd.MultiIndex.from_product([['A', 'B'], ['a',
    ...:  'b']]))                                                                                      

In [17]: df                                                                                            
Out[17]: 
          A                   B          
          a         b         a         b
0  0.029458  0.639062 -0.405116  1.329762
1 -0.029833  0.670068  0.279081  0.259562
2 -0.003328 -0.585462  2.433622  1.408814
3 -0.620299 -0.255258  0.099439 -0.289729
4  0.691509 -0.801464  0.506687 -0.297512

In [18]: res = df.set_index('A')                                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-c4a76a0c5158> in <module>
----> 1 res = df.set_index('A')

/workspaces/pandas-arw2019/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4635                 )
   4636 
-> 4637         index = ensure_index_from_sequences(arrays, names)
   4638 
   4639         if verify_integrity and not index.is_unique:

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in ensure_index_from_sequences(sequences, names)
   5595         if names is not None:
   5596             names = names[0]
-> 5597         return Index(sequences[0], name=names)
   5598     else:
   5599         return MultiIndex.from_arrays(sequences, names=names)

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    393                 return UInt64Index(data, copy=copy, dtype=dtype, name=name)
    394             elif is_float_dtype(data.dtype):
--> 395                 return Float64Index(data, copy=copy, dtype=dtype, name=name)
    396             elif issubclass(data.dtype.type, bool) or is_bool_dtype(data):
    397                 subarr = data.astype("object")

/workspaces/pandas-arw2019/pandas/core/indexes/numeric.py in __new__(cls, data, dtype, copy, name)
     70         if subarr.ndim > 1:
     71             # GH#13601, GH#20285, GH#27125
---> 72             raise ValueError("Index data must be 1-dimensional")
     73 
     74         subarr = np.asarray(subarr)

ValueError: Index data must be 1-dimensional

This seems like the right behavior? xref #25567 for the fix
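
For anyone who lands here after hitting this error, a possible workaround (a sketch, not something proposed in this thread) is to pass full column keys, with one label per level, so that each key selects a 1D column. The names single, multi and a_keys are illustrative only, and the frame again uses deterministic values.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]])
df = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=columns)

# One full column key (a label for every level) selects a single 1D column.
single = df.set_index([("A", "a")])

# Collect every full key under "A"; each one becomes a level of the row index.
a_keys = [key for key in df.columns if key[0] == "A"]
multi = df.set_index(a_keys)

print(single.index.values.ndim)  # 1
print(multi.index.nlevels)       # 2
```

The second form is probably closest to what set_index('A') was meant to express, since it moves both columns under 'A' into the index.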

Re: tests, I think this is covered here:

@pytest.mark.parametrize(
    "obj",
    [
        lambda i: Series(np.arange(len(i)), index=i),
        lambda i: DataFrame(np.random.randn(len(i), len(i)), index=i, columns=i),
    ],
    ids=["Series", "DataFrame"],
)
@pytest.mark.parametrize(
    "idxr, idxr_id",
    [
        (lambda x: x, "getitem"),
        (lambda x: x.loc, "loc"),
        (lambda x: x.iloc, "iloc"),
    ],
)
def test_getitem_ndarray_3d(self, index, obj, idxr, idxr_id):
    # GH 25567
    obj = obj(index)
    idxr = idxr(obj)
    nd3 = np.random.randint(5, size=(2, 2, 2))
    msg = "|".join(
        [
            r"Buffer has wrong number of dimensions \(expected 1, got 3\)",
            "Cannot index with multidimensional key",
            r"Wrong number of dimensions. values.ndim != ndim \[3 != 1\]",
            "Index data must be 1-dimensional",
            "positional indexers are out-of-bounds",
            "Indexing a MultiIndex with a multidimensional key is not implemented",
        ]
    )
    potential_errors = (IndexError, ValueError, NotImplementedError)
    with pytest.raises(potential_errors, match=msg):
        with tm.assert_produces_warning(DeprecationWarning, check_stacklevel=False):
            idxr[nd3]

so potentially this issue can be closed.
