BUG: reset_index of MultiIndex with CategoricalIndex levels with missing values fails #24206

Closed
jorisvandenbossche opened this issue Dec 10, 2018 · 3 comments · Fixed by #36876
Labels
Bug Categorical Categorical Data Type MultiIndex
Milestone

Comments

@jorisvandenbossche
Member

With a MultiIndex whose categorical levels have no missing values, this works:

In [23]: idx = pd.MultiIndex([pd.CategoricalIndex(['A', 'B']), pd.CategoricalIndex(['a', 'b'])], [[0, 0, 1, 1], [0, 1, 0, 1]])


In [25]: df = pd.DataFrame({'col': range(len(idx))}, index=idx)

In [26]: df
Out[26]:
     col
A a    0
  b    1
B a    2
  b    3

In [28]: df.reset_index()
Out[28]:
  level_0 level_1  col
0       A       a    0
1       A       b    1
2       B       a    2
3       B       b    3

Now with a missing value (note the last -1 in the labels, that's the only difference):

In [29]: idx = pd.MultiIndex([pd.CategoricalIndex(['A', 'B']), pd.CategoricalIndex(['a', 'b'])], [[0, 0, 1, 1], [0, 1, 0, -1]])

In [30]: df = pd.DataFrame({'col': range(len(idx))}, index=idx)

In [31]: df
Out[31]:
       col
A a      0
  b      1
B a      2
  NaN    3

In [32]: df.reset_index()
/home/joris/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py:4091: FutureWarning: Interpreting negative values in 'indexer' as missing values.
In the future, this will change to meaning positional indicies
from the right.

Use 'allow_fill=True' to retain the previous behavior and silence this
warning.

Use 'allow_fill=False' to accept the new behavior.
  values = values.take(labels)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda3/lib/python3.5/site-packages/pandas/core/dtypes/cast.py in maybe_upcast_putmask(result, mask, other)
    249         try:
--> 250             np.place(result, mask, other)
    251         except Exception:

~/miniconda3/lib/python3.5/site-packages/numpy/lib/function_base.py in place(arr, mask, vals)
   2371         raise TypeError("argument 1 must be numpy.ndarray, "
-> 2372                         "not {name}".format(name=type(arr).__name__))
   2373

TypeError: argument 1 must be numpy.ndarray, not Categorical

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-32-6983677cc901> in <module>()
----> 1 df.reset_index()

~/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   4136                     name = tuple(name_lst)
   4137                 # to ndarray and maybe infer different dtype
-> 4138                 level_values = _maybe_casted_values(lev, lab)
   4139                 new_obj.insert(0, name, level_values)
   4140

~/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py in _maybe_casted_values(index, labels)
   4092                     if mask.any():
   4093                         values, changed = maybe_upcast_putmask(
-> 4094                             values, mask, np.nan)
   4095             return values
   4096

~/miniconda3/lib/python3.5/site-packages/pandas/core/dtypes/cast.py in maybe_upcast_putmask(result, mask, other)
    250             np.place(result, mask, other)
    251         except Exception:
--> 252             return changeit()
    253
    254     return result, False

~/miniconda3/lib/python3.5/site-packages/pandas/core/dtypes/cast.py in changeit()
    222             # isn't compatible
    223             r, _ = maybe_upcast(result, fill_value=other, copy=True)
--> 224             np.place(r, mask, other)
    225
    226             return r, True

~/miniconda3/lib/python3.5/site-packages/numpy/lib/function_base.py in place(arr, mask, vals)
   2370     if not isinstance(arr, np.ndarray):
   2371         raise TypeError("argument 1 must be numpy.ndarray, "
-> 2372                         "not {name}".format(name=type(arr).__name__))
   2373
   2374     return _insert(arr, mask, vals)

TypeError: argument 1 must be numpy.ndarray, not Categorical
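
For anyone retrying this on a newer install, here is a minimal reproduction sketch as a plain script. It assumes pandas 0.24 or later, where the `labels` argument of `MultiIndex` was renamed to `codes`; the helper names `build_frame` and `reset_index_works` are made up for illustration. It only reports whether `reset_index()` raises on the installed version and does not assert either outcome.

```python
import pandas as pd


def build_frame():
    """Build the frame from the report: the last code of level 1 is -1 (missing)."""
    levels = [pd.CategoricalIndex(["A", "B"]), pd.CategoricalIndex(["a", "b"])]
    codes = [[0, 0, 1, 1], [0, 1, 0, -1]]  # -1 marks a missing value
    idx = pd.MultiIndex(levels=levels, codes=codes)
    return pd.DataFrame({"col": range(len(idx))}, index=idx)


def reset_index_works(df):
    """Return True if df.reset_index() succeeds, False if it raises as reported."""
    try:
        df.reset_index()
    except (TypeError, ValueError):
        return False
    return True


df = build_frame()
print(pd.__version__, reset_index_works(df))
```

On an affected version this should print `False` next to the version string; on a version that includes the fix it should print `True`.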

@jorisvandenbossche jorisvandenbossche added Bug Categorical Categorical Data Type labels Dec 10, 2018
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Dec 10, 2018
@TomAugspurger
Contributor

Note that this is already fixed for datetime-like values: https://github.com/pandas-dev/pandas/pull/24024/files#diff-1e79abbbdd150d4771b91ea60a4e1cc7R4203

I'm not sure what a general solution would look like.

@joseortiz3
Contributor

joseortiz3 commented Apr 16, 2019

I got around this issue by repeatedly calling reset_index(0), once for each level in the index. So df.reset_index(0).reset_index(0) accomplishes without error what df.reset_index() should.

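import pandas as pd

def reset_multi_index_safe(df):
    """Work around pandas issue #24206: reset_index() fails on a MultiIndex
    with a categorical level that contains a missing value."""
    try:
        return df.reset_index()
    except (TypeError, ValueError):  # pandas bug; reported as ValueError on 0.25.3 / 1.0.0
        # Fall back to resetting one level at a time until only the
        # default RangeIndex is left.
        while not isinstance(df.index, pd.RangeIndex):
            df = df.reset_index(0)
        return df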
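
Another possible workaround, sketched here under assumptions rather than taken from the pandas source, is to skip `reset_index()` for the level columns entirely: drop the index and re-insert each level's values from `get_level_values`. The function name `reset_index_via_level_values` is hypothetical, the fallback column names `level_0`, `level_1`, ... mimic what `reset_index()` produces for an unnamed MultiIndex, and the `codes` keyword again assumes pandas 0.24 or later. I have not checked this against every affected version.

```python
import pandas as pd


def reset_index_via_level_values(df):
    """Rebuild the index levels as columns without calling reset_index() on them."""
    idx = df.index
    out = df.reset_index(drop=True)  # dropping the index avoids the failing code path
    # Insert from the last level to the first so the columns end up in level order.
    for i in reversed(range(idx.nlevels)):
        name = idx.names[i] if idx.names[i] is not None else f"level_{i}"
        out.insert(0, name, idx.get_level_values(i).values)
    return out


levels = [pd.CategoricalIndex(["A", "B"]), pd.CategoricalIndex(["a", "b"])]
codes = [[0, 0, 1, 1], [0, 1, 0, -1]]  # -1 marks the missing value
idx = pd.MultiIndex(levels=levels, codes=codes)
df = pd.DataFrame({"col": range(len(idx))}, index=idx)

out = reset_index_via_level_values(df)
print(out)
```

The missing entry should come through as NaN in the `level_1` column, and the categorical dtype of each level is kept because `.values` on a `CategoricalIndex` returns a `Categorical`.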

@batterseapower
Contributor

With pandas 0.25.3 or pandas 1.0.0, this fails in a slightly different way, with "ValueError: The result input must be a ndarray." This traceback is from pandas 1.0.0:

  File "C:\Users\mboling\AppData\Local\Continuum\anaconda3\envs\pandas1test\lib\site-packages\pandas\core\frame.py", line 4600, in reset_index
    level_values = _maybe_casted_values(lev, lab)
  File "C:\Users\mboling\AppData\Local\Continuum\anaconda3\envs\pandas1test\lib\site-packages\pandas\core\frame.py", line 4551, in _maybe_casted_values
    values, _ = maybe_upcast_putmask(values, mask, np.nan)
  File "C:\Users\mboling\AppData\Local\Continuum\anaconda3\envs\pandas1test\lib\site-packages\pandas\core\dtypes\cast.py", line 272, in maybe_upcast_putmask
    raise ValueError("The result input must be a ndarray.")
ValueError: The result input must be a ndarray.

6 participants