
reset_index() doesn't work with CategoricalIndex columns #19136

Closed
Stanpol opened this issue Jan 8, 2018 · 12 comments · Fixed by #45095
Labels
good first issue, Needs Tests (Unit test(s) needed to prevent regressions)
Milestone

Comments

@Stanpol

Stanpol commented Jan 8, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': np.random.randint(2000, 2017, 10000), 'Month': np.random.randint(1, 12, 10000), 'Data': np.random.randint(0, 100, 10000)})
grouped = df.groupby(['Year', 'Month', pd.cut(df.Data, range(0, 100, 10))]).size().unstack()

# This doesn't work:
grouped.reset_index()  # raises TypeError: unorderable types: int() < str()

# This works:
grouped.columns = grouped.columns.astype('str')
grouped.reset_index()

Problem description

reset_index() should work on DataFrames whose columns are backed by any index type.

Expected Output

Two extra columns holding the MultiIndex levels (Year and Month in the example above).
Should the column labels be converted to strings? Otherwise, the documentation should specify that reset_index() doesn't work with specific types of columns.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.2.0
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

> Or documentation should specify that reset_index() doesn't work with specific types of columns.

I think that's my preference, but I don't feel strongly about it. I'd just rather not lose all the categorical information.

We could give an example like

grouped.rename(columns=str).reset_index().head()
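A self-contained version of that suggestion might look like the sketch below. It reuses the data shape from the original report but seeds the random generator so the result is reproducible; the generator, the variable names, and the explicit `observed=False` are assumptions added for illustration, not part of the thread.

```python
import numpy as np
import pandas as pd

# Seeded so the example is reproducible; the data in the thread is unseeded.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "Year": rng.integers(2000, 2017, n),
    "Month": rng.integers(1, 12, n),
    "Data": rng.integers(0, 100, n),
})

# Binning "Data" makes the third grouping key categorical, so unstack()
# yields columns labelled by intervals.
bins = pd.cut(df["Data"], range(0, 100, 10))
grouped = df.groupby(["Year", "Month", bins], observed=False).size().unstack()

# Turn every column label into its string form, then move the index levels
# (Year, Month) into ordinary columns.
flat = grouped.rename(columns=str).reset_index()
print(flat.head())
```

The interval labels survive only as strings here, which is the loss of categorical information mentioned above.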

@TomAugspurger added the API Design, Categorical (Categorical Data Type), Docs, and Dtype Conversions (Unexpected or buggy dtype conversions) labels Jan 8, 2018
@TomAugspurger added this to the Next Major Release milestone Jan 8, 2018
@maykulkarni

Hi, I'd like to take this up

@gfyoung
Member

gfyoung commented Jan 8, 2018

@maykulkarni : Go for it! No need to ask for permission if no one has said they are working on it.

@Stanpol
Author

Stanpol commented Jan 11, 2018

Actually, CategoricalIndex columns break other things as well, like assigning a new column.

df=pd.DataFrame({'Year': np.random.randint(2000,2017,10000), 'Month': np.random.randint(1,12,10000), 'Data': np.random.randint(0,100,10000)})
grouped=df.groupby(['Year','Month', pd.cut(df.Data, range(0,100,10))]).size().unstack()

grouped.assign(A=1)

# TypeError: cannot determine next label for type <class 'str'>

Maybe the error text should be more meaningful?
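Until the error is improved, one way to sidestep it is to convert categorical column labels before adding a column. The sketch below uses a hypothetical helper, `with_plain_columns`, which is not part of pandas; it assumes that losing the categorical dtype of the column labels is acceptable.

```python
import pandas as pd


def with_plain_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return `frame` with object-dtype column labels if they were categorical.

    Hypothetical helper for illustration only; not a pandas API.
    """
    if isinstance(frame.columns, pd.CategoricalIndex):
        frame = frame.copy()  # leave the caller's frame untouched
        frame.columns = frame.columns.astype(object)
    return frame


# Small frame whose columns are backed by a CategoricalIndex.
wide = pd.DataFrame([[1, 2], [3, 4]], columns=pd.CategoricalIndex(["x", "y"]))

# With plain labels, assign() can append a label that is not a category.
extended = with_plain_columns(wide).assign(A=1)
print(list(extended.columns))
```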

@Gijs-Koot

Hi, I ran into a similar problem, but I get a different error message. This seems to happen with a MultiIndex of categoricals.

    test = pd.DataFrame({
        "a": pd.Series(["a_1", "a_2"], dtype = "category"),
        "b": pd.Series(["b_1", "b_2"], dtype = "category")
    })

    problem = test.groupby(["a", "b"]).size().unstack(1)
    problem

This looks good, but

    problem.reset_index()

throws the following error

    TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

@akielbowicz
Contributor

@Gijs-Koot
I had the same problem and found a workaround after reading the post "Category Type Error":

import pandas as pd

def reset_index(df):
    '''Return a DataFrame with the index moved into columns.'''
    index_df = df.index.to_frame(index=False)
    df = df.reset_index(drop=True)
    # The order in which the frames are passed to merge matters
    # if the index contains a Categorical:
    # pd.merge(df, index_df, left_index=True, right_index=True) does not work.
    return pd.merge(index_df, df, left_index=True, right_index=True)

It works and keeps the categorical type:

problem

b    b_1  b_2
a
a_1    1  NaN
a_2  NaN    1

solved = reset_index(problem)
solved

     a  b_1  b_2
0  a_1    1  NaN
1  a_2  NaN    1

solved.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
a 2 non-null category
b_1 1 non-null float64
b_2 1 non-null float64
dtypes: category(1), float64(2)
memory usage: 210.0 bytes

@Dr-Irv
Contributor

Dr-Irv commented Nov 14, 2018

The same error occurs when using pivot(), so you don't need to use groupby():

In [1]: import pandas as pd
   ...: pd.__version__
   ...:
Out[1]: '0.23.4'

In [2]: df = pd.DataFrame({"C1" : ['A0', 'A0', 'A0', 'A1', 'A1'],
   ...:                    "C2" : ['C', 'F', 'O', 'C', 'F'],
   ...:                    "V" : [(i+1)/10.0 for i in range(5)]
   ...:                   })
   ...: df
   ...:
Out[2]:
   C1 C2    V
0  A0  C  0.1
1  A0  F  0.2
2  A0  O  0.3
3  A1  C  0.4
4  A1  F  0.5

In [3]: df.C1 = df.C1.astype('category')
   ...: df.C2 = df.C2.astype("category")
   ...: df = df.set_index(['C1'])
   ...: df
   ...:
Out[3]:
   C2    V
C1
A0  C  0.1
A0  F  0.2
A0  O  0.3
A1  C  0.4
A1  F  0.5

In [4]: piv = df.pivot(columns="C2", values="V")
   ...: piv
   ...:
Out[4]:
C2    C    F    O
C1
A0  0.1  0.2  0.3
A1  0.4  0.5  NaN

In [5]: piv.reset_index()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-76239aefc8a2> in <module>
----> 1 piv.reset_index()

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   4137                 # to ndarray and maybe infer different dtype
   4138                 level_values = _maybe_casted_values(lev, lab)
-> 4139                 new_obj.insert(0, name, level_values)
   4140
   4141         new_obj.index = new_index

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
   3220         value = self._sanitize_column(column, value, broadcast=False)
   3221         self._data.insert(loc, column, value,
-> 3222                           allow_duplicates=allow_duplicates)
   3223
   3224     def assign(self, **kwargs):

C:\Anaconda3\lib\site-packages\pandas\core\internals.py in insert(self, loc, item, value, allow_duplicates)
   4342
   4343         # insert to the axis; this could possibly raise a TypeError
-> 4344         new_axis = self.items.insert(loc, item)
   4345
   4346         block = make_block(values=value, ndim=self.ndim,

C:\Anaconda3\lib\site-packages\pandas\core\indexes\category.py in insert(self, loc, item)
    765         code = self.categories.get_indexer([item])
    766         if (code == -1) and not (is_scalar(item) and isna(item)):
--> 767             raise TypeError("cannot insert an item into a CategoricalIndex "
    768                             "that is not already an existing category")
    769

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
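The last frame of the traceback can be isolated from DataFrame entirely. The sketch below calls `CategoricalIndex.insert` directly; the `try`/`except` is an assumption added so that the snippet ends in the same state on versions that raise, as 0.23.4 does above, and on versions that do not.

```python
import pandas as pd

cols = pd.CategoricalIndex(["C", "F", "O"], name="C2")

# A label that is already a category can be inserted, and the result
# stays categorical.
same = cols.insert(0, "C")

# "C1" is not a category. If insert() raises TypeError, as in the traceback,
# widen the labels to object dtype by hand before inserting.
try:
    widened = cols.insert(0, "C1")
except TypeError:
    widened = cols.astype(object).insert(0, "C1")

print(type(same).__name__, list(widened))
```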

@Dr-Irv
Contributor

Dr-Irv commented Nov 14, 2018

@TomAugspurger I'm not sure this is a "good first issue" with "Effort: Low". I thought I would fix it by changing the error message, but in the various examples given above, the error messages are different, and the paths that raise the TypeError are different as well. So I fear there are a lot of potential places that we need to worry about.

Fundamentally, I have to wonder if we should allow a CategoricalIndex as the index for the columns. So maybe a better fix would be that we raise errors if a user tries to make the columns backed by a CategoricalIndex, and if we do it internally in pandas (such as due to a groupby() or pivot()), we silently convert the index for the columns to use string names and make it a regular Index.

This needs some discussion. I ran into this problem in some work that I'm doing, and now I have a workaround by doing rename(columns=str), but I don't think the solution for pandas is to document that, given the wide variety of error messages that can occur.

Your thoughts?

@Dr-Irv
Contributor

Dr-Irv commented Nov 14, 2018

So I did a little more investigation, and I think the fundamental problem is that unstack() and pivot() are creating CategoricalIndex for the columns.

Note that when the columns are backed by a DatetimeIndex and you create a new column that is a string (via assign() or reset_index() or df['A']=...), then pandas does convert the DatetimeIndex to an Index of objects.
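A small check of that DatetimeIndex behaviour, as a sketch; the frame and the column name `total` are made up for illustration.

```python
import pandas as pd

# Columns backed by a DatetimeIndex.
dated = pd.DataFrame(
    [[1, 2]],
    columns=pd.to_datetime(["2018-01-01", "2018-01-02"]),
)

# Adding a string-labelled column cannot keep the datetime dtype, so the
# column index is converted to a plain object-dtype Index.
dated["total"] = 0
print(dated.columns)
```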

So I see four possible solutions:

  1. Have pivot() and unstack() change the CategoricalIndex to a regular Index.
  2. Do (1), but add keyword arguments to control the behavior.
  3. Modify the operations that create new columns, such as reset_index(), assign() and df['A']=..., to silently convert the CategoricalIndex to an Index, as is already done for DatetimeIndex.
  4. Modify the same operations to create a new category when the columns are backed by a CategoricalIndex.

IMHO, (1) is easiest, and I think it makes the most sense. With respect to (3), it seems it would be a bit of work to figure out all the places that new columns are created.
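At the level of the column index, options (3) and (4) differ roughly as in the sketch below. This only illustrates the two outcomes on a standalone `CategoricalIndex`; it is not how pandas would implement either option internally, and the label `C1` is an example value.

```python
import pandas as pd

cols = pd.CategoricalIndex(["C", "F", "O"], name="C2")
new_label = "C1"

# Option 3: drop the categorical dtype, then insert. The labels become a
# plain object-dtype Index.
option_3 = cols.astype(object).insert(0, new_label)

# Option 4: register the new label as a category first, so the insert is
# valid and the result is still a CategoricalIndex.
option_4 = cols.add_categories([new_label]).insert(0, new_label)

print(type(option_3).__name__, type(option_4).__name__)
```

Both results carry the same labels in the same order; only option (4) keeps the categories, at the cost of growing them with every added column.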

@TomAugspurger Need your opinion on this.

@TomAugspurger
Contributor

> I think that's my preference, but I don't feel strongly about it. I'd just rather not lose all the categorical information.

I think my opinion has changed since writing this. reset_index should work basically like concat, and upcast to object-dtype if needed.


There are two issues in the original post. First, there's a bug in CategoricalIndex[Interval].__contains__: #23705

Second, there's the issue of what to do for reset_index when the cols are categorical. In that case, I think we should convert to object dtype (drop the categorical info).

Simple test case:

In [21]: pd.DataFrame(columns=pd.Categorical(['a', 'b'])).reset_index()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-5bd5a62d9f80> in <module>()
----> 1 pd.DataFrame(columns=pd.Categorical(['a', 'b'])).reset_index()

~/sandbox/pandas/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   4274                 # to ndarray and maybe infer different dtype
   4275                 level_values = _maybe_casted_values(lev, lab)
-> 4276                 new_obj.insert(0, name, level_values)
   4277
   4278         new_obj.index = new_index

~/sandbox/pandas/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   3343         value = self._sanitize_column(column, value, broadcast=False)
   3344         self._data.insert(loc, column, value,
-> 3345                           allow_duplicates=allow_duplicates)
   3346
   3347     def assign(self, **kwargs):

~/sandbox/pandas/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1164
   1165         # insert to the axis; this could possibly raise a TypeError
-> 1166         new_axis = self.items.insert(loc, item)
   1167
   1168         block = make_block(values=value, ndim=self.ndim,

~/sandbox/pandas/pandas/core/indexes/category.py in insert(self, loc, item)
    792         code = self.categories.get_indexer([item])
    793         if (code == -1) and not (is_scalar(item) and isna(item)):
--> 794             raise TypeError("cannot insert an item into a CategoricalIndex "
    795                             "that is not already an existing category")
    796

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

@Dr-Irv I don't think we should change the behavior of pivot or unstack. I'm hopeful that 3 won't be too difficult. I think that BlockManager.insert may need to become aware of how to cast between indexes with multiple types.

bryevdv pushed a commit to bokeh/bokeh that referenced this issue Nov 29, 2018
* Transform Pandas Columns CategoricalIndex in list

#8420
pandas-dev/pandas#19136

* add test conversion DF with CategoricalIndex column to CDS

* remove white spaces for code quality
@yujund

yujund commented Mar 11, 2020

I tried converting the CategoricalIndex to a regular Index, and it worked for me.

df1.columns = df1.columns.astype(list)
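The same idea applied to the pivot() example from earlier in the thread, as a sketch. It spells the target dtype as `object` rather than `list`, on the assumption that an object-dtype Index is the intended result, and it passes `index="C1"` to pivot() instead of calling set_index() first.

```python
import pandas as pd

df = pd.DataFrame({
    "C1": pd.Categorical(["A0", "A0", "A0", "A1", "A1"]),
    "C2": pd.Categorical(["C", "F", "O", "C", "F"]),
    "V": [0.1, 0.2, 0.3, 0.4, 0.5],
})
piv = df.pivot(index="C1", columns="C2", values="V")

# Replace the categorical column labels with a plain object-dtype Index,
# keeping the labels themselves unchanged.
piv.columns = piv.columns.astype(object)

# The index level can now be inserted as a regular column.
flat = piv.reset_index()
print(flat)
```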

@mroeschke
Member

This looks fixed on master now. Could use a test

In [34]: df=pd.DataFrame({'Year': np.random.randint(2000,2017,10000), 'Month': np.random.randint(1,12,10000), 'Data': np.random.randint(0,100,10000)})
    ...: grouped=df.groupby(['Year','Month', pd.cut(df.Data, range(0,100,10))]).size().unstack()
    ...:
    ...: # This doesn't work:
    ...: grouped.reset_index()
Out[34]:
Data  Year  Month  (0, 10]  (10, 20]  (20, 30]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (70, 80]  (80, 90]
0     2000      1        9         2         4         7         9         5         6         6         7
1     2000      2        1         4         4         6         4         4         2         5         6
2     2000      3        7         6         9         3         8         6         4         4         6
3     2000      4        6         8         5         4         6         7         9         5         7
4     2000      5        3         4         6         3         7         7         4         6         5
..     ...    ...      ...       ...       ...       ...       ...       ...       ...       ...       ...
182   2016      7        6         5         4         6        11         6         7         7         7
183   2016      8        3         9         6        11         6         5         7         5         3
184   2016      9        5         3         8         8         6         2         2         4         3
185   2016     10        5         3         3         2         7         8         9         6         4
186   2016     11        3         5         7         4         8         6         7         2         4

[187 rows x 11 columns]
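A regression test along the requested lines could start from the simple test case given earlier in the thread. The sketch below is only a starting point: the test name is hypothetical, it is not the test that was eventually merged, and it assumes a pandas version that already contains the fix (on 0.22 to 0.23 the call raises the TypeError shown in the tracebacks).

```python
import pandas as pd


def test_reset_index_with_categorical_columns():
    # Hypothetical test name; mirrors the "simple test case" from the thread.
    df = pd.DataFrame(columns=pd.Categorical(["a", "b"]))

    result = df.reset_index()

    # The former index is inserted under the default "index" label, which is
    # not a category, so the column labels can no longer be categorical.
    assert list(result.columns) == ["index", "a", "b"]
    assert not isinstance(result.columns, pd.CategoricalIndex)
    assert len(result) == 0


test_reset_index_with_categorical_columns()
```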

@mroeschke added the Needs Tests (Unit test(s) needed to prevent regressions) label and removed the API Design, Categorical (Categorical Data Type), Docs, and Dtype Conversions (Unexpected or buggy dtype conversions) labels Jun 12, 2021
@jreback modified the milestones: Contributions Welcome, 1.4 Dec 28, 2021