Wrong error from "reset_index()" when columns are MultiIndex and index name is incomplete column name #16120

Closed
toobaz opened this issue Apr 24, 2017 · 13 comments · Fixed by #16126
Labels: Bug, Indexing (Related to indexing on series/frames, not to indexes themselves), MultiIndex
Milestone

Comments

@toobaz
Member

toobaz commented Apr 24, 2017

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame(index=range(2), columns=pd.MultiIndex.from_tuples([['A', 'a'], ['B', '']]))

In [3]: df.index.name = 'B'

In [4]: df.reset_index()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-6983677cc901> in <module>()
----> 1 df.reset_index()

/home/pietro/nobackup/repo/pandas/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   3056                     name = tuple(name_lst)
   3057             values = _maybe_casted_values(self.index)
-> 3058             new_obj.insert(0, name, values)
   3059 
   3060         new_obj.index = new_index

/home/pietro/nobackup/repo/pandas/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   2517         value = self._sanitize_column(column, value, broadcast=False)
   2518         self._data.insert(loc, column, value,
-> 2519                           allow_duplicates=allow_duplicates)
   2520 
   2521     def assign(self, **kwargs):

/home/pietro/nobackup/repo/pandas/pandas/core/internals.py in insert(self, loc, item, value, allow_duplicates)
   3808         if not allow_duplicates and item in self.items:
   3809             # Should this be a different kind of error??
-> 3810             raise ValueError('cannot insert %s, already exists' % item)
   3811 
   3812         if not isinstance(loc, int):

TypeError: not all arguments converted during string formatting

Problem description

Clearly % item should be replaced with % (item,).
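To make the cause concrete, here is a minimal sketch in plain Python (no pandas needed). The helper names format_buggy and format_fixed are made up for illustration; only the format string is taken from the traceback above. When the right-hand side of % is a tuple, each element is treated as a separate argument, so a two-element column label does not fit a single %s.

```python
# The MultiIndex column label that insert() tries to report.
item = ('B', '')


def format_buggy(item):
    # A tuple on the right of % is unpacked into separate arguments,
    # so two values are supplied for a single %s placeholder.
    return 'cannot insert %s, already exists' % item


def format_fixed(item):
    # Wrapping the label in a one-element tuple passes it as one argument.
    return 'cannot insert %s, already exists' % (item,)


buggy_error = None
try:
    format_buggy(item)
except TypeError as exc:
    buggy_error = str(exc)

fixed_message = format_fixed(item)
print(buggy_error)
print(fixed_message)
```

With a plain string label such as 'B' both helpers behave the same, which is probably why the problem only shows up with MultiIndex columns.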

Expected Output

cannot insert ('B', ''), already exists

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.0-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.utf8
LOCALE: it_IT.UTF-8

pandas: 0.20.0rc1+7.gf8b25c282
pytest: 3.0.6
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
xarray: 0.9.1
IPython: 5.1.0.dev
sphinx: 1.4.9
patsy: 0.3.0-dev
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz changed the title from "Wrong error from "reset_index()" when columns are MultiIndex and index name is already present" to "Wrong error from "reset_index()" when columns are MultiIndex and index name is incomplete column name" Apr 24, 2017
@jreback
Contributor

jreback commented Apr 24, 2017

yep, must not be very well tested.

@jreback added the Bug, Difficulty Novice, Indexing (Related to indexing on series/frames, not to indexes themselves), and MultiIndex labels Apr 24, 2017
@jreback jreback added this to the Next Major Release milestone Apr 24, 2017
@toobaz
Member Author

toobaz commented Apr 24, 2017

By the way:

df = pd.DataFrame(index=range(2), columns=pd.MultiIndex.from_tuples([['A', 'a'], ['B', 'b']]))
df.index.name = ('C', 'c')
df.reset_index()

... also fails, but I guess it's not supported (otherwise, I will open a new issue).

@jreback
Contributor

jreback commented Apr 24, 2017

That's not supported. I suppose it could be, but it's a pretty odd thing to do.

@jorisvandenbossche
Member

Would it be useful to provide a name arg to reset_index, where you can pass a name for the to-be-created column(s)? With that argument, passing a tuple as the name of a multi-indexed column could perhaps be supported. (It would also be more generally useful, since it avoids the non-chainable df.index.name = .. before calling reset_index.)

@jreback
Contributor

jreback commented Apr 25, 2017

Actually, you can do this with col_level, I think.

http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.DataFrame.reset_index.html?highlight=reset_index#pandas.DataFrame.reset_index

We should evaluate those args, though.

@toobaz
Member Author

toobaz commented Apr 25, 2017

@jreback We could make col_level accept a list/tuple maybe... but currently, I'm afraid @jorisvandenbossche 's suggestion doesn't have an equivalent.

By the way, if we do want the name arg or something analogous, then I guess we would also want my above example to work. Currently, the following is broken:

df.set_index(df.columns[0]).reset_index()

if df is, for instance, pd.DataFrame(columns=pd.MultiIndex.from_product([['A'], ['a', 'b']])).

@jreback
Contributor

jreback commented Apr 25, 2017

By the way, if we do want the name arg or something analogous, then I guess we would also want my above example to work. Currently, the following is broken:

Certainly not. This is exactly according to specs. This only takes a scalar or a list; you passed a tuple, which is inherently ambiguous.

In [6]: df = pd.DataFrame(columns=pd.MultiIndex.from_product([['A'], ['a', 'b']]))

In [7]: df.set_index([df.columns[0]]).reset_index()
ValueError: setting an array element with a sequence

In [8]: df.columns[0]
Out[8]: ('A', 'a')

@jreback jreback modified the milestones: 0.20.0, Next Major Release Apr 25, 2017
@toobaz
Member Author

toobaz commented Apr 25, 2017

certainly not. This is exactly according to specs. This only takes a scalar or a list, you passed a tuple, which is inherently ambiguous.

OK, I follow you, but not your example that follows: you pass a list, so according to your argument it should work, shouldn't it?

(in fact, while I agree that df.set_index(df.columns[0]) can be ambiguous - until the day we finally agree on the fact that tuples for pandas are just not generic sequences of things - it works fine, the error comes from the reset_index() that follows)

@jreback
Contributor

jreback commented Apr 25, 2017

OK, I follow you, but not your example that follows: you pass a list, so according to your argument it should work, shouldn't it?

That was a typo. I guess in theory it should work, but what does it mean to set a MultiIndex-named column as a single column? Then you have names as tuples.

In [4]: df = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_product([['A'], ['a', 'b']]))

In [5]: df
Out[5]: 
   A   
   a  b
0  1  2

In [6]: df[('A', 'a')]
Out[6]: 
0    1
Name: (A, a), dtype: int64

this gets really complicated really fast and I don't see utility here.

@toobaz
Member Author

toobaz commented Apr 25, 2017

this gets really complicated really fast and I don't see utility here.

OK, indeed I'm not pushing too strongly on this. But if we do agree on implementing @jorisvandenbossche 's suggestion that for instance

df.reset_index(name=('C', 'c'))

should interpret C and c as the two labels in the two levels of the same column, then it would be weird not to allow reset_index() to automatically do the same thing from an index named ('C', 'c'). Anyway, these are details; first it must be decided whether name= is worth implementing. Clearly, it should accept a single name (including a tuple, as above) or, if the index is a MultiIndex, a list of names.

@jreback
Contributor

jreback commented Apr 25, 2017

I actually don't think adding name is worthwhile either.

@jorisvandenbossche
Member

Well, setting aside the discussion of whether a tuple should be interpreted in the MultiIndex case (I agree with Pietro that it would be logical for it to work the same way as with the existing index name), a possible name argument would have the following use:

Consider, e.g., this dataframe:

df = pd.DataFrame(np.random.randn(3,2), index=pd.date_range("2012-01-01", periods=3), columns=['A', 'B'])

If you want to reset the index and convert it into a column with a certain name, then instead of this

df.index.name = 'time'
df = df.reset_index()

you could do it in a slightly shorter way:

df = df.reset_index(name='time')

So it is mainly a convenience, plus the ability to do it in a chain.
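For comparison, a chainable spelling that should already work with existing pandas is to name the index with rename_axis before resetting it. This is only a sketch of an alternative, not the name= argument proposed above; it assumes pandas is installed and uses fixed values instead of random data so the result is reproducible.

```python
import pandas as pd

# Same shape as the example above, with fixed values instead of random ones.
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=pd.date_range("2012-01-01", periods=3),
    columns=["A", "B"],
)

# rename_axis returns a new frame with the index named 'time',
# so the whole operation stays in a single method chain.
chained = df.rename_axis("time").reset_index()
print(chained.columns.tolist())
```

The original df keeps its unnamed index, because rename_axis returns a new object rather than modifying the frame in place.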

@jorisvandenbossche
Member

So I have argued about this before :-) #6878
The main obstacle was that there is already a name arg in Series.reset_index (not in DataFrame.reset_index), which does something else (it names the series values, not the series index).
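A small sketch of that existing Series behaviour, assuming pandas is installed (the series contents and the name 'value' are made up for illustration): the name argument labels the column holding the values, while the column created from the unnamed index gets the default label.

```python
import pandas as pd

s = pd.Series([10, 20], index=["x", "y"])

# name= applies to the values column, not to the column built from the index.
frame = s.reset_index(name="value")
print(frame.columns.tolist())
```

This is why reusing name= in DataFrame.reset_index for the index column would give the two methods conflicting meanings for the same keyword.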

3 participants