Management of null values as index names #18988

toobaz · 2017-12-29T15:02:02Z

Code Sample, a copy-pastable example if possible

In [2]: mi = pd.MultiIndex.from_product([[0, 1]]*2, names=[np.nan, float('nan')])

In [3]: mi.get_level_values(float('nan'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-8a7d24d1cd67> in <module>()
----> 1 mi.get_level_values(float('nan'))

/home/nobackup/repo/pandas/pandas/core/indexes/multi.pyc in get_level_values(self, level)
    976         Index(['d', 'e', 'f'], dtype='object', name='level_2')
    977         """
--> 978         level = self._get_level_number(level)
    979         values = self._get_level_values(level)
    980         return values

/home/nobackup/repo/pandas/pandas/core/indexes/multi.pyc in _get_level_number(self, level)
    674         except ValueError:
    675             if not isinstance(level, int):
--> 676                 raise KeyError('Level %s not found' % str(level))
    677             elif level < 0:
    678                 level += self.nlevels

KeyError: 'Level nan not found'

In [4]: mi.get_level_values(np.nan)
Out[4]: Int64Index([0, 0, 1, 1], dtype='int64', name=nan)

Problem description

Strictly speaking, the above is not a bug in pandas, since float('nan') != float('nan') (whereas np.nan is np.nan, so the lookup in In [4] finds the very same object). However, this behaviour is hardly useful or expected. None is also a bad label for referring to a level, since it is not necessarily unique (see this comment).
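A minimal sketch of the difference in plain Python, with no pandas involved. It assumes the name lookup behaves like `list.index`, which the `except ValueError` in the traceback above suggests but does not prove; `lookup` is a hypothetical helper written only for this illustration.

```python
def lookup(names, key):
    """Return the position of key in names, or None if list.index fails."""
    try:
        return names.index(key)
    except ValueError:
        return None


nan_a = float('nan')
nan_b = float('nan')  # a second, distinct NaN object

# NaN never compares equal, not even to itself.
assert nan_a != nan_a
assert nan_a != nan_b

names = [nan_a]

# list.index checks identity before equality, so the same object is found...
same_object = lookup(names, nan_a)

# ...while a freshly created NaN is not.
other_object = lookup(names, nan_b)
```

This is why passing the shared `np.nan` object can succeed while a new `float('nan')` cannot.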

So my suggestion is:

- allow users to set duplicate names when such names are in {None, np.nan, pd.NaT} (Problems when accessing MultiIndex level with duplicated level names #18872 only allows None)
- immediately raise if MultiIndex.get_level_values(.) (or analogous) is called with None, or np.nan, or pd.NaT.
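The second point could be sketched roughly as follows. `check_level_key` is a hypothetical name, not an existing pandas function, and the self-inequality test is an assumption standing in for a proper null check such as `pd.isna`; it relies on NaN (and `pd.NaT`) comparing unequal to themselves.

```python
def check_level_key(key):
    """Hypothetical guard: reject null-like keys before any level lookup."""
    # None is the "unset" name; NaN-like scalars are the ones that
    # compare unequal to themselves.
    is_null = key is None or bool(key != key)
    if is_null:
        raise ValueError(
            "null value %r cannot be used to select a level" % (key,)
        )
    return key
```

With a guard like this, both `In [3]` and `In [4]` would raise the same `ValueError` regardless of which NaN object is passed.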

Expected Output

ValueError from both In [3]: and In [4]:

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: 10edfd0
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: None.None

pandas: 0.23.0.dev0+5.g10edfd06c
pytest: 3.2.1
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.9
patsy: 0.4.1+dev
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 0.7.5
xlsxwriter: 0.9.6
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1


jreback · 2017-12-29T15:06:10Z

I think we only allow actual None as a level name (IOW it's not set), so if you find an isna name that is not None I would raise on construction. I know that we could simply convert an isna name to None, but better to be explicit I think.

Should also disallow an isna value when accessing a required level name.
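For comparison, a rough sketch of the stricter construction-time rule described in this comment: None stays allowed (and may repeat), any other null-like name raises. `validate_level_names` is a hypothetical helper, and the `name != name` test is again an assumption standing in for a real isna check.

```python
def validate_level_names(names):
    """Hypothetical constructor-time check: None is fine, other nulls are not."""
    for position, name in enumerate(names):
        if name is None:
            continue  # None means "name not set" and may appear repeatedly
        if name != name:  # true only for NaN-like values
            raise ValueError(
                "level %d has a null name %r; use None instead"
                % (position, name)
            )
    return list(names)
```

Under this rule the `MultiIndex.from_product(...)` call in `In [2]` would fail immediately, instead of producing an index whose levels cannot be reliably selected by name.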

jreback · 2017-12-29T15:07:24Z

We could/should also do this for Index/Series names as well, though it may break things.

toobaz · 2017-12-29T15:15:56Z

> I think we only allow actual None as a level name (IOW it's not set), so if you find an isna name that is not None I would raise on construction. I know that we could simply convert an isna name to None, but better to be explicit I think.

I would agree if I wasn't scared that some (weird) data becomes unloadable (just a generic fear, I don't have any example in mind).

> Should also disallow an isna value when accessing a required level name.

Not following you here: you mean e.g. as argument to get_level_values()? But once they are forbidden as level names, then this will just reliably raise a KeyError which I think is fine.

toobaz · 2017-12-29T15:18:22Z

> We could/should also do this for Index/Series names as well, though it may break things.

On this I'm skeptical, because if we have a DataFrame with np.nan in its columns, accessing the corresponding column results in a Series named np.nan... and I don't think we want to break this.

jreback added the Error Reporting and MultiIndex labels Dec 29, 2017

jreback added this to the Next Major Release milestone Dec 29, 2017

jreback added the Difficulty Intermediate label Dec 29, 2017

MTKnife mentioned this issue Mar 6, 2018

Inconsistent and Hard-to-reproduce Issues with Null Index Values #20018

Closed

jbrockmendel removed the Difficulty Intermediate label Oct 21, 2019

mroeschke added the Enhancement label Jun 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
