Management of null values as index names #18988

Open
toobaz opened this issue Dec 29, 2017 · 4 comments
Labels
Enhancement, Error Reporting (Incorrect or improved errors from pandas), MultiIndex

Comments

toobaz (Member) commented Dec 29, 2017

Code Sample, a copy-pastable example if possible

In [2]: mi = pd.MultiIndex.from_product([[0, 1]]*2, names=[np.nan, float('nan')])

In [3]: mi.get_level_values(float('nan'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-8a7d24d1cd67> in <module>()
----> 1 mi.get_level_values(float('nan'))

/home/nobackup/repo/pandas/pandas/core/indexes/multi.pyc in get_level_values(self, level)
    976         Index(['d', 'e', 'f'], dtype='object', name='level_2')
    977         """
--> 978         level = self._get_level_number(level)
    979         values = self._get_level_values(level)
    980         return values

/home/nobackup/repo/pandas/pandas/core/indexes/multi.pyc in _get_level_number(self, level)
    674         except ValueError:
    675             if not isinstance(level, int):
--> 676                 raise KeyError('Level %s not found' % str(level))
    677             elif level < 0:
    678                 level += self.nlevels

KeyError: 'Level nan not found'

In [4]: mi.get_level_values(np.nan)
Out[4]: Int64Index([0, 0, 1, 1], dtype='int64', name=nan)

Problem description

Strictly speaking, the above is not a bug in pandas, since float('nan') != float('nan') (whereas np.nan is np.nan, which is why the lookup in In [4] succeeds); however, it is hardly useful or expected. None is also a bad label to use in reference to a level, since it is not necessarily unique (see this comment).
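The asymmetry between In [3] and In [4] can be reproduced without pandas. The sketch below assumes the level lookup behaves like `list.index` (the `except ValueError` in the traceback suggests this, but it is an assumption), and uses `math.nan` as a stand-in for the `np.nan` singleton. The helper `find_level` is hypothetical and exists only for illustration.

```python
import math

# math.nan plays the role of np.nan here: one shared float object.
# float("nan") builds a brand-new NaN object on every call.
names = [math.nan, float("nan")]


def find_level(names, key):
    """Hypothetical helper: position of key in names, or None if absent."""
    try:
        # list.index checks identity first and only then equality,
        # so a NaN is found only when it is the very same object.
        return names.index(key)
    except ValueError:
        return None


found_singleton = find_level(names, math.nan)   # same object as names[0]
found_fresh = find_level(names, float("nan"))   # new object, equal to nothing

print(found_singleton, found_fresh)
```

Since NaN never compares equal to anything, including itself, only the identity check can match, which is why a freshly built `float('nan')` can never be used to look up a level.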

So my suggestion is:

Expected Output

ValueError from both In [3]: and In [4]:

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 10edfd0
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: None.None

pandas: 0.23.0.dev0+5.g10edfd06c
pytest: 3.2.1
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.9
patsy: 0.4.1+dev
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 0.7.5
xlsxwriter: 0.9.6
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

jreback (Contributor) commented Dec 29, 2017

I think we only allow an actual None as a level name (IOW it's not set), so if you find an isna name that is not None, I would raise on construction. I know that we could simply convert an isna name to None, but better to be explicit, I think.

We should also disallow an isna value when accessing a required level name.
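A minimal sketch of the "raise on construction" idea, not the pandas implementation: `validate_level_names` and `is_nan_like` are hypothetical names, and `is_nan_like` is a simplified stand-in for `pd.isna` that only recognises float NaN (which includes `np.nan`, since it is a Python float).

```python
import math


def is_nan_like(name):
    """Simplified stand-in for pd.isna: True only for a float NaN."""
    return isinstance(name, float) and math.isnan(name)


def validate_level_names(names):
    """Return names as a list, raising ValueError on null names other than None.

    None stays allowed because it means "name not set".
    """
    for position, name in enumerate(names):
        if name is not None and is_nan_like(name):
            raise ValueError(
                f"Level name at position {position} is a null value "
                f"other than None: {name!r}"
            )
    return list(names)


ok_names = validate_level_names([None, "a", 0])

try:
    validate_level_names(["a", float("nan")])
    rejected = False
except ValueError:
    rejected = True

print(ok_names, rejected)
```

Raising rather than silently converting the name to None keeps the behaviour explicit, which is the trade-off described above.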

jreback added the Error Reporting (Incorrect or improved errors from pandas) and MultiIndex labels Dec 29, 2017
jreback added this to the Next Major Release milestone Dec 29, 2017
jreback (Contributor) commented Dec 29, 2017

We could/should also do this for Index/Series names as well, though it may break things.

toobaz (Member, Author) commented Dec 29, 2017

> I think we only allow an actual None as a level name (IOW it's not set), so if you find an isna name that is not None, I would raise on construction. I know that we could simply convert an isna name to None, but better to be explicit, I think.

I would agree if I wasn't scared that some (weird) data becomes unloadable (just a generic fear, I don't have any example in mind).

> We should also disallow an isna value when accessing a required level name.

Not following you here: you mean e.g. as argument to get_level_values()? But once they are forbidden as level names, then this will just reliably raise a KeyError which I think is fine.

toobaz (Member, Author) commented Dec 29, 2017

> We could/should also do this for Index/Series names as well, though it may break things.

On this I'm skeptical, because if we have a DataFrame with np.nan in its columns, accessing the corresponding column results in a Series named np.nan... and I don't think we want to break this.
