MultiIndex.get_level_values() replaces NA by another value #5074

goyodiaz · 2013-10-01T21:30:23Z

Test case:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: index = pd.MultiIndex.from_arrays([
    ['a', 'b', 'b'],
    [1, np.nan, 2]
])

In [4]: index.get_level_values(1)
Out[4]: Float64Index([1.0, 2.0, 2.0], dtype=object)

The expected output is
Float64Index([1.0, nan, 2.0], dtype=object)

This happens because NA values are not stored in the MultiIndex levels and the corresponding label is set to -1. Then when labels are used as indexes to values in get_level_values() that -1 points to the last (not null) value.
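The mechanism described above can be reproduced with plain NumPy, without going through pandas internals. In this minimal sketch, `level_values` and `labels` are hypothetical stand-ins for the stored level and its integer labels; they are not taken from the pandas source.

```python
import numpy as np

# Hypothetical stand-ins for one MultiIndex level: the unique non-null
# values, and the integer labels that point into them.
level_values = np.array([1.0, 2.0])
labels = np.array([0, -1, 1])  # -1 marks the missing entry

# NumPy reads -1 as "last element", so the missing slot silently
# picks up 2.0 instead of NaN.
result = level_values.take(labels)
print(result)  # [1. 2. 2.]
```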

I tried to fix this by appending a NA to the values if -1 is in the labels.
https://github.com/goyodiaz/pandas/commit/f028513ad96a
It needs to be improved in order to return the proper NA value (NaN, None, maybe NaT?) depending on the index type. Does this approach make sense?
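A rough illustration of the append idea, assuming plain NumPy arrays rather than the actual code in the linked commit: once a NA is appended to the values, a label of -1 lands on that appended NA instead of on the last real value. The names `level_values` and `labels` are illustrative only.

```python
import numpy as np

level_values = np.array([1.0, 2.0])
labels = np.array([0, -1, 1])  # -1 marks the missing entry

values = level_values
if (labels == -1).any():
    # np.append returns a new array; the stored level is left untouched.
    values = np.append(level_values, np.nan)

# -1 now indexes the appended NaN rather than 2.0.
result = values.take(labels)
```

This only covers the float case; choosing None or NaT for other index types is the open question raised above.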

jreback · 2013-10-01T21:39:10Z

not a bad idea (this actually makes take deal with this correctly).

you can add a method _fill_value and have Index return np.nan and override in DatetimeIndex return NaT

need some more tests, edge cases, e.g. empty index, multiple nan (datetime w/NaT), -1 at the end, beginning (you have in the middle case)
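The `_fill_value` suggestion could look roughly like the sketch below. `ToyIndex` and `ToyDatetimeIndex` are hypothetical toy classes written for illustration; they are not the pandas classes, and `np.datetime64("NaT")` stands in for pandas' NaT.

```python
import numpy as np


class ToyIndex:
    """Toy base class (not pandas.Index): the default NA value is NaN."""

    def _fill_value(self):
        return np.nan


class ToyDatetimeIndex(ToyIndex):
    """Toy datetime subclass: overrides the NA value with NaT."""

    def _fill_value(self):
        return np.datetime64("NaT")


base_na = ToyIndex()._fill_value()
datetime_na = ToyDatetimeIndex()._fill_value()
```

With a hook like this, the code that fills in missing positions would not need to know which index type it is working with.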

jtratner · 2013-10-01T22:56:58Z

Better fix is to create a mask based on value == -1 and then fill with nan
afterwards. Otherwise you get weird behavior with setting levels and this
fix would complicate any future function that would remove extraneous
labels. Plus you'd always have to check for and remove null when outputting
levels. Might create a flag that checks for nan values so you use dtype
float instead. (or just check mask.any())

jreback · 2013-10-01T23:49:49Z

@goyodiaz something like (using @jtratner method)

mask = labels == -1
values = unique_values.take(labels)
values[mask] = np.nan

jtratner · 2013-10-01T23:54:33Z

@goyodiaz do you have Travis set up? Interested if your solution passes. I didn't realize you were saying to just do it temporarily. I still think it'll create issues if you wanted to shorten levels later on, but I'm less convinced than in my previous comment :)

@jreback does append create a copy of the underlying memory? I guess a copy happens in either case, but a bool mask is certainly smaller

slight tweak:

mask = labels == -1
values = unique_values.take(labels)
if mask.any():
    values = values.astype(float)
    values[mask] = np.nan
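For reference, the snippet above can be wrapped into a self-contained function. This is a sketch under assumptions: `take_with_na` is a hypothetical name, the inputs are plain NumPy arrays, and only the float/NaN case is handled, as in the snippet.

```python
import numpy as np


def take_with_na(unique_values, labels):
    """Look up labels in unique_values, turning -1 labels into NaN."""
    unique_values = np.asarray(unique_values)
    labels = np.asarray(labels, dtype=np.intp)

    mask = labels == -1
    values = unique_values.take(labels)  # -1 temporarily picks the last value
    if mask.any():
        # Cast to float only when needed so NaN can be stored.
        values = values.astype(float)
        values[mask] = np.nan
    return values


with_na = take_with_na([1, 2], [0, -1, 1])
without_na = take_with_na([1, 2], [0, 1, 1])
```

When no label is -1 the original dtype is kept, so an integer level stays integer.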

jreback · 2013-10-02T00:14:47Z

yes append copies

goyodiaz · 2013-10-02T16:15:54Z

@jtratner yes, travis builds passed.
https://travis-ci.org/goyodiaz/pandas

Can you think of any possible side effect which should be tested? I did not fully understand your concerns.

jreback · 2013-10-02T16:19:34Z

go ahead and open a pull-request, this will submit as a patch to the devs

goyodiaz · 2013-10-02T17:23:17Z

Will do it in a while.

BTW there is nothing to fix with DatetimeIndex:

In [1]: index = pd.MultiIndex.from_arrays([
    pd.DatetimeIndex([pd.NaT, 0, 1]),
    ['a', 'b', 'b']
])
In [2]: index.get_level_values(0)[0]
Out[2]: NaT

But I guess this could change when numpy gets proper integer nan support, if ever.

jtratner · 2013-10-02T21:17:01Z

@goyodiaz - it's totally fine, I appreciate you figuring out what was going
on! I was thinking about this from an internals perspective (i.e., space
and time complexity for masking vs. appending a value to the end of a list)
and also what I'm looking to do going forward.

That said, figuring out that the issue was that it was taking the wrong
value was very helpful - thanks for that!

jtratner · 2013-10-02T21:17:43Z

Behavior with NaT actually could be a bug, not sure.

goyodiaz · 2013-10-02T21:52:28Z

@jtratner I think you are right, it's a bug in factorize() if it is supposed to return unique values without missing values and NaT is to be treated as a missing value.

goyodiaz · 2013-10-02T21:59:44Z

I guess I should link the PR: #5090

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074)

goyodiaz mentioned this issue Oct 5, 2013

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074) #5090

Merged

jtratner closed this as completed in #5090 Oct 7, 2013

jtratner added a commit that referenced this issue Oct 7, 2013

Merge pull request #5090 from goyodiaz/multiindex-nan

504e69b

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074)

PKEuS mentioned this issue May 15, 2014

ENH: merge multi-index with a multi-index #6360

Closed
