BUG: MultiIndex.get_level_values() replaces NA by another value (#5074) #5090

goyodiaz · 2013-10-02T21:44:18Z

Do you really prefer using a mask? It seems to have a lower memory footprint but it requires some tricks for corner cases.

num = self._get_level_number(level)
unique_vals = self.levels[num]  # .values
labels = self.labels[num]
mask = labels == -1
if len(unique_vals) > 0:
    values = unique_vals.take(labels)
else:
    values = np.empty(len(labels))
    values[:] = np.nan
    values = pd.Index(values)
if mask.any():
    values = values.get_values()
    values[mask] = np.nan
    values = pd.Index(values)
values.name = self.names[num]
return values
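
For readers following along outside the pandas code base, the mask idea above can be sketched with plain NumPy arrays. This is only an illustration under stated assumptions: `level_values_with_mask` is a hypothetical helper, not a pandas function, and the explicit dtype handling stands in for the "tricks for corner cases" mentioned above.

```python
import numpy as np


def level_values_with_mask(level_values, labels):
    """Expand integer labels into values, writing NaN wherever a label is -1.

    Hypothetical stand-alone helper (not pandas API); it assumes every label
    is either -1 or a valid position in level_values.
    """
    level_values = np.asarray(level_values)
    labels = np.asarray(labels, dtype=np.intp)
    mask = labels == -1

    if len(level_values) == 0:
        # Corner case: nothing to take from, so every entry is missing.
        return np.full(len(labels), np.nan)

    # take() wraps -1 to the last element; those slots are overwritten below.
    values = level_values.take(labels)
    if mask.any():
        # Corner case: NaN cannot live in an integer or string array, so
        # widen the dtype by hand before writing it.
        if values.dtype.kind in "iub":
            values = values.astype(np.float64)
        elif values.dtype.kind != "f":
            values = values.astype(object)
        values[mask] = np.nan
    return values


ints = level_values_with_mask([10, 20], [0, -1, 1])
strs = level_values_with_mask(["a", "b"], [1, -1])
empty = level_values_with_mask([], [-1, -1])
```

The manual widening is the price of the lower memory footprint: nothing is copied until a missing label is actually present.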

jtratner · 2013-10-02T22:56:35Z

Yes, your original answer is simplest (especially because it lets numpy handle all the casting) - it's a nice combination of clever and clear compared to what I was thinking. :) There's a pathological case where this would be far less efficient (MI initialized with many many labels, then sliced to only have a subset) because the append creates a copy, but I think your solution is better because it's less error-prone. Thanks for taking the time to work through this example!
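
The original patch is not quoted in this thread, so the following is only a sketch of the append idea as described here, written against plain NumPy arrays rather than pandas internals: append one NA value to the level values so that the label -1 naturally indexes it.

```python
import numpy as np

# Assumed toy data: two integer level values and labels where -1 means "missing".
levels = np.array([10, 20])
labels = np.array([0, -1, 1, -1])

# np.append returns a new, longer array (this is the copy mentioned above)
# and lets NumPy choose a dtype that can hold both the integers and NaN.
padded = np.append(levels, np.nan)

# A label of -1 now selects the trailing NaN, so no mask is needed.
values = padded.take(labels)
```

The copy is made from the full set of level values, which is why an index sliced down to a small subset of a large MultiIndex is the costly case.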

Please make sure you're testing this with labels that are string, date-like, integer and heterogeneous, as well as levels > 0.

@jreback should this fill with NaT if it's datelike?

jtratner · 2013-10-02T22:59:06Z

Separate question for those who might know (@jreback @y-p et al.): should MI verify that levels don't include nan, and try to consolidate downwards if they do, or is it fine to just not care about that?

jreback · 2013-10-02T23:00:00Z

yep date like should be NaT

NaT is prob not considered missing because it's just using a view of i8, so it is preserved - so it 'works' correctly

jtratner · 2013-10-02T23:07:24Z

The problem is that, with the append method, it ends up forcing the dtype to object if you append nan, so you need to check. Does that already exist? Maybe in common there's something like is_datelike that wraps up what you need to check for that?

goyodiaz · 2013-10-03T07:44:26Z

Given the current behavior of factorize(), which does not treat NaT as a missing value, we are never going to append a nan where a NaT should go, so no need to care. But if that behavior is actually a bug then we have to take care of the index type and choose the appropriate NA value.

OTOH, it is worth testing get_level_values() with date-like levels and NaT, but I do not think it belongs in this PR since it currently works as expected.

goyodiaz · 2013-10-03T11:45:18Z

On second thought, using a private method returning the correct NA value for an Index (as proposed before by @jreback) makes sense anyway. This decouples get_level_values() from factorize(), which is good: we no longer have to care about which values factorize() removes and which it keeps, and we are ready to reconstruct the NA values whenever they have been removed. It also provides a rationale for testing NaT here.

jtratner · 2013-10-03T12:26:16Z

@jreback would take_1d in common be useful for this?

jreback · 2013-10-03T12:36:34Z

yep

com.take_nd(arr, indexer, axis=0, fill_value=fill_value)

you can set fill_value from a method (which DatetimeIndex can override), something like _fill_value

def _fill_value(self):
    return np.nan

in DatetimeIndex

def _fill_value(self):
    return pd.NaT
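
To show what a fill-aware take does with a type-specific fill value, here is a minimal NumPy sketch. It does not call the internal `com.take_nd`; `take_with_fill` is a hypothetical stand-in, and it assumes a non-empty array and a fill value that the array's dtype can already hold.

```python
import numpy as np


def take_with_fill(arr, indexer, fill_value):
    """Take arr[indexer], writing fill_value wherever the indexer is -1.

    Hypothetical helper for illustration only; assumes arr is non-empty and
    fill_value is compatible with arr's dtype.
    """
    arr = np.asarray(arr)
    indexer = np.asarray(indexer, dtype=np.intp)
    mask = indexer == -1
    out = arr.take(indexer)  # -1 wraps to the last element for now
    if mask.any():
        out[mask] = fill_value
    return out


dates = np.array(["2013-10-02", "2013-10-03"], dtype="datetime64[ns]")

# With NaT as the fill value the result keeps its datetime64 dtype,
# which is the reason for letting each index type choose its own NA value.
filled = take_with_fill(dates, [1, -1, 0], np.datetime64("NaT"))
```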

goyodiaz · 2013-10-05T20:48:50Z

I used _na_value() instead of _fill_value() because fill_value is used with a different meaning (quite the opposite actually) in other contexts. I hope it's OK.

jreback · 2013-10-05T23:33:11Z

@jtratner this looks fine to me...any more comments?

jtratner · 2013-10-05T23:34:47Z

pandas/core/index.py

@@ -394,6 +394,10 @@ def values(self):
    def get_values(self):
        return self.values

+    def _na_value(self):


why can't this just be a property? - much simpler.

#: the expected NA value to use with this index
_na_value = np.nan

etc.
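
A minimal sketch of the class-attribute suggestion, assuming two toy classes: `BaseIndexSketch` and `DatetimeIndexSketch` are hypothetical stand-ins for the real index classes, and `np.datetime64("NaT")` stands in for `pd.NaT` so the example needs only NumPy.

```python
import numpy as np


class BaseIndexSketch:
    """Hypothetical stand-in for a generic index class."""

    #: the expected NA value to use with this index
    _na_value = np.nan


class DatetimeIndexSketch(BaseIndexSketch):
    """Hypothetical stand-in for a datetime-like index class."""

    # Overriding the attribute is enough; no method call is involved.
    _na_value = np.datetime64("NaT")


def na_value_for(index):
    # Callers read the attribute without knowing the concrete index type.
    return index._na_value


generic_na = na_value_for(BaseIndexSketch())
datetime_na = na_value_for(DatetimeIndexSketch())
```

A plain attribute keeps the override to one line per subclass and avoids the parentheses at every call site.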

jreback · 2013-10-07T13:17:11Z

@jtratner @cpcloud anything else?

cpcloud · 2013-10-07T14:25:34Z

@goyodiaz need a rebase

jreback · 2013-10-07T20:52:32Z

@jtratner go ahead and merge

goyodiaz mentioned this pull request Oct 2, 2013

MultiIndex.get_level_values() replaces NA by another value #5074

Closed

jtratner reviewed Oct 5, 2013

goyodiaz added 2 commits October 7, 2013 21:44

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074)

1b3f381

Make _na_value a class attribute.

573fee6

jtratner added a commit that referenced this pull request Oct 7, 2013

Merge pull request #5090 from goyodiaz/multiindex-nan

504e69b

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074)

jtratner merged commit 504e69b into pandas-dev:master Oct 7, 2013
