Skip to content

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074) #5090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 7, 2013
Merged

BUG: MultiIndex.get_level_values() replaces NA by another value (#5074) #5090

merged 2 commits into from
Oct 7, 2013

Conversation

goyodiaz
Copy link
Contributor

@goyodiaz goyodiaz commented Oct 2, 2013

closes #5074

Do you really prefer using a mask? It seems to have a lower memory footprint but it requires some tricks for corner cases.

num = self._get_level_number(level)
unique_vals = self.levels[num]  # .values
labels = self.labels[num]
mask = labels == -1
if len(unique_vals) > 0:
    values = unique_vals.take(labels)
else:
    values = np.empty(len(labels))
    values[:] = np.nan
    values = pd.Index(values)
if mask.any():
    values = values.get_values()
    values[mask] = np.nan
    values = pd.Index(values)
values.name = self.names[num]
return values

@jtratner
Copy link
Contributor

jtratner commented Oct 2, 2013

Yes, your original answer is simplest (especially because it lets numpy handle all the casting) - it's a nice combination of clever and clear compared to what I was thinking. :) There's a pathological case where this would be far less efficient (MI initialized with many many labels, then sliced to only have a subset) because the append creates a copy, but I think your solution is better because it's less error-prone. Thanks for taking the time to work through this example!

Please make sure you're testing this with labels that are string, date-like, integer and heterogeneous, as well as levels > 0.

@jreback should this fill with NaT if it's datelike?

@jtratner
Copy link
Contributor

jtratner commented Oct 2, 2013

Separate question at those who might know (@jreback @y-p et al), should MI verify that levels don't include nan and then try to consolidate downwards if they do or is it fine to just not care about that?

@jreback
Copy link
Contributor

jreback commented Oct 2, 2013

yep date like should be NaT

NaT is prob not missing considered missing because it's just using a view of i8 so it is preserved - so it 'works' correctly

@jtratner
Copy link
Contributor

jtratner commented Oct 2, 2013

The problem is that, with the append method, it ends up forcing to dtype object if you append nan, so you need to check. Does that already exist? maybe in common there's something like is_datelike that wraps up what you need to check for that?

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 3, 2013

Given the current behavoir of factorize(), which does not treat NaT as a missing value, we are never going to append a nan where a NaT shoud go, so no need to care. But if that behavoir is actually a bug then we have to take care of the index type and choose the appropiate NA value.

OTOH, it is worth testing get_level_values() with date-like levels and NaT, but I do not think it belongs to this PR since it is working as expected currently.

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 3, 2013

On second thought, using a private method returning the correct NA value for an Index (as proposed before by @jreback) makes sense anyway. This decouples get_level_values() from factorize() which is good and then we do not have to care about which values are removed by factorize() and which are not, instead we are ready to reconstruct the NA values whenever they have been removed. It also provides a rationale for testing NaT here.

@jtratner
Copy link
Contributor

jtratner commented Oct 3, 2013

@jreback would take_1d in common be useful for this?

@jreback
Copy link
Contributor

jreback commented Oct 3, 2013

yep

com.take_nd(arr, indexer, axis=0, fill_value=fill_value)

you can set fill_value from a method (which DatetimeIndex) can override, something like _fill_value

def _fill_value(self):
   return np.nan

in DatetimeIndex

def _fill_value(self):
     return pd.NaT

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 5, 2013

I used _na_value() instead of _fill_value() because fill_value is used with a different meaning (quite the opposite actually) in other contexts. I hope it's OK.

@jreback
Copy link
Contributor

jreback commented Oct 5, 2013

@jtratner this looks fine to me...any more comments?

@@ -394,6 +394,10 @@ def values(self):
def get_values(self):
return self.values

def _na_value(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't this just be a property? - much simpler.

#: the expected NA value to use with this index
_na_value = np.nan

etc.

@jreback
Copy link
Contributor

jreback commented Oct 7, 2013

@jtratner @cpcloud anything else?

@cpcloud
Copy link
Member

cpcloud commented Oct 7, 2013

@goyodiaz need a rebase

@jreback
Copy link
Contributor

jreback commented Oct 7, 2013

@jtratner go ahead and merge

jtratner added a commit that referenced this pull request Oct 7, 2013
BUG: MultiIndex.get_level_values() replaces NA by another value (#5074)
@jtratner jtratner merged commit 504e69b into pandas-dev:master Oct 7, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiIndex.get_level_values() replaces NA by another value
4 participants