Skip to content

MultiIndex.get_level_values() replaces NA by another value #5074

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
goyodiaz opened this issue Oct 1, 2013 · 12 comments · Fixed by #5090
Closed

MultiIndex.get_level_values() replaces NA by another value #5074

goyodiaz opened this issue Oct 1, 2013 · 12 comments · Fixed by #5090
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@goyodiaz
Copy link
Contributor

goyodiaz commented Oct 1, 2013

Test case:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: index = pd.MultiIndex.from_arrays([
    ['a', 'b', 'b'],
    [1, np.nan, 2]
])

In [4]: index.get_level_values(1)
Out[4]: Float64Index([1.0, 2.0, 2.0], dtype=object)

The expected output is
Float64Index([1.0, nan, 2.0], dtype=object)

This happens because NA values are not stored in the MultiIndex levels and the corresponding label is set to -1. Then when labels are used as indexes to values in get_level_values() that -1 points to the last (not null) value.

I tried to fix this by appending a NA to the values if -1 is in levels.
https://github.com/goyodiaz/pandas/commit/f028513ad96a
It needs to be improved in order to return the proper NA value (NaN, None, maybe NaT?) depending on the index type. Does this approach makes sense?

@jreback
Copy link
Contributor

jreback commented Oct 1, 2013

not a bad idea (this actually makes take deal with this correctly).

you can add a method _fill_value and have Index return np.nan and override in DatetimeIndex return NaT

need some more tests, edge cases, e.g. empty index, multiple nan (datetime w/NaT), -1 at the end, beginning (you have in the middle case)

@jtratner
Copy link
Contributor

jtratner commented Oct 1, 2013

Better fix is to create a mask based on value == -1 and then fill with nan
afterwards. Otherwise you get weird behavior with setting levels and this
fix would complicate any future function that would remove extraneous
labels. Plus you'd always have to check for and remove null when outputting
levels. Might create a flag that checks for nan values so you use dtype
float instead. (or just check mask.any())

@jreback
Copy link
Contributor

jreback commented Oct 1, 2013

@goyodiaz something like (using @jtratner method)

mask = labels == -1
values = unique_values.take(labels)
values[mask] = np.nan

@jtratner
Copy link
Contributor

jtratner commented Oct 1, 2013

@goyodiaz do you have Travis set up? Interested if your solution passes. I didn't realize you were saying to just do it temporarily. I still think it'll create issues if you wanted to shorten levels later on, but I'm less convinced than in my previous comment :)

@jreback does append create a copy of the underlying memory? I guess it's happening in either case, but bool certainly smaller

slight tweak:

mask = labels == -1
values = unique_values.take(labels)
if mask.any():
    values = values.astype(float)
    values[mask] = np.nan

@jreback
Copy link
Contributor

jreback commented Oct 2, 2013

yes append copies

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 2, 2013

@jtratner yes, travis builds passed.
https://travis-ci.org/goyodiaz/pandas

Can you think of any possible side effect which should be tested? I did not understand well your concerns.

@jreback
Copy link
Contributor

jreback commented Oct 2, 2013

go ahead an open a pull-request, this will submit as a patch to the devs

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 2, 2013

Will do it in a while.

BTW there is nothing to fix with DatetimeIndex:

In [1]: index = pd.MultiIndex.from_arrays([
    pd.DatetimeIndex([pd.NaT, 0, 1]),
    ['a', 'b', 'b']
])
In [2]: index.get_level_values(0)[0]
Out[2]: NaT

But I guess this could change when numpy get proper integer nan support, if ever.

@jtratner
Copy link
Contributor

jtratner commented Oct 2, 2013

@goyodiaz - it's totally fine, I appreciate you figuring out what was going
on! I was thinking about this from an internals perspective (i.e., space
and time complexity for masking vs. appending a value to the end of a list)
and also what I'm looking to do going forward.

That said, figuring out that the issue was that it was taking the wrong
value was very helpful - thanks for that!

@jtratner
Copy link
Contributor

jtratner commented Oct 2, 2013

Behavior with NaT actually could be a bug, not sure.

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 2, 2013

@jratner I think you are right, it's a bug in factorize() if it is supposed to return unique values without missing values and NaT is to be treated as a missing value.

@goyodiaz
Copy link
Contributor Author

goyodiaz commented Oct 2, 2013

I guess I should link the PR: #5090

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants