NA is not included in MultiIndex.levels if we construct MI with nan #30750

Open
charlesdong1991 opened this issue Jan 6, 2020 · 9 comments
Labels
Bug, Missing-data, MultiIndex, Needs Discussion

Comments

@charlesdong1991
Member

charlesdong1991 commented Jan 6, 2020

If we construct a MultiIndex from data containing nan and then inspect its levels, the output does not contain nan:

>>> import numpy as np
>>> import pandas as pd
>>> tuples = [["A", "B"], ["A", np.nan], ["B", "A"]]
>>> mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
>>> mi.levels
FrozenList([['A', 'B'], ['A', 'B']])

Tracking it down, this happens because pd.Categorical does not include NA in its categories:

>>> pd.Categorical(['a', 'b', None])
[a, b, NaN]
Categories (2, object): [a, b]
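
A minimal sketch of where the missing value goes, assuming a reasonably recent pandas. The variable names are illustrative only. The missing entry is not stored as a category; it is recorded in the integer codes with the sentinel -1:

```python
import pandas as pd

cat = pd.Categorical(["a", "b", None])

# The categories hold only the non-missing values.
categories = list(cat.categories)

# The codes index into the categories; -1 marks a missing entry.
codes = cat.codes.tolist()

# isna() recovers the missing positions from the codes.
missing_mask = cat.isna().tolist()

print(categories)    # ['a', 'b']
print(codes)         # [0, 1, -1]
print(missing_mask)  # [False, False, True]
```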

However, inferred_type does indicate a mixed type, so np.nan should be accepted:

>>> tuples = [["A", "B"], ["A", np.nan], ["B", "A"]]
>>> mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
>>> mi.inferred_type
'mixed'

However, if the nan is produced by an operation, it is included in the levels, e.g.

>>> l = [["a", np.nan, 12, 12], [None, "a", 12.3, 33.], ["b", np.nan, 12.3, 123], ["a", "a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c", "d"])
>>> grouped = df.groupby(by=["a", "b"], dropna=False).sum()
>>> grouped.index.levels
FrozenList([['a', 'b', nan], ['a', nan]])

This is quite inconsistent. Is it intended behaviour?

@ghost

ghost commented Jan 6, 2020

This caused me a headache in the past: when merging two dataframes on a MultiIndex, all the wrong rows seemed to match up, even though the MultiIndexes in each dataframe had the same values and the same inferred type. The root cause seemed to be that the nans were stored differently in the levels, as described above.

@TomAugspurger
Contributor

cc @topper-123 if you have thoughts here.

@jorisvandenbossche
Member

Related discussion: #29111

@TomAugspurger
Contributor

Seems like this is a duplicate of #29111? Or are there differences?

@charlesdong1991
Member Author

charlesdong1991 commented Jan 7, 2020

@TomAugspurger Slightly different, I think. In the example in #29111, the MI is defined through levels and codes, so we can actually see nan in the levels, but the information about the type of nan is lost. Here, the MI is defined through from_tuples, and nan does not appear in the levels at all. Either way, it is an inconsistency in levels.

Feel free to close if you think it is duplicate.

@gfyoung added the Missing-data and MultiIndex labels Jan 7, 2020
@topper-123
Contributor

nans not being in the levels/categories is a design choice in pandas and goes back to how Categoricals are implemented, as you mention: the levels are lists of all non-nan values. However, the nan is still encoded in codes (as -1) and will be reconstructed when converted to a different dtype. E.g. .to_frame will show the nan in the dataframe column:

>>> mi.to_frame()
       a    b
a b
A B    A    B
  NaN  A  NaN
B A    B    A
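
The same point can be checked directly on the index. This is a minimal sketch, assuming the `mi` from the original report and a reasonably recent pandas. The levels omit nan, the codes carry -1, and get_level_values reconstructs the missing value:

```python
import numpy as np
import pandas as pd

tuples = [("A", "B"), ("A", np.nan), ("B", "A")]
mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))

# Level "b" lists only the non-missing values.
level_b = list(mi.levels[1])

# The codes for level "b" use -1 for the missing entry.
codes_b = mi.codes[1].tolist()

# Reconstructing the level values brings the nan back.
values_b = mi.get_level_values("b")
has_missing = bool(values_b.isna().any())

print(level_b)      # ['A', 'B']
print(codes_b)      # [1, -1, 0]
print(has_missing)  # True
```

So a check like `(mi.codes[1] == -1).any()` or `mi.get_level_values("b").isna().any()` detects missing values, while inspecting `mi.levels` alone does not.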

So the Categorical probably could have been implemented differently (e.g. with nan at index 0 of the categories/levels). That would also have the benefit that Categoricals could be based on UInt64Index instead of Int64Index, which would have the additional benefit of doubling the number of unique values in a Categorical (256 for 8 bit instead of 128, etc.). OTOH, changing that now might break backwards compatibility in a big way, which won't fly.
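
The 8-bit arithmetic can be illustrated with NumPy alone. This is only a sketch of the integer ranges involved. It does not show how pandas actually picks a codes dtype, and its thresholds may differ from the raw limits below:

```python
import numpy as np

# Signed 8-bit codes: -1 is the missing-value sentinel, so only the
# non-negative values 0..127 can point at categories.
signed_slots = int(np.iinfo(np.int8).max) + 1

# Unsigned 8-bit codes: every value 0..255 is usable. In the layout
# discussed above, slot 0 would be taken by nan.
unsigned_slots = int(np.iinfo(np.uint8).max) + 1

print(signed_slots, unsigned_slots)  # 128 256
```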

I'd welcome thoughts on whether this can be implemented, but it would need to cause no (or maybe minimal) breakage to be accepted.

@charlesdong1991
Member Author

charlesdong1991 commented Jan 8, 2020

Thanks for your reply, @topper-123!
Hmm, in your opinion, would adding dropna to Categorical and MI be a potential bridge to achieve this without big breakage?

@topper-123
Contributor

Not sure I understand; dropna is already implemented in both index types?

But I doubt it's possible; e.g. just going from Int64Index to UInt64Index would likely be a very large change in itself, and adding nan to categories is also a huge change.

@charlesdong1991
Member Author

Yeah, sorry for the poor wording. I meant providing an option to include nan in Categorical and MI levels. And indeed, we already have dropna in both.

@topper-123 But you are right, a lot would have to change to make that happen.
