NA is not included in MultiIndex.levels if we construct MI with nan #30750

charlesdong1991 · 2020-01-06T18:30:30Z

If we construct a MultiIndex with nan and check the levels, the output does not contain nan:

>>> tuples = [["A", "B"], ["A", np.nan], ["B", "A"]]
>>> mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
>>> mi.levels
FrozenList([['A', 'B'], ['A', 'B']])

Tracking it down, this happens because pd.Categorical does not include NA in its categories:

>>> pd.Categorical(['a', 'b', None])
[a, b, NaN]
Categories (2, object): [a, b]

However, inferred_type does indicate a mixed type, so np.nan should be accepted:

>>> tuples = [["A", "B"], ["A", np.nan], ["B", "A"]]
>>> mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
>>> mi.inferred_type
'mixed'

However, if the nan is produced by an operation, it is included in the levels, e.g.

>>> l = [["a", np.nan, 12, 12], [None, "a", 12.3, 33.], ["b", np.nan, 12.3, 123], ["a", "a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c", "d"])
>>> grouped = df.groupby(by=["a", "b"], dropna=False).sum()
>>> grouped.index.levels
FrozenList([['a', 'b', nan], ['a', nan]])

This is quite inconsistent. Is it intended behaviour?
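A check that does not depend on how the levels store missing values is to look at the level values instead of `levels`. The following is a minimal sketch, assuming a pandas version that supports `dropna=False` in `groupby`; the helper name `level_has_na` is hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

# MultiIndex built directly from tuples, as in the first example.
mi = pd.MultiIndex.from_tuples(
    [("A", "B"), ("A", np.nan), ("B", "A")], names=["a", "b"]
)

# MultiIndex produced by an operation, as in the groupby example.
rows = [
    ["a", np.nan, 12, 12],
    [None, "a", 12.3, 33.0],
    ["b", np.nan, 12.3, 123],
    ["a", "a", 1, 1],
]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d"])
grouped = df.groupby(by=["a", "b"], dropna=False).sum()


def level_has_na(index, name):
    """Hypothetical helper: True if any value of the named level is missing."""
    return bool(index.get_level_values(name).isna().any())


print(level_has_na(mi, "b"))
print(level_has_na(grouped.index, "b"))
```

`get_level_values` rebuilds the full values of a level from its codes, so a missing entry shows up there whether or not nan is stored in `levels`.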


ghost · 2020-01-06T21:00:49Z

This caused me a headache in the past: when merging two dataframes on a multiindex, all the wrong rows seemed to match up, even when the multiindexes in each dataframe had the same values and the same inferred type. The root cause seemed to be that the nans were stored differently in the levels, as described above.

TomAugspurger · 2020-01-06T21:25:03Z

cc @topper-123 if you have thoughts here.

jorisvandenbossche · 2020-01-07T07:42:25Z

Related discussion: #29111

TomAugspurger · 2020-01-07T12:07:49Z

Seems like this is a duplicate of #29111? Or are there differences?

charlesdong1991 · 2020-01-07T12:13:05Z

@TomAugspurger slightly different, I think. In the example in #29111, the MI is defined through levels and codes, so we can actually see nan in the levels but lose the information about the type of nan. Here the MI is defined through from_tuples, and we do not see nan in the levels at all. Either way, it is an inconsistency in levels in some sense.

Feel free to close if you think it is a duplicate.

topper-123 · 2020-01-08T17:36:26Z

nans not being in the levels/categories is a design choice in pandas and goes back to how Categoricals are implemented, as you mention: the levels are lists of all non-nan values. However, the nan is still encoded in the codes (as -1) and will be reconstructed when converted to a different dtype. E.g. .to_frame will show the nan in the dataframe column:

>>> mi.to_frame()
       a    b
a b
A B    A    B
  NaN  A  NaN
B A    B    A
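The -1 encoding can be inspected directly. A minimal sketch, reusing the `from_tuples` example from the top of the issue (the variable names `levels`, `codes` and `cat_codes` are only for illustration):

```python
import numpy as np
import pandas as pd

tuples = [("A", "B"), ("A", np.nan), ("B", "A")]
mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))

# Levels hold only the non-missing values of each level.
levels = [list(level) for level in mi.levels]
# Codes point into the levels; a missing value is stored as -1.
codes = [code.tolist() for code in mi.codes]

print(levels)  # [['A', 'B'], ['A', 'B']]
print(codes)   # [[0, 0, 1], [1, -1, 0]]

# The same convention applies to Categorical codes.
cat_codes = pd.Categorical(["a", "b", None]).codes.tolist()
print(cat_codes)  # [0, 1, -1]
```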

So the Categorical probably could have been implemented differently (e.g. with the nan being the 0 index of the categories/levels). That would also have the benefit that Categoricals could be based on UInt64Index instead of Int64Index, which would have the additional benefit that we could double the number of unique values in a Categorical (256 for 8 bit instead of 128, etc.). OTOH, changing that now might break backwards compatibility in a big way, which won't fly.

I'd welcome thoughts on whether this can be implemented, but no (or maybe minimal) breakage would be needed for it to be accepted.
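The 128 versus 256 figures follow from the integer ranges alone. A small sketch of the arithmetic using NumPy's `iinfo` (this only counts available code values per dtype; it makes no claim about where pandas actually switches code dtypes):

```python
import numpy as np

# Signed 8-bit codes: -1 is the missing-value marker, so only the
# non-negative values 0..127 can point at a category.
signed_slots = int(np.iinfo(np.int8).max) + 1

# Unsigned 8-bit codes: every value 0..255 is usable, though one of
# them would then have to stand for the missing value.
unsigned_slots = int(np.iinfo(np.uint8).max) + 1

print(signed_slots, unsigned_slots)  # 128 256
```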

charlesdong1991 · 2020-01-08T20:00:03Z

Thanks for your reply, @topper-123!
Would adding dropna to Categorical and MI be a potential bridge to achieve this without big breakage, in your opinion?

topper-123 · 2020-01-08T20:25:09Z

Not sure I understand; dropna is already implemented in both index types?

But I doubt it's possible; e.g. just going from Int64Index to UInt64Index would likely be a very large change in itself, and adding nan to the categories is also a huge change.

charlesdong1991 · 2020-01-08T20:52:02Z

Yeah, sorry for the bad wording. I meant providing an option to include nan in Categorical categories and MI levels. And indeed, we already have dropna in both.

@topper-123 but you are right, a lot would have to change to make that happen.

gfyoung added the Missing-data and MultiIndex labels · Jan 7, 2020

charlesdong1991 mentioned this issue · Feb 10, 2020: ENH: Add dropna in groupby to allow NaN in keys #30584 (Merged)

mroeschke added the Bug and Needs Discussion labels · Jul 25, 2021

alecchap mentioned this issue · Sep 9, 2021: copy drops rows when Strand is nan pyranges/pyranges#214 (Open)

RobinFiveWords mentioned this issue · Jul 20, 2022: BUG: concat gives incorrect result when MultiIndex values are all NA #47802 (Closed)

coroa mentioned this issue · Jun 25, 2023: isna does not work with explicit MultiIndex nan-representation coroa/pandas-indexing#25 (Open)
