BUG: has_duplicates misbehaves when multiindex has a NaN #5873

felixlawrence · 2014-01-08T06:33:14Z

When (at least) one element in a MultiIndex contains a NaN, has_duplicates starts to behave strangely:

>>> idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
>>> idx
MultiIndex
[(101, 3.5), (102, nan)]
>>> idx.has_duplicates
True
>>> idx.get_duplicates()
[]

I would expect has_duplicates to return False here, because 102 is not the same as 101.

I would also expect it to return False for the MultiIndex

MultiIndex
[(101, 3.5), (101, nan)]

since 3.5 != NaN, but this case is more debatable.

This is important because you can't call .unstack() on a series with a MultiIndex for which has_duplicates is True, even if the MultiIndex is of high dimension and the dimensions containing the NaN(s) are not involved in the operation.

This is with pandas 0.12.0
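
For readers who want to check an index independently of has_duplicates, here is a minimal sketch. It assumes a recent pandas release where MultiIndex.to_frame(index=False) is available, and the helper name has_duplicate_rows is made up for illustration. It treats each index entry as a row and asks DataFrame.duplicated whether any row repeats. Note that duplicated compares NaN as equal to NaN, so this is one possible definition of "duplicate", not necessarily the one the reporter has in mind for the debatable case.

```python
import numpy as np
import pandas as pd


def has_duplicate_rows(index):
    """Hypothetical helper: True if any full index entry occurs more than once."""
    # Turn the index entries into ordinary DataFrame rows, one column per level.
    rows = index.to_frame(index=False)
    # duplicated() flags every row that repeats an earlier row.
    return bool(rows.duplicated().any())


# The two cases from the report, plus a genuine duplicate for contrast.
idx_a = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
idx_b = pd.MultiIndex.from_arrays([[101, 101], [3.5, np.nan]])
idx_c = pd.MultiIndex.from_arrays([[101, 101], [3.5, 3.5]])

for idx in (idx_a, idx_b, idx_c):
    print(list(idx), has_duplicate_rows(idx))
```

Under this definition neither index from the report counts as having duplicates, while the third one does.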

jreback · 2014-01-08T14:05:32Z

hmm...I'll call this a bug. FYI multi-index (or index) with nan is a very tricky issue in and of itself.

can you give a use case where this appears (e.g. using unstack)?

felixlawrence · 2014-01-09T00:39:29Z

Here's a simple example where unstack fails due to this:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
s = pd.Series([1,2], index=idx)
s.unstack()

I appreciate that nans are not the best thing to index by! Sadly, the data I've been given includes nans as an index (presumably in an attempt to indicate that that dimension of the index is not relevant to the particular sample).

Here's a slightly more complicated example that is closer to my actual data:

idx = pd.MultiIndex.from_arrays([['cat', 'cat', 'cat', 'dog', 'dog'],
                                 ['a', 'a', 'b', 'a', 'b'],
                                 [1, 2, 1, 1, np.nan]])
s = pd.Series(1.3, index=idx)  # scalar is broadcast to all five index entries
print(s)
print(s.unstack(level=0))

N.B. this particular example raises a different error:

ValueError: cannot convert float NaN to integer

which might be a separate issue.
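
One way to sidestep both errors in user code is to replace the NaN in the index with a sentinel before reshaping. The following is only a sketch of that workaround, not a fix for the bug itself. The level names (animal, letter, num), the series name value, and the sentinel -1 are assumptions chosen for illustration; pick a sentinel that cannot collide with real values in your data.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["cat", "cat", "cat", "dog", "dog"],
     ["a", "a", "b", "a", "b"],
     [1, 2, 1, 1, np.nan]],
    names=["animal", "letter", "num"],  # hypothetical level names
)
s = pd.Series(1.3, index=idx, name="value")

SENTINEL = -1  # assumed never to occur as a real value of "num"

# Move the index levels into ordinary columns so fillna can be applied.
flat = s.reset_index()
flat["num"] = flat["num"].fillna(SENTINEL)

# Rebuild the MultiIndex, now free of NaN, and reshape as before.
clean = flat.set_index(["animal", "letter", "num"])["value"]
wide = clean.unstack(level="animal")
print(wide)
```

The result has one column per animal, and the row that used to carry the NaN is labelled with the sentinel instead.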

felixlawrence · 2014-01-09T05:08:07Z

If (multi)index with nan is officially discouraged and causes trouble here and elsewhere (core.reshape._Unstacker also seems to struggle with nan indexes), perhaps pandas should print a warning message when an index is created with nan values?

jreback · 2014-01-09T13:15:53Z

see #5286

you should just avoid nan in mi's in general; I think a warning (controllable by an option) is a good idea. There are some very non-trivial issues here.
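
Until something like that exists in pandas, the check can be approximated in user code. This is a rough sketch, not a pandas feature: nan_levels and warn_if_nan are hypothetical helper names, and the sketch assumes a pandas version where Index.isna is available.

```python
import warnings

import numpy as np
import pandas as pd


def nan_levels(index):
    """Hypothetical helper: positions of the index levels that contain NaN."""
    return [
        i for i in range(index.nlevels)
        if index.get_level_values(i).isna().any()
    ]


def warn_if_nan(index):
    """Hypothetical helper: warn when any level holds NaN, then return the index."""
    bad = nan_levels(index)
    if bad:
        warnings.warn("index contains NaN in level(s) %s" % bad, UserWarning)
    return index


idx = warn_if_nan(pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]]))
print(nan_levels(idx))
```

Because it goes through the standard warnings module, the message can be silenced or turned into an error with the usual warnings filters, which loosely mirrors the "controllable by an option" idea.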

jreback · 2015-01-02T15:41:39Z

closed by #9169

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

behzadnouri mentioned this issue Dec 30, 2014

TST: tests for GH5873 #9169

Closed

jreback closed this as completed Jan 2, 2015
