BUG: has_duplicates misbehaves when multiindex has a NaN #5873

Closed
felixlawrence opened this issue Jan 8, 2014 · 5 comments
Labels: Bug, Dtype Conversions (unexpected or buggy dtype conversions), MultiIndex
Comments

@felixlawrence

When (at least) one element in a MultiIndex contains a NaN, has_duplicates starts to behave strangely:

>>> idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
>>> idx
MultiIndex
[(101, 3.5), (102, nan)]
>>> idx.has_duplicates
True
>>> idx.get_duplicates()
[]

I would expect has_duplicates to return False here, because 102 is not the same as 101.

I would also expect it to return False for the MultiIndex

MultiIndex
[(101, 3.5), (101, nan)]

since 3.5 != NaN, but this case is more debatable.
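To make the two expectations concrete, here is a minimal pure-Python sketch of duplicate detection on index tuples. It is not how pandas implements `has_duplicates`; the helper names `_normalize` and `duplicated_keys` are hypothetical, and the choice to treat NaN as equal to NaN is an assumption made only for illustration.

```python
import math

# Assumed marker for NaN; it is taken not to occur as a real index label.
NAN_MARKER = "<NaN>"


def _normalize(value):
    # Map every float NaN to one shared marker so that NaN keys compare equal.
    if isinstance(value, float) and math.isnan(value):
        return NAN_MARKER
    return value


def duplicated_keys(tuples):
    """Return the keys that occur more than once, treating NaN as equal to NaN."""
    seen = set()
    dupes = []
    for key in tuples:
        norm = tuple(_normalize(v) for v in key)
        if norm in seen and norm not in dupes:
            dupes.append(norm)
        seen.add(norm)
    return dupes


nan = float("nan")
print(duplicated_keys([(101, 3.5), (102, nan)]))  # []
print(duplicated_keys([(101, 3.5), (101, nan)]))  # []
print(duplicated_keys([(101, nan), (101, nan)]))  # [(101, '<NaN>')]
```

Under this reading, neither index from the report contains a duplicate; only two keys that agree on every level, NaN included, would count as one.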

This is important because you can't call .unstack() on a series with a MultiIndex for which has_duplicates is True, even if the MultiIndex is of high dimension and the dimensions containing the NaN(s) are not involved in the operation.

This is with pandas 0.12.0
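A consistency check along these lines can show whether a given pandas version is affected. It is only a sketch: it uses `duplicated()` rather than `get_duplicates()`, on the assumption that the pandas in use provides `MultiIndex.duplicated()` (recent releases do, and no longer provide `get_duplicates()`).

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])

# Keys flagged as repeats of an earlier key, selected with the boolean mask.
dupes = idx[idx.duplicated()]

# The flag and the list of duplicated keys should tell the same story.
consistent = idx.has_duplicates == (len(dupes) > 0)
print(idx.has_duplicates, list(dupes), consistent)
```

If `consistent` is False, the flag disagrees with the actual keys, which is the behaviour reported above.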

@jreback
Contributor

jreback commented Jan 8, 2014

hmm... I'll call this a bug. FYI, a multi-index (or index) with nan is a very tricky issue in and of itself.

can you give a use case where this appears (e.g. using unstack)?

@felixlawrence
Author

Here's a simple example where unstack fails due to this:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
s = pd.Series([1, 2], index=idx)
s.unstack()

I appreciate that nans are not the best thing to index by! Sadly, the data I've been given includes nans as an index (presumably in an attempt to indicate that that dimension of the index is not relevant to the particular sample).

Here's a slightly more complicated example that is closer to my actual data:

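import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([['cat', 'cat', 'cat', 'dog', 'dog'],
                                 ['a', 'a', 'b', 'a', 'b'],
                                 [1, 2, 1, 1, np.nan]])
# A scalar value is broadcast across all five index entries.
s = pd.Series(1.3, index=idx)
print(s)
print(s.unstack(level=0))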

N.B. this particular example raises a different error:

ValueError: cannot convert float NaN to integer

which might be a separate issue.
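For reference, that message is the one Python itself raises when a float NaN is converted to an integer, which suggests (as an assumption, not a diagnosis) that a NaN label is being cast to an integer somewhere along the way. The snippet below only reproduces the message and does not locate the cast inside pandas.

```python
# Converting a float NaN to a Python int raises a ValueError.
try:
    int(float("nan"))
except ValueError as exc:
    message = str(exc)

print(message)
```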

@felixlawrence
Author

If (multi)index with nan is officially discouraged and causes trouble here and elsewhere (core.reshape._Unstacker also seems to struggle with nan indexes), perhaps pandas should print a warning message when an index is created with nan values?
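Until something like that exists, a user-side check can approximate it. The sketch below is not a pandas feature: `warn_on_nan_levels` is a hypothetical helper, and the warning text and category are arbitrary choices.

```python
import warnings

import numpy as np
import pandas as pd


def warn_on_nan_levels(index):
    """Warn if any level of `index` contains NaN; return the affected level numbers."""
    nan_levels = [
        i
        for i in range(index.nlevels)
        if pd.isna(index.get_level_values(i)).any()
    ]
    if nan_levels:
        warnings.warn(
            "index contains NaN in level(s) %s" % nan_levels,
            UserWarning,
        )
    return nan_levels


idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
print(warn_on_nan_levels(idx))
```

Calling it right after building an index makes the NaN visible before a later `unstack()` fails.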

@jreback
Contributor

jreback commented Jan 9, 2014

see #5286

you should just avoid nan in MultiIndexes in general; I think a warning (controllable by an option) is a good idea. There are some very non-trivial issues here.
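One way to avoid NaN in the index is to replace it with a placeholder before reshaping. This is a rough sketch rather than a recommended pattern: the `SENTINEL` value is an assumption and must not collide with a real label in the data.

```python
import numpy as np
import pandas as pd

# Assumed placeholder meaning "this level does not apply to the sample".
SENTINEL = -1.0

idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
s = pd.Series([1, 2], index=idx)

# Rebuild the index level by level, replacing NaN with the placeholder.
clean_idx = pd.MultiIndex.from_arrays(
    [idx.get_level_values(i).fillna(SENTINEL) for i in range(idx.nlevels)],
    names=idx.names,
)

s_clean = s.copy()
s_clean.index = clean_idx

# With no NaN left in the index, unstack() moves the last level to columns.
wide = s_clean.unstack()
print(wide)
```

The placeholder can be mapped back to NaN in the resulting column labels afterwards if needed.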

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback
Contributor

jreback commented Jan 2, 2015

closed by #9169

@jreback jreback closed this as completed Jan 2, 2015