Skip to content

BUG: alignment of CategoricalIndex #13440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Jun 14, 2016

@jreback take this just as a wild guess of how to fix #13365 . If the approach makes sense, I will add tests, whatsnew and some more docs.

@codecov-io
Copy link

codecov-io commented Jun 14, 2016

Current coverage is 84.34%

Merging #13440 into master will increase coverage by <.01%

@@             master     #13440   diff @@
==========================================
  Files           138        138          
  Lines         51107      51124    +17   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43103      43119    +16   
- Misses         8004       8005     +1   
  Partials          0          0          

Powered by Codecov. Last updated by bf4786a...7f4a72a

@jreback
Copy link
Contributor

jreback commented Jun 14, 2016

you are adding a crazy amount of code here.

"a.categories" must already contain all categories that "values" uses.
"""

# No vectorized way to build the mapping:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.get_indexer

@toobaz
Copy link
Member Author

toobaz commented Jun 14, 2016

Thanks for the .get_indexer() suggestion, fixed. Apart from that, do you have in mind a simpler approach? What I could do would be to override the entire _join_non_unique(), duplicating some lines of code but removing the need for the two new methods currently used in the CategoricalIndex- specific parts.

@sinhrks sinhrks added Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels Jun 14, 2016
@toobaz
Copy link
Member Author

toobaz commented Jun 21, 2016

@jreback : what do you want to do?

  1. I complete the current approach with tests and docs
  2. I override the entire _join_non_unique() instead that adding two new methods
  3. you suggest another approach
  4. we close this

@jreback
Copy link
Contributor

jreback commented Jun 23, 2016

The approach is not bad, but the actual impl seems not very elegant. Further your impl of _array_putmask is not good.

@toobaz
Copy link
Member Author

toobaz commented Jun 23, 2016

Any constructive criticism? :-) My mood doesn't depend on them, but maybe this PR/issue does.

@jreback
Copy link
Contributor

jreback commented Jun 23, 2016

maybe @sinhrks has some

@sinhrks
Copy link
Member

sinhrks commented Jun 24, 2016

I may misunderstand, but the point is how pandas handle extended dtype? We should decide whether to make them more numpy compat, or use separated logic using if-else branch.

I think most of other impl takes latter (if-else) option currently.

@toobaz
Copy link
Member Author

toobaz commented Jun 24, 2016

@sinhrks : yes, the bug is due to a numpy method (putmask()) not working (and it's not a mere problem of type checks, there's really missing logic, which this PR introduces) on Categorical.

Certainly the cleanest option (for code organization, not that it would spare us from including the missing logic) would be that Categorical inherits from np.ndarray. But that's obviously a huge work... and might (I don't know enough to judge) even be plain impossible due to the in-memory structure of Categorical being different from a typical ndarray.

Then sure, we can solve this also with "if-else"s without overriding... I have hard time considering such approach more "elegant" than mine, but if it is the path of least resistance to closing the bug I'll update the PR.

@jreback
Copy link
Contributor

jreback commented Jun 24, 2016

@toobaz I wasn't looking for a general 'why are we doing things this way', that is obvious. The issue is you are jumping thru hoops and not using the existing Categorical machinery

"""
Must be overridden by Indexes which have non standard
self.values.take() keyword args (e.g. CategoricalIndex).
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really really odd way to do this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... as already said, if you want me I override the entire _join_non_unique I will, I don't see many other alternatives currently.

@toobaz
Copy link
Member Author

toobaz commented Jun 25, 2016

OK, now I'm (extending and) using union_categorical() (thanks for the tip - although I don't have clear what you mean by "violates the guarantee"), and overriding directly _join_non_unique() so that the change is much less invasive.

@toobaz
Copy link
Member Author

toobaz commented Jul 9, 2016

@jreback ping

Feel free to close, I won't be offended.

@jreback
Copy link
Contributor

jreback commented Jul 9, 2016

yep, don't really have time to look, but needs some refactoring. if you wan to submit again pls feel free.

@jreback jreback closed this Jul 9, 2016
@toobaz
Copy link
Member Author

toobaz commented Jul 9, 2016

Submit again?! Well... what?!

I guess the bug(s) will be fixed by somebody who understands how contributing to this project works. I tried in vain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alignment of CategoricalIndex
4 participants