ENH: Special type for qcut/cut to enable sorting. #5314
Comments
Another thing that would be good to add to Categorical is the ability to categorize an exogenous dataset using already defined categories (namely, I categorized some dataset and want to categorize another set based on those categories). There may be a way to do this already, but it wasn't obvious from looking at the Categorical code, and the categories were less than useful as strings.
@rockg do you mean with a quantitative or categorical/ordinal variable? If you mean quantitative then, yes, this will be possible. I'm making a Bin class that will have left and right bounds + left/right inclusiveness to wrap this up. Then it would just be a tweak of
@jtratner Yes, with the levels variable in Categorical (which it seems like you're making more useful). Your example is precisely what I'm thinking of (along with perhaps another argument to have the first and last bins go to -Inf and Inf, respectively, if that behavior is wanted).
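For readers following along, the reuse discussed above can be sketched as follows. This assumes a recent pandas release, where `pd.cut` accepts `retbins=True` and returns the numeric bin edges alongside the result; the `levels` attribute mentioned in the thread is not used here, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Bin a reference dataset and keep the numeric edges (retbins=True).
reference = np.array([1, 2, 3, 4, 5])
ref_cat, edges = pd.cut(reference, 3, retbins=True)

# Apply the same edges to a second dataset.
other = np.array([3, 4])
other_cat = pd.cut(other, bins=edges)

# Both results should now share one set of categories.
same_categories = ref_cat.categories.equals(other_cat.categories)
print(same_categories)
```

Passing the numeric edges rather than the string labels sidesteps the problem of categories being "less than useful as strings".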
That said, because of how the internals work, at least for the moment, it will error out if you pass a set of levels and not all values are contained within them. In other words, this will work:

    catquan = cut([1, 2, 3, 4, 5], 3)
    catquan2 = cut([3, 4], bins=catquan.levels)
    assert catquan.levels.equals(catquan2.levels) # True

This won't work:

    catquan = cut([1, 3, 4], 2)
    catquan2 = cut([1, 2, 3, 4, 5], bins=catquan.levels)
    TypeError: Found value(s) not contained within passed level set.
I guess it could be NaN instead though... And eventually it would handle inf.
Another option would be to have a modify-levels method that would return a new set of levels with -Inf and Inf, which could be passed in as bins. I guess it's a question of whether cut/qcut should handle it or not.
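A rough sketch of that "modify levels" idea, written against numeric bin edges rather than the `levels` object from the thread. The helper `open_outer_edges` is hypothetical (it is not a pandas function), and the example assumes a recent pandas where `pd.cut` accepts infinite bin edges and `retbins=True`.

```python
import numpy as np
import pandas as pd


def open_outer_edges(edges):
    """Hypothetical helper: copy `edges` with the outer bounds set to -inf/+inf."""
    opened = np.asarray(edges, dtype=float).copy()
    opened[0], opened[-1] = -np.inf, np.inf
    return opened


# Derive edges from one dataset ...
_, edges = pd.cut([1, 3, 4], 2, retbins=True)

# ... then bin a wider dataset with the outer bins opened up.
wide = pd.cut([1, 2, 3, 4, 5], bins=open_outer_edges(edges))
print(wide.codes)
```

Keeping this as a separate step leaves `cut`/`qcut` themselves unchanged, which is one of the two options raised in the comment above.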
Yep, there's a bunch of ways this could work. Let me get something working for this and then we can loop back to the issue of open intervals.
Not sure what I was thinking in my earlier comment - cut leaves values as NaN if they're not in range of the bins.
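A small check of that behavior, assuming a recent pandas version (the numeric edges come from `retbins=True`, which stands in for the `levels` object used earlier in the thread):

```python
import pandas as pd

# Edges derived from a narrow dataset spanning 1..4.
_, edges = pd.cut([1, 3, 4], 2, retbins=True)

# 5 lies beyond the last edge, so it is left as missing rather than raising.
result = pd.cut([1, 2, 3, 4, 5], bins=edges)
missing = result.isna()
print(list(missing))
```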
I recall discussing this before, but it came up on the mailing list again. Sorting of cut output is always wrong / ugly. I propose creating a str-like that compares with others of the same type as if they were tuples (possibly with nice logic around ranges) - this would be facilitated by the introduction of CategoricalBlock.
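To make the str-like idea concrete, here is a minimal sketch in plain Python. The class name `BinLabel`, its `bounds` attribute, and the `(left, right]` label format are all hypothetical and chosen for illustration; this is not how pandas implements anything, only one way the proposal could look.

```python
import operator


class BinLabel(str):
    """Hypothetical str-like bin label that orders by its numeric bounds."""

    def __new__(cls, left, right):
        obj = super().__new__(cls, f"({left}, {right}]")
        obj.bounds = (left, right)  # tuple used for ordering
        return obj

    def _compare(self, other, op):
        # Same type: compare as tuples. Otherwise: plain string comparison.
        if isinstance(other, BinLabel):
            return op(self.bounds, other.bounds)
        return op(str(self), other)

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __le__(self, other):
        return self._compare(other, operator.le)

    def __gt__(self, other):
        return self._compare(other, operator.gt)

    def __ge__(self, other):
        return self._compare(other, operator.ge)


labels = [BinLabel(10, 100), BinLabel(2, 10), BinLabel(0, 2)]
by_bounds = [str(label) for label in sorted(labels)]
by_text = sorted(str(label) for label in labels)
print(by_bounds)
print(by_text)
```

Sorting the `BinLabel` objects follows the numeric bounds, while sorting the same labels as plain strings places `(10, 100]` before `(2, 10]`, which is the "wrong / ugly" ordering the proposal wants to avoid. Equality and hashing are left as ordinary string behavior in this sketch.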
Similar to replacing qcut output with this: