WIP: categoricals as an internal CategoricalBlock GH5313 #7217
cc @immerrr
    if isinstance(other, compat.string_types):
        return other == self.name
    else:
        return other == self.base
cc @immerrr if you have any ideas on how to make this more 'real', lmk (e.g. this simply is not a sub-class of np.dtype, nor do I think you can actually make one).
Well, you can, but it's not trivial, see quaternion which is a "canonical" example of dtype subclass.
On second thought, quaternion is not a subclass but rather a new class, so you may be right about subclassing.
yeh...that looked complicated. and I really need the sub-class; what I have does work (it's almost more for display than anything)
There are a lot of bells and whistles in that example, that's true. A minimal viable new dtype can be found here.
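For readers without the diff open, the comparison under review at the top of this thread can be sketched in plain Python. This is only an illustration under stated assumptions: the class name `CategoryDtypeSketch` is hypothetical, `str` stands in for `compat.string_types`, and `object` is a placeholder for whatever `self.base` is in the PR.

```python
class CategoryDtypeSketch:
    """Hypothetical stand-in for a dtype-like object that is not an np.dtype subclass."""

    name = "category"
    base = object  # assumption: placeholder for the underlying storage dtype

    def __eq__(self, other):
        # Strings are compared against the dtype's name ...
        if isinstance(other, str):
            return other == self.name
        # ... anything else is compared against the base type.
        return other == self.base

    def __hash__(self):
        # Defining __eq__ removes the default hash, so provide one again.
        return hash(self.name)

    def __repr__(self):
        return self.name


dtype = CategoryDtypeSketch()
print(dtype == "category")  # True
print(dtype == "int64")     # False
print(dtype == object)      # True
```

The point of the string branch is that checks such as `dtype == "category"` work even though the object is not a real numpy dtype.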
@jseabold feel free to jump in here for commentary
@jreback See https://github.com/JanSchulz/pandas/tree/categorical_improvements
What doesn't work yet:
ok, will have a look. Why did you change labels -> _values_pointer? On the surface this seems to break back compat, is that true?
The problem with the "old" labels was that they were "pointers", and labels in R have a different meaning.
R: labels are other names for levels.
Pseudo code for future label usage, which is IMO compatible with R:
One question: how does one get to the underlying data structure from a Series?
That would be the most interesting feature from ggplot's side: we want to get the levels/labels from the dataframe which is passed to the plotting functions.
ok, so labels are descriptors of the levels (if provided). hmm, what if we call values_pointers: locs or index?
Series(Categorical(....)).values returns the Categorical itself
however we could define labels/levels as a property on Series that only works on categorical series (I think), or a method if u need to pass in arguments
I don't mind the names, I just think the "pointers" are an internal implementation detail and shouldn't show up in tab completion.
I'm off to bed, will be back tomorrow evening...
@JanSchulz do you have any idea where/what purpose of
ok I have a new version which incorporates your changes
what's still missing?
After a bit more poking at R's factors, I think labels there are kind of a waste of time: I think changing levels in R will do renaming/reducing/appending of levels. It seems labels are simply there to change the names of the levels in one go during creation of a factor, but that can also be done afterwards by changing the levels.

So I'm fine with the change back from pointers to labels, although I still find it strange: from my (German, so non-English) POV labels are "names for something", so I would actually have expected the other end to be the labels ("pointers point to labels").

In R, a categorical is always strings (so the first thing would be to convert values to strings), but I think we can skip that.

Anyway, I think the main part missing is an API which reorders/reduces/expands levels based on an input array/list, and renaming of levels:
I also think that operations based on categorical levels (groupby, etc.) should return groups for all levels and not only the ones which have values -> empty groups for levels without values.
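The behaviour asked for here can be illustrated without pandas. What follows is a plain-Python sketch of the idea, not the PR's implementation; the helper name `count_by_level` is made up for illustration, and rejecting unknown values with a `ValueError` is an assumption of this sketch.

```python
from collections import Counter


def count_by_level(values, levels):
    """Count occurrences per level, keeping levels that never occur."""
    unknown = set(values) - set(levels)
    if unknown:
        # Assumption of this sketch: values outside the level set are an error.
        raise ValueError(f"values not in levels: {unknown!r}")
    counts = Counter(values)
    # Iterate over the levels (not the observed values) so empty levels get 0.
    return {level: counts[level] for level in levels}


sizes = count_by_level(["a", "c", "a"], levels=["a", "b", "c"])
print(sizes)  # {'a': 2, 'b': 0, 'c': 1}
```

Grouping only by observed values would drop `'b'` entirely, which is exactly what makes faceting and reordered bar charts awkward.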
Note: I am not really following this, nor am I a user of categorical types at the moment. But seeing this PR and the discussion, and as this is a really big and important new feature and addition to the pandas 'language' (not just a new method/function), I was wondering if it would be better to first lay out the design (API, naming, ...) in a kind of 'design document' (something like a PEP for pandas)?

Just from a brief look, it seems there is still some discussion about the naming of things, how certain operations should work exactly, ... This is also something that should be discussed more broadly I think (with more people who would use this, send it out to the list), but it is now difficult to engage in the discussion with only the code to look at and the discussion in this PR.

Having an overview with the reasoning behind this new type, the naming, the API, how you create it, how to do common operations on it, how it behaves in other pandas methods, some examples of applications, (maybe a short comparison with R), ... could be beneficial for making this a more solid enhancement, but also to facilitate the discussion (and afterwards this can be used as a start for some documentation, so certainly not 'wasted' in that regard).

Just a remark from the sideline. What do you think?
I think someone from the statistics side should comment on this: cc @jseabold @josef-pkt and @cancan101 @kshedden @upandacross @cfarmer because they had some issues open which showed up after a search for "categorical" in Statsmodels.

I come from ggplot and plotting and I'm actually after the "reorder bar charts" (-> reorder levels) and "make faceting easier by letting empty levels show up in groupby" features (see linked issues above).

I can do a comparison with R: creation, adding to a df, reorder levels, change levels, groupby. Something else?
What also needs to be tested is "add to a df, sort the df on another column, see that the categorical series is changed accordingly". Also selection:
@jreback Thanks! I will try to add a few tests and see if everything works by tomorrow afternoon. It might happen that I don't get that far and I will be traveling until Thursday.
np, this is not going to be merged until next week anyhow (0.14.1 should be released on Friday), so after that
@JanSchulz updated with a test for using cats and non-cats. It ends up expanding the output space to basically be non-compressed again, but will see if that is an issue.
Doc: Add Release notes for pandas-dev#7217
@jreback please cherry-pick jankatins@b96cf3c (documentation updates in basics.rst for the new select_dtypes method)
@JanSchulz done
GH3943, GH5313, GH5314, GH7444

ENH: delegate _reduction and ops from Series to the categorical to support min/max and raise TypeError on other ops (numerical) and reduction
Add Categorical Properties to Series
Default to 'ordered' Categoricals if values are ordered
Categorical: add level assignments and reordering + changed default for ordered

Add a `Categorical.reorder_levels()` method. Change some naming in `Series`, so that the methods do not clash with established standards and rename the other categorical methods accordingly. Also change the default for `ordered` to True if values + levels are passed in at creation time.

Initial doc version for working with Categorical data
Categorical: add Categorical.mode() and use that in Series.mode()
Categorical: implement remove_unused_levels()
Categorical: implement value_count() for categorical series
Categorical: make Series.astype("category") work
ENH: add setitem to Categorical
BUG: assigning to levels not in level set now raises ValueError
API: disallow numpy ufuncs with categoricals
Categorical: Categorical assignment to int/obj column
ENH: add support for fillna to Categoricals
API: deprecate old style categorical constructor usage and change default

Before it was possible to pass in precomputed labels/pointer and the corresponding levels (e.g.: `Categorical([0,1,2], levels=["a","b","c"])`). This could lead to subtle errors in case of integer categoricals: the following could be both interpreted as "precomputed pointers and levels" or "values and levels", but converting it back to an integer array would result in different arrays:

`np.array(Categorical([1,2], levels=[1,2,3]))`
interpreted as pointers: `[2,3]`
interpreted as values: `[1,2]`

Up to now we would favour old style "pointer and levels" if these values could be interpreted as such (see code for details...). With this commit we favour new style "values and levels" and only attempt to interpret them as "pointers and levels" if "compat=True" is passed to the constructor.

BREAKS: This will break code which uses Categoricals with "pointer and levels". A short google search and a search on stackoverflow revealed no such usage.

Categorical: document constructor changes and small fixes
Categorical: document that inappropriate numpy functions won't work anymore
ENH: concat support
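The constructor ambiguity described in the commit message above can be reproduced with two small plain-Python helpers. This is a sketch for illustration only: `decode_as_codes` and `decode_as_values` are hypothetical names, not pandas functions, and raising on values that are not levels is an assumption of the sketch rather than the PR's documented behaviour.

```python
def decode_as_codes(data, levels):
    """Old style: `data` holds positions ("pointers") into `levels`."""
    return [levels[i] for i in data]


def decode_as_values(data, levels):
    """New style: `data` holds the values themselves."""
    for value in data:
        if value not in levels:
            # Assumption: this sketch simply rejects values that are not levels.
            raise ValueError(f"{value!r} is not a level")
    return list(data)


data, levels = [1, 2], [1, 2, 3]
print(decode_as_codes(data, levels))   # [2, 3]
print(decode_as_values(data, levels))  # [1, 2]
```

The same two inputs round-trip to different arrays depending on the interpretation, which is why the commit makes the old interpretation opt-in.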
Doc: Add Release notes for pandas-dev#7217
DOC: update v0.15.0 notes
Categorical: .codes should be immutable
ERR: codes modification raises ValueError always
Categorical: use Categorical.from_codes() in a few places
Categorical: Fix assigning a Categorical to an existing string column
CLN: CategoricalDtype repr now yields category
DISPLAY: show dtype when displaying Categorical series (for consistency)
BUG: fix groupby with multiple non-compressed categoricals
Categorical: minor doc cleanups
ENH: add a metaclass to CategoricalDtype to provide issubclass support (for select_dtypes)
TST: io/pytables.py tests now raise NotImplementedError for dtype==category
DOC: document the new category dtype in select_dtypes
@JanSchulz I just rebased this on current master. I think this is ready for merging. I am sure we will have to do a follow-up for doc fixes / clarifications. But merging makes sense sooner rather than later (so it can be beat up a bit in master). ok?
Yep, I will look out for categorical bugs and try to handle them.
WIP: categoricals as an internal CategoricalBlock GH5313
@JanSchulz thanks for this
add to the list: link from the whatsnew categorical changes section to the docs is broken
might want to move this to cat section http://pandas-docs.github.io/pandas-docs-travis/reshaping.html#computing-indicator-dummy-variables (and/or provide a link) (maybe not the get_dummies but the factorization section)
I think maybe move categorical to right after reshaping?
Another thing. change ordered default in
sure....go ahead and do a new PR for these items......(don't add to the old one).
@jreback re Docs and factorization: IMO, this should stay there and only gain a new para to link to the categorical docs and an example of how to get the same information from a categorical. Factorization probably has its uses without using a full
Idea for the text:
ok that's fine, though definitely links back-forth would be good (e.g. use of
Also prob need to add entries/links to
This PR creates a CategoricalBlock as a first class internal object, on par with blocks like Datetime, Timedelta, Object, Numeric etc.
closes #5313
closes #5314
closes #3943
TODOS
select_dtypes ENH: select_dypes impl #7434, jreback@a43c6c0
Code Changes
factor_agg/group_agg in core/frame.py -> look at unittests // a google search didn't turn up any questions/usage -> remove them?
Series(categorical).describe() / Categorical.unique() -> should this return all levels or only used levels?
categorical.T not implemented
Category.describe() with empty levels (will be fixed with groupby)
_pointers, _level_idx, _level_pointer)
Documentation
Future