Zero counts in Series.value_counts for categoricals #8559
can u show a specific example (of what u think it should do) |
cc @JanSchulz |
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'd', 'c'])
count_str = s[s.isin(['a', 'u'])].value_counts()                     # a: 2, nothing else
count_cat = s.astype('category')[s.isin(['a', 'u'])].value_counts()  # a: 2, plus b, c, d with count 0
|
.. and yes version is 0.15.0rc1-24-g56dbb8c |
I think the current behaviour is correct: a categorical is not a more memory-efficient string dtype but a dtype with a fixed set of values. One of the main points of categoricals is that "unused" categories show up in all kinds of operations, e.g. during groupby and during value_counts. This will come in handy in ggplot, where plot axes should be the same for all facets and unused cats should show up with zero-length bars. If you want to have the same output, you need to do the "isin" with the results of … |
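A short sketch of the behaviour described here, assuming pandas >= 0.15 (remove_unused_categories is shown as one way, not necessarily the one the comment had in mind, to recover the string-dtype output):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'd', 'c'], dtype='category')
filtered = s[s.isin(['a', 'u'])]

# The filtered categorical still knows about all four categories,
# so the unused ones are reported with a count of zero:
print(filtered.value_counts())  # a: 2, b: 0, c: 0, d: 0

# Dropping the unused categories first matches the plain-string output:
print(filtered.cat.remove_unused_categories().value_counts())  # a: 2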
IMO, it's also consistent, as value_counts counts every value it knows about, and in the case of categoricals it knows that there are more than only the "used" categories. |
what about adding the dropna=False arg? |
|
@fkaufer What is actually the use case here, i.e., why do you need a Categorical and zero-cats removed? |
|
Re plotting: having all values preserved in all facets (so not …
Interestingly, in R, unique returns a factor (with all levels, but only the "used" levels as values) when the input is a factor: …
This is IMO an argument to drop unused categories in unique. As a workaround for your seaborn problem, you can use …

Re value_counts:
I think a … Or you have a "remove unused categories" step in between... What will happen in your app when you reorder the categories (e.g., "one" < "two" < "three")?

Re metadata: I see the levels as actually being part of every item of the categorical data (which they are in R, but right now not in pandas: getting a single item will return a single-item cat in R but an int/string/... in pandas): …
From your metadata comment and the last bullet I think what you want is a memory-efficient string dtype. This could actually be done by subclassing categorical, "hiding" the categorical thingies, and adding categories automatically during set. Should actually be almost trivial... This was actually one argument to implement such a data type in numpy, so that they have a proper variable-length string dtype :-)

=> I see the problem with … |
Oh my:
-> dplyr omits unused levels in group_by.

Following this would mean that pandas groupby should also not return empty (unused) categories... @hadley is that intentional? |
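For contrast, a minimal pandas version of the behaviour under discussion (using the observed keyword of later pandas releases to spell out the historical default):

import pandas as pd

s = pd.Series(['a', 'a', 'c'], dtype='category').cat.set_categories(['a', 'b', 'c'])
df = pd.DataFrame({'g': s, 'x': [1, 2, 3]})

# Unlike dplyr's group_by, pandas keeps the unused category 'b'
# around as an empty group:
print(df.groupby('g', observed=False).size())  # a: 2, b: 0, c: 1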
For the … |
Ok, I will prepare a PR for the unique case. What about the rest? Removing empty groups from groupby will be a deeper change than the unique one... |
@JanSchulz IIRC we specifically made the groupby return ALL of the categorical groups (I like this and think this makes sense). Unique I suppose is a different issue, though (and I agree with the above). |
so I'd like to ask if we think there are actually 2 'categorical' types:
- a 'real' categorical, with a fixed and meaningful set of values (where unused categories should show up)
- a memory-efficient string dtype (where only the values actually present matter)
These seem really close (and in fact we don't distinguish these), should we? |
The biggest difference: how "not in categories" values are handled, e.g. when using concat or setting new values. |
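A quick illustration of that difference (a sketch; the exact exception type has varied across pandas versions, hence the broad except clause):

import pandas as pd

s = pd.Series(['a', 'b'], dtype='category')

# Setting a value outside the categories raises, while an object-dtype
# Series would simply accept it:
try:
    s[0] = 'z'
except (TypeError, ValueError) as err:
    print(err)

# concat of categoricals with different categories falls back to
# object dtype instead of staying categorical:
a = pd.Series(['a'], dtype='category')
b = pd.Series(['z'], dtype='category')
print(pd.concat([a, b]).dtype)  # object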
Just to clarify: I do not only (mis)use categoricals for memory efficiency of string variables. But this is - along with the custom ordering - something I get right out of the box now, whereas for the other benefits (signalling for stats/ML, plotting) it will take some time until the respective libs directly support pandas categoricals. That said, I don't think there should be two different categorical types.

I guess the difference in our views on categoricals is rather a matter of the size (cardinality) of the categorical. To me it seems the current design is for categoricals of small cardinality, rather coming from boolean vars. In those cases I can understand your take on value_counts and plotting. I have these categoricals as well, but I also have many categoricals with cardinality in the order of tens, hundreds, and even thousands. To me that makes perfect sense, and I consider them "real categoricals" (IMO the main strong criterion to qualify as a "real categorical" is the fixed range of values). Plotting diagrams with these large-cardinality categoricals typically means you have applied some filtering before, which has virtually decreased the cardinality, hence plotting zero-length bars and showing zero frequencies is really not what you want. Probably I would even use dropna=False more often than dropzero=False.

Personally, I would rather suggest having a separate new method - "levels", "tabulate", or "cat_freq" - and keeping the existing methods (unique, value_counts, groupby, ...) consistent with other data types. Such a new method could then also be applied to all dtypes.

Similar to the metadata perspective, I also like to consider categoricals as separate dimension/lookup tables, as in databases, with the built-in feature of being auto-joined whenever I use the categorical for projection (i.e. SELECT in SQL) or certain selections (WHERE cat=scalar). For a database query you would then apply a left or inner/natural join (for real categoricals with fixed values, left and inner join are the same), which is also the default behaviour of pandas' merge (default: inner) and join (default: left). The current behaviour of value_counts et al. corresponds to a full/right outer join (analogous to left/inner: full and right outer join are equivalent for real categoricals), which feels as unnatural as if pandas' join/merge default were set to full/right outer join. |
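A rough translation of the database analogy into pandas (the table and column names are illustrative, not from the thread):

import pandas as pd

dim = pd.DataFrame({'cat': ['a', 'b', 'c', 'd']})  # lookup table of all known categories
obs = pd.DataFrame({'cat': ['a', 'a', 'c']})       # observed data after some filtering

counts = obs.groupby('cat').size().reset_index(name='n')

# Inner/left join: only categories that actually occur
# (the behaviour argued for above):
print(dim.merge(counts, on='cat', how='inner'))

# Full outer join: every known category, NaN for the unused ones
# (what value_counts on a categorical currently corresponds to):
print(dim.merge(counts, on='cat', how='outer'))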
@fkaufer join behaviour we will have to think about, but all for consistency.

@JanSchulz can u prepare a pr for reverting? I think this just means honoring the dropna in unique, value_counts |
@JanSchulz it's on the long-term to-do list. |
@hadley Just that I understand it correctly: you plan to change group_by to include empty levels? |
@fkaufer Re your app: if you use custom ordering and therefore special-case categorical data, then it wouldn't matter to use a …

Re categoricals and "metadata": I don't see them as a "join operation between codes and categories", but as a new data type which can only take a few values (like you can't put a value larger than max-int into an int array). As such, each individual entry consists of "value and metadata", the same as an int is "value and metadata", only that in the int case the metadata is encoded in the length of the memory block which is used to store the int. They are "just" implemented like a database join... |
@jreback The unique case is IMO smaller, as it only takes a few lines in … |
going to move this to 0.15.1 |
ok, so to summarise: …
@JanSchulz you are doing a PR for the unique case? |
Unique should be easy, just do the unique on the codes and then take the corresponding categories. |
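A minimal sketch of that approach (not the actual pandas implementation):

import pandas as pd

s = pd.Series(['b', 'a', 'b'], dtype='category')

# unique on the integer codes, then map them back to the category
# labels; the sentinel -1 marks NaN and is skipped:
codes = s.cat.codes.unique()
uniques = s.cat.categories.take(codes[codes != -1])
print(uniques)  # Index(['b', 'a'], ...) - only categories that occur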
@JanSchulz can you revisit. See what we need from this issue. |
I believe that the output of value_counts, when applied to categorical variables, shouldn't print values that are nonexistent/not assigned for that variable in the current dataframe. This improvement would benefit a lot of pandas users in data analysis. Do we have an estimate of when it will be included? Thanks! |
In the absence of clear guidance about whether to change anything, I'm closing this as Won't Fix. Note: R maintains the empty categories when tabulating factor counts. |
we can avoid the zero counts in the series by simply: … |
also necroposting... if you need to drop categoricals with zero counts, do … |
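For reference, two common ways to get there (a sketch of the kind of workaround both of these comments point at, assuming a categorical Series s):

import pandas as pd

s = pd.Series(['a', 'a'], dtype='category').cat.set_categories(['a', 'b'])

# Option 1: mask out the zero rows after counting:
counts = s.value_counts()
print(counts[counts != 0])  # a: 2

# Option 2: drop the unused categories before counting:
print(s.cat.remove_unused_categories().value_counts())  # a: 2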
Series.value_counts() also shows categories with count 0. I thought this would be a bug, but according to the docs it is intentional. This makes the output of value_counts inconsistent when switching between category and non-category dtypes. Apart from that, it blows up the value_counts output for series with many categories. I would prefer to hide counts (i.e. zero) for non-occurring categories by default and rather consider a parameter dropzero=True, similar to dropna (see also #5569).
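For comparison, the dropna parameter the report refers to does exist; dropzero does not, and is named below only as the proposal (a sketch):

import numpy as np
import pandas as pd

s = pd.Series(['a', 'a', np.nan], dtype='category').cat.set_categories(['a', 'b'])

print(s.value_counts(dropna=False))  # a: 2, NaN: 1, b: 0

# A dropzero=True parameter (proposed here, not implemented in pandas)
# would analogously remove the zero-count 'b' row.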