crosstab is not respecting filtered scope when dtype=category #12298
Comments
@rumbin : I don't think there's anything wrong with the table output IMO. It is correctly reflecting that there are no crossed-tabbed factors with either 'c' or '3'. In addition, filtering out values does not change the |
I should also add that in your first example, the |
this issue is related to this: #10772
@jreback : How does the |
if we accept the premise that we don't want to always represent categorical results (IOW we treat them like regular results), then a filtered row should be dropped. |
@jreback : And by dropped, you mean not represented in the crosstab, even though the data is categorical? |
In my eyes, from the user's point of view, the dtype representation should not influence the resulting crosstab. After filtering the original table, some values are gone. So they should not be represented in any crosstab that we perform on a filtered table, regardless of what the categories are. In my particular case, I am using categorical dtypes just for the sake of reduced RAM consumption and improved speed. Without categories I would easily end up swapping to disk every now and then. @gfyoung: I understand that I am sort of abusing the concept of categorical data here, if we consider categories to be something like, e.g., selling features, which typically consist of a handful of rarely changing members. Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the subtle differences. |
@rumbin ok with that approach. This is essentially a 2-d value_counts; the difference is that want to do a PR? |
I'm afraid the damage of any coding help of mine would be greater than any benefit, as I am fairly new to Python (1-2 months) and my further programming "background" is no more than bash... |
not a bad way to learn! here's the contribution guide to get started |
Closes pandas-devgh-12298. [ci skip]
IMO this is again an argument to get |
IMO "Categories" are more similar to "ints" than "strings": the whole range of possible values has a meaning, not only individual values (at the least when there is some order between these values). In a histogram you wouldn't leave out some ranges just because they are empty and similarly you shouldn't leave out empty categories. |
I agree with @JanSchulz BTW, for this, there is the |
@jorisvandenbossche : Didn't know that! Good to know for future reference! |
First of all, I have to admit that I have some trouble finding a catchy title for this issue. Better suggestions are welcome...
There is some issue with crosstab, when applied on a DataFrame which has been derived from a DataFrame consisting of categorized columns.
Let's consider the following example df:
We apply some filter according to some criterion which happens to eliminate all occurrences of 'c' in 'col0' (and '3' in 'col1'). Then we build a crosstab:
The result is as expected:
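For readers following along, here is a minimal sketch of this non-categorical baseline. The frame, the `keep` column and the filter are hypothetical stand-ins chosen only so that the filter removes every 'c' in 'col0' and every 3 in 'col1'; they are not the data from the original report.

```python
import pandas as pd

# Hypothetical data (assumption): not the frame from the original report.
df = pd.DataFrame({
    "col0": ["a", "b", "c", "a", "b", "c"],
    "col1": [1, 2, 3, 1, 2, 3],
    "keep": [True, True, False, True, True, False],
})

# The filter happens to drop every row holding 'c' (and therefore every 3).
df_filtered = df[df["keep"]]

# With plain (non-categorical) columns, only values still present
# in the filtered frame show up as row and column labels.
table = pd.crosstab(df_filtered["col0"], df_filtered["col1"])
print(table)
```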
Now we try the same on categorized columns:
In this case, value 'c' and '3' are again listed, although they shouldn't be:
My guess is that the reason for this behavior can be found in the lookup-table of the category, which does not account for values that are not represented any longer:
df_filtered.col0.cat.categories
Output:
Index(['a', 'b', 'c'], dtype='object')
So, either this lookup-table has to be fixed upon filtering, or pd.crosstab has to respect this condition.
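As a possible user-side workaround for the first option, the unused categories can be dropped by hand after filtering. The sketch below uses hypothetical data (not the frame from this report) and relies on `Series.cat.remove_unused_categories()`; it is one way to get there, not necessarily how pandas itself should resolve the issue.

```python
import pandas as pd

# Hypothetical data (assumption): not the frame from the original report.
df = pd.DataFrame({
    "col0": ["a", "b", "c", "a"],
    "col1": [1, 2, 3, 1],
}).astype("category")

# Filtering removes the rows, but the category lookup table is untouched.
df_filtered = df[df["col0"] != "c"]
stale = list(df_filtered["col0"].cat.categories)

# Drop categories that no longer occur in the filtered data.
df_clean = df_filtered.copy()
for name in ["col0", "col1"]:
    df_clean[name] = df_clean[name].cat.remove_unused_categories()
fresh = list(df_clean["col0"].cat.categories)

print(stale)
print(fresh)

# The crosstab now only sees categories that are actually present.
table = pd.crosstab(df_clean["col0"], df_clean["col1"])
print(table)
```

The explicit loop keeps the categorical dtype on each column; the cost is one extra pass over each column's codes.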