Skip to content

crosstab is not respecting filtered scope when dtype=category #12298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rumbin opened this issue Feb 11, 2016 · 15 comments
Closed

crosstab is not respecting filtered scope when dtype=category #12298

rumbin opened this issue Feb 11, 2016 · 15 comments
Labels
API Design Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@rumbin
Copy link

rumbin commented Feb 11, 2016

First of all, I have to admit that I have some trouble finding a catchy title for this issue. Better suggestions are welcome...

There is some issue with crosstab, when applied on a DataFrame which has been derived from a DataFrame consisting of categorized columns.

Let's consider the following example df:

import pandas as pd
df = pd.DataFrame({'col0':list('abbabcabcab'), 'col1':[1,1,1,1,2,3,2,2,3,3,1],
                  'crit':[1,1,1,1,1,0,0,0,0,0,0]})

We apply some filter according to some criterion which happens to eliminate all occurrences of 'c' in 'col0' (and '3' in 'col1'). Then we build a crosstab:

df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1)

The result is as expected:

col1    1   2
col0        
a   2   0
b   2   1

Now we try the same on categorized columns:

for col in df.columns:
    df[col] = df[col].astype('category')
df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1

In this case, value 'c' and '3' are again listed, although they shouldn't be:

col1    1   2   3
col0            
a   2   0   0
b   2   1   0
c   0   0   0

My guess is that the reason for this behavior can be found in the lookup-table of the category, which does not account for values that are not represented any longer:
df_filtered.col0.cat.categories
Output:
Index(['a', 'b', 'c'], dtype='object'

So, either this lookup-table has to be fixed upon filtering, or pd.crosstab has to respect this condition.

@gfyoung
Copy link
Member

gfyoung commented Feb 12, 2016

@rumbin : I don't think there's anything wrong with the table output IMO. It is correctly reflecting that there are no crossed-tabbed factors with either 'c' or '3'. In addition, filtering out values does not change the dtype of the individual columns. That is why you still see the 'c' and '3'.

@gfyoung
Copy link
Member

gfyoung commented Feb 12, 2016

I should also add that in your first example, the dtype is just int or object so c is not an inherent part of the dtype in any of the columns, whereas that is not the case for your second example.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2016

this issue is related to this: #10772

dropna=True is the default and I think should be respected here.
Though one could make a case that since its a categorical we should include all of the original categories (as that's kind of the point).

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type Difficulty Intermediate labels Feb 12, 2016
@jreback jreback added this to the Next Major Release milestone Feb 12, 2016
@gfyoung
Copy link
Member

gfyoung commented Feb 12, 2016

@jreback : How does the dropna arg relate to this issue though?

@jreback
Copy link
Contributor

jreback commented Feb 12, 2016

if we accept the premise that we don't want to always represent categorical results (IOW we treat them like regular results), then a filtered row should be dropped.

@gfyoung
Copy link
Member

gfyoung commented Feb 12, 2016

@jreback : And by dropped, you mean not represented in the crosstab, even though the data is categorical?

@rumbin
Copy link
Author

rumbin commented Feb 12, 2016

In my eyes, from the user point of view, the dtype representation should not influence the resulting crosstab. After filtering the original table, some values are gone. So they should not be represented in any crosstab that we perform on a filtered table, regardless of what the categories are.

In my particular case, I am using categorial dtypes just for the sake of reduced RAM consumption and improved speed. Without categories I would easily end up swapping to disk every now and then.
So there are even columns of measurement values represented by categories.
After creating some crosstabs on filtered tables, it took me quite long to actually find out, why there are columns/rows wth all zeros. Looking up the respective column labels in the filtered tables leads to no results, so why are they listed in those crosstabs? This is definitely not what I as a user expected.

@gfyoung: I understand, that I am sort of abusing the concept of categorical data here, if we consider categories being something like, e.g., selling features, which typically consist of a handful, rarely changing members.
In my case, categories can easily consist of several thousand members (e.g. serial numbers or part numbers), where the filtering operation is crucial for concentrating just on a small subset withing these categories. But when the crosstab does not respect the previously applied filtering operation, this is sort of useless.

Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the sublte differences.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2016

@rumbin ok with that approach. This is essentially a 2-d value_counts; the difference is that .value_counts has an index, here it does not.

want to do a PR?

@rumbin
Copy link
Author

rumbin commented Feb 12, 2016

I'm afraid, the damage of any coding help of mine would be greater than any benefit, as I am fairly new to python (1-2 month) and my further programming "background" is not more than bash...

@jreback
Copy link
Contributor

jreback commented Feb 12, 2016

not a bad way to learn! heres the contribution guide to get started

gfyoung added a commit to forking-repos/pandas that referenced this issue May 12, 2016
@jankatins
Copy link
Contributor

In my case, categories can easily consist of several thousand members (e.g. serial numbers or part numbers)

IMO this is again a argument to get pd.String() and not use Category...

@jankatins
Copy link
Contributor

Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the subtle differences.

IMO "Categories" are more similar to "ints" than "strings": the whole range of possible values has a meaning, not only individual values (at the least when there is some order between these values). In a histogram you wouldn't leave out some ranges just because they are empty and similarly you shouldn't leave out empty categories.

@jorisvandenbossche
Copy link
Member

I agree with @JanSchulz

BTW, for this, there is the .cat.remove_unused_categories() method to explicitly get only the used categories.

@gfyoung
Copy link
Member

gfyoung commented May 12, 2016

@jorisvandenbossche : Didn't know that! Good to know for future reference!

@jreback
Copy link
Contributor

jreback commented May 13, 2016

closing, but if @rumbin of @gfyoung (or anyone else!) wants to submit a documentation PR for an example (in the crosstab section, maybe cross-linking to categorical), would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

5 participants