crosstab is not respecting filtered scope when dtype=category #12298

rumbin · 2016-02-11T22:10:34Z

First of all, I have to admit that I have some trouble finding a catchy title for this issue. Better suggestions are welcome...

There is some issue with crosstab, when applied on a DataFrame which has been derived from a DataFrame consisting of categorized columns.

Let's consider the following example df:

import pandas as pd
df = pd.DataFrame({'col0':list('abbabcabcab'), 'col1':[1,1,1,1,2,3,2,2,3,3,1],
                  'crit':[1,1,1,1,1,0,0,0,0,0,0]})

We apply some filter according to some criterion which happens to eliminate all occurrences of 'c' in 'col0' (and '3' in 'col1'). Then we build a crosstab:

df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1)

The result is as expected:

col1    1   2
col0        
a   2   0
b   2   1

Now we try the same on categorized columns:

for col in df.columns:
    df[col] = df[col].astype('category')
df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1

In this case, value 'c' and '3' are again listed, although they shouldn't be:

col1    1   2   3
col0            
a   2   0   0
b   2   1   0
c   0   0   0

My guess is that the reason for this behavior can be found in the lookup-table of the category, which does not account for values that are not represented any longer:
df_filtered.col0.cat.categories
Output:
Index(['a', 'b', 'c'], dtype='object'

So, either this lookup-table has to be fixed upon filtering, or pd.crosstab has to respect this condition.

The text was updated successfully, but these errors were encountered:

gfyoung · 2016-02-12T00:16:45Z

@rumbin : I don't think there's anything wrong with the table output IMO. It is correctly reflecting that there are no crossed-tabbed factors with either 'c' or '3'. In addition, filtering out values does not change the dtype of the individual columns. That is why you still see the 'c' and '3'.

gfyoung · 2016-02-12T00:23:05Z

I should also add that in your first example, the dtype is just int or object so c is not an inherent part of the dtype in any of the columns, whereas that is not the case for your second example.

jreback · 2016-02-12T00:43:05Z

this issue is related to this: #10772

dropna=True is the default and I think should be respected here.
Though one could make a case that since its a categorical we should include all of the original categories (as that's kind of the point).

gfyoung · 2016-02-12T00:48:05Z

@jreback : How does the dropna arg relate to this issue though?

jreback · 2016-02-12T01:00:26Z

if we accept the premise that we don't want to always represent categorical results (IOW we treat them like regular results), then a filtered row should be dropped.

gfyoung · 2016-02-12T01:09:23Z

@jreback : And by dropped, you mean not represented in the crosstab, even though the data is categorical?

rumbin · 2016-02-12T05:58:24Z

In my eyes, from the user point of view, the dtype representation should not influence the resulting crosstab. After filtering the original table, some values are gone. So they should not be represented in any crosstab that we perform on a filtered table, regardless of what the categories are.

In my particular case, I am using categorial dtypes just for the sake of reduced RAM consumption and improved speed. Without categories I would easily end up swapping to disk every now and then.
So there are even columns of measurement values represented by categories.
After creating some crosstabs on filtered tables, it took me quite long to actually find out, why there are columns/rows wth all zeros. Looking up the respective column labels in the filtered tables leads to no results, so why are they listed in those crosstabs? This is definitely not what I as a user expected.

@gfyoung: I understand, that I am sort of abusing the concept of categorical data here, if we consider categories being something like, e.g., selling features, which typically consist of a handful, rarely changing members.
In my case, categories can easily consist of several thousand members (e.g. serial numbers or part numbers), where the filtering operation is crucial for concentrating just on a small subset withing these categories. But when the crosstab does not respect the previously applied filtering operation, this is sort of useless.

Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the sublte differences.

jreback · 2016-02-12T16:00:38Z

@rumbin ok with that approach. This is essentially a 2-d value_counts; the difference is that .value_counts has an index, here it does not.

want to do a PR?

rumbin · 2016-02-12T16:24:02Z

I'm afraid, the damage of any coding help of mine would be greater than any benefit, as I am fairly new to python (1-2 month) and my further programming "background" is not more than bash...

jreback · 2016-02-12T16:32:58Z

not a bad way to learn! heres the contribution guide to get started

Closes pandas-devgh-12298. [ci skip]

jankatins · 2016-05-12T22:10:43Z

In my case, categories can easily consist of several thousand members (e.g. serial numbers or part numbers)

IMO this is again a argument to get pd.String() and not use Category...

jankatins · 2016-05-12T22:19:48Z

Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the subtle differences.

IMO "Categories" are more similar to "ints" than "strings": the whole range of possible values has a meaning, not only individual values (at the least when there is some order between these values). In a histogram you wouldn't leave out some ranges just because they are empty and similarly you shouldn't leave out empty categories.

jorisvandenbossche · 2016-05-12T22:36:12Z

I agree with @JanSchulz

BTW, for this, there is the .cat.remove_unused_categories() method to explicitly get only the used categories.

gfyoung · 2016-05-12T22:38:22Z

@jorisvandenbossche : Didn't know that! Good to know for future reference!

jreback · 2016-05-13T00:21:24Z

closing, but if @rumbin of @gfyoung (or anyone else!) wants to submit a documentation PR for an example (in the crosstab section, maybe cross-linking to categorical), would be great.

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type Difficulty Intermediate labels Feb 12, 2016

jreback added this to the Next Major Release milestone Feb 12, 2016

gfyoung mentioned this issue May 3, 2016

BUG: Respect Filtered Categorical in Crosstab #13073

Closed

gfyoung added a commit to forking-repos/pandas that referenced this issue May 12, 2016

BUG: Respect filtered Categorical in crosstab

31db27e

Closes pandas-devgh-12298. [ci skip]

jreback closed this as completed May 13, 2016

jorisvandenbossche removed Difficulty Intermediate labels May 13, 2016

jorisvandenbossche modified the milestones: No action, Next Major Release May 13, 2016

jreback mentioned this issue Jan 23, 2017

pivot_table over Categorical columns #15193

Closed

TomAugspurger mentioned this issue May 16, 2017

pd.crosstab, categorical data and missing instances #16367

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crosstab is not respecting filtered scope when dtype=category #12298

crosstab is not respecting filtered scope when dtype=category #12298

rumbin commented Feb 11, 2016

gfyoung commented Feb 12, 2016

gfyoung commented Feb 12, 2016

jreback commented Feb 12, 2016

gfyoung commented Feb 12, 2016

jreback commented Feb 12, 2016

gfyoung commented Feb 12, 2016

rumbin commented Feb 12, 2016

jreback commented Feb 12, 2016

rumbin commented Feb 12, 2016

jreback commented Feb 12, 2016

jankatins commented May 12, 2016

jankatins commented May 12, 2016

jorisvandenbossche commented May 12, 2016

gfyoung commented May 12, 2016

jreback commented May 13, 2016

crosstab is not respecting filtered scope when dtype=category #12298

crosstab is not respecting filtered scope when dtype=category #12298

Comments

rumbin commented Feb 11, 2016

gfyoung commented Feb 12, 2016

gfyoung commented Feb 12, 2016

jreback commented Feb 12, 2016

gfyoung commented Feb 12, 2016

jreback commented Feb 12, 2016

gfyoung commented Feb 12, 2016

rumbin commented Feb 12, 2016

jreback commented Feb 12, 2016

rumbin commented Feb 12, 2016

jreback commented Feb 12, 2016

jankatins commented May 12, 2016

jankatins commented May 12, 2016

jorisvandenbossche commented May 12, 2016

gfyoung commented May 12, 2016

jreback commented May 13, 2016