Skip to content

BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rosnfeld opened this issue Aug 8, 2015 · 19 comments · Fixed by #53205
Closed

BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

rosnfeld opened this issue Aug 8, 2015 · 19 comments · Fixed by #53205
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@rosnfeld
Copy link
Contributor

rosnfeld commented Aug 8, 2015

If one uses series.value_counts(dropna=False) it includes NaN values, but crosstab(series1, series2, dropna=False) does not include NaN values. I think of cross_tab kind of like 2-dimensional value_counts(), so this is surprising.

In [31]: x = pd.Series(['a', 'b', 'a', None, 'a'])

In [32]: y = pd.Series(['c', 'd', 'd', 'c', None])

In [33]: y.value_counts(dropna=False)
Out[33]: 
d      2
c      2
NaN    1
dtype: int64

In [34]: pd.crosstab(x, y, dropna=False)
Out[34]: 
       c  d
row_0      
a      1  1
b      0  1

I believe what crosstab(..., dropna=False) really does is just include rows/columns that would otherwise have all zeros, but that doesn't seem the same as dropna=False to me. I would have expected to see a row and column entry for NaN, something like:

       c  d  NaN
row_0      
a      1  1    1
b      0  1    0
NaN    1  0    0

Thoughts on this?

This is on current master (almost 0.17).

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Aug 11, 2015
@jreback jreback added this to the Next Major Release milestone Aug 11, 2015
@jreback
Copy link
Contributor

jreback commented Aug 11, 2015

can you show what you would expect as ouptput?

@rosnfeld
Copy link
Contributor Author

@jreback I updated the original description to include the expected output.

This further example (a slightly modified version of one from the crosstab docs) shows what dropna currently does:

In [69]: a = np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'])

In [70]: b = np.array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'one', 'one', 'one'])

In [71]: c = np.array(['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'])

In [72]: pd.crosstab(a, [b, c], dropna=True)
Out[72]: 
col_0  one        two
col_1 dull shiny dull
row_0                
bar      1     2    1
foo      2     4    1

In [73]: pd.crosstab(a, [b, c], dropna=False)
Out[73]: 
       one        two      
      dull shiny dull shiny
row_0                      
bar      1     2    1     0
foo      2     4    1     0

I think we need a different name for this behavior. drop_empty? drop_all_zero? (or perhaps include_all_zero=False) Is there any naming precedent set elsewhere in the codebase?

Or perhaps the behavior I am looking for should have the new name, like drop_na_values... though really I think the connection with the existing value_counts(dropna=...) behavior is fairly strong.

@rosnfeld
Copy link
Contributor Author

My R is very rusty, but it does support this feature, just for reference of how another system does it. (I actually don't know how to do in R what pandas is currently doing with dropna, where it includes all Cartesian product variants of a multi-level index.) And R even more explicitly matches the behavior between value_counts() and cross_tab() since you can use the same R function, table, for both.

> x <- c('a', 'b', 'a', NA, 'a')
> y <- c('c', 'd', 'd', 'c', NA)
> table(x)
x
a b 
3 1 
> # one way for including NA's
> table(x, exclude=NULL)
x
   a    b <NA> 
   3    1    1 
> # another way for including NA's
> table(x, useNA='ifany')
x
   a    b <NA> 
   3    1    1 
> table(x, y)
   y
x   c d
  a 1 1
  b 0 1
> table(x, y, useNA='ifany')
      y
x      c d <NA>
  a    1 1    1
  b    0 1    0
  <NA> 1 0    0

@rosnfeld
Copy link
Contributor Author

I keep running into this. Hopefully I will have time to look at it soon, or it will get fixed as a result of some of the missing-value-handling issues that people are hoping to resolve with a change to pandas underlying data representations.

Here's a related issue (could perhaps be its own issue). Set up x and y as above, and then try and use crosstab with margins:

In [8]: pd.crosstab(x, y, margins=True)
Out[8]: 
col_0  c  d  All
row_0           
a      1  1    3
b      0  1    1
All    2  2    5

In [9]: # huh that's weird, the columns don't sum

In [10]: pd.crosstab(x, y, dropna=False, margins=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
   4417                 raise ValueError('The name %s occurs multiple times, use a '
-> 4418                                  'level number' % level)
   4419             level = self.names.index(level)

ValueError: The name None occurs multiple times, use a level number

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-b11080c1dcf8> in <module>()
----> 1 pd.crosstab(x, y, dropna=False, margins=True)

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, dropna)
    426         df['__dummy__'] = 0
    427         table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
--> 428                                aggfunc=len, margins=margins, dropna=dropna)
    429         return table.fillna(0).astype(np.int64)
    430     else:

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
    145         table = _add_margins(table, data, values, rows=index,
    146                              cols=columns, aggfunc=aggfunc,
--> 147                              margins_name=margins_name)
    148 
    149     # discard the top level

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in _add_margins(table, data, values, rows, cols, aggfunc, margins_name)
    175     if hasattr(table, 'columns'):
    176         for level in table.columns.names[1:]:
--> 177             if margins_name in table.columns.get_level_values(level):
    178                 raise ValueError(exception_msg)
    179 

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in get_level_values(self, level)
   4565         values : ndarray
   4566         """
-> 4567         num = self._get_level_number(level)
   4568         unique = self.levels[num]  # .values
   4569         labels = self.labels[num]

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
   4420         except ValueError:
   4421             if not isinstance(level, int):
-> 4422                 raise KeyError('Level %s not found' % str(level))
   4423             elif level < 0:
   4424                 level += self.nlevels

KeyError: 'Level None not found'

@nickeubank
Copy link
Contributor

Agreed. @rosnfeld, if I follow right, you think the behavior should be that:

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan], 'b': [3, 3, 4]})
pd.crosstab(df.a, df.b, dropna=False)

returns:

      3  4 
a        
NaN   2  1  

Not an empty dataframe?

@rosnfeld
Copy link
Contributor Author

rosnfeld commented Apr 4, 2016

Correct. That's similar to what R would do with useNA='ifany':

> a <- c(NA, NA, NA)
> b <- c(3, 3, 4)
> table(a, b)
< table of extent 0 x 2 >
> table(a, b, useNA='ifany')
      b
a      3 4
  <NA> 2 1

@nickeubank
Copy link
Contributor

It's also what Stata does if one uses the option ", missing"
On Mon, Apr 4, 2016 at 1:15 AM rosnfeld [email protected] wrote:

Correct. That's similar to what R would do with useNA='ifany':

a <- c(NA, NA, NA)
b <- c(3, 3, 4)
table(a, b)
< table of extent 0 x 2 >
table(a, b, useNA='ifany')
b
a 3 4
2 1


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#10772 (comment)

@XikunHan
Copy link

dropna : boolean, default True
Do not include columns whose entries are all NaN

Only works when the whole column is NaN, but if some values are NaN, cannot count them

@oriolmirosa
Copy link

This also strikes me as odd and inconsistent, and I'd love to see a fix. Meanwhile, the way I deal with it is by adding .fillna(' NaN) to the variables called in pd.crosstab(). For instance, in the example above:

pd.crosstab(df.a.fillna(' NaN), df.b.fillna(' NaN), dropna=False)

Of course, you could omit the .fillna(' NaN') from the column with no null values. I add a space before NaN so that the NaN row and/or column appear first. Otherwise, they would be ordered alphabetically and probably be in the middle of the rows/columns in the output (I guess you could prepend it by a z to put it as the last row/column, but that looks a little silly).

@ceilingcat
Copy link

I also would like to see this fixed.

@TomAugspurger
Copy link
Contributor

@ceilingcat let us know if you need help starting a PR.

@dsaxton
Copy link
Member

dsaxton commented Oct 24, 2018

I agree this would make a nice improvement, especially since the fillna fix doesn't really work for categorical columns since the fill value needs to be an available level.

@pauljohn32
Copy link

pauljohn32 commented Apr 9, 2019

Do I have just another example of same problem? Pandas 0.24

I converted some elements in x to missings as `y``,

In [253]: x
Out[253]: 
0     1
1     2
2     3
3     4
4     5
5    55
6    22
7    44
8    12
9    55
dtype: int64

In [254]: y1
Out[254]: 
0     1.0
1     2.0
2     NaN
3     4.0
4     5.0
5    55.0
6     NaN
7    44.0
8     NaN
9    55.0
dtype: float64

In [255]: y1.value_counts(dropna=False)
Out[255]: 
NaN     3
55.0    2
44.0    1
5.0     1
4.0     1
2.0     1
1.0     1
dtype: int64

In [256]: pd.crosstab(x, y1, dropna=False)
Out[258]: 
col_0  1.0   2.0   4.0   5.0   44.0  55.0
row_0                                    
1         1     0     0     0     0     0
2         0     1     0     0     0     0
4         0     0     1     0     0     0
5         0     0     0     1     0     0
44        0     0     0     0     1     0
55        0     0     0     0     0     2

I expect to see a column labeled "NaN" with values

        1.0   2.0   4.0  5.0  44.0  55.0  NaN
1         1     0     0     0     0     0          0
3         0     0     0     0     0     0          1
2         0     1     0     0     0     0          0
4         0     0     1     0     0     0          0
5         0     0     0     1     0     0          0
12       0     0     0     0     0     0          1
22       0     0     0     0     0     0          1
44       0     0     0     0     1     0          0
55       0     0     0     0     0     2          0

@kflamm
Copy link

kflamm commented May 20, 2019

The current (0.24) Pandas documentation should say dropna: "Do not include columns OR ROWS whose entries are all NaN", because that is what the current behavior actually seems to be: when rows/columns are entirely empty, rows/columns are dropped with default dropna = True.

With dropna=False, empty rows show up as 0 value entries in columns, and empty columns show up as 0 values in rows. Previous example shows that NaN is apparently not considered to be a possible row/column value with dropna=False.

It seems reasonable to suggest that NaN be added to the set of possible values for both rows and columns in the dataframe created when crosstab has dropna=False set by a user. That would replicate behavior in other statistical software (e.g., Stata when tab used with "missing" option). More importantly, using this option necessarily means you are interested in understanding the joint distribution of NaNs in the two series.

Current behavior of crosstab('y_variable','x_variable',dropna=False) appears to be equivalent to groupby('y_variable').agg({'x_variable':'count'}) with a binary dummy x variable.

To preserve this equivalence, it may be worthwhile to instead consider a switch adding NaN to the sets of tabulated values in a crosstab rather than changing behavior with dropna=False.

@russellpierce
Copy link

russellpierce commented Aug 7, 2019

Is there a reasonably clean way to get the desired output within the context of Pandas as it is today?

@mroeschke mroeschke added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug and removed API Design labels Apr 18, 2021
@buhtz
Copy link

buhtz commented May 18, 2022

Is there any planing or roadmap involving this Issue?

@MarcoGorelli
Copy link
Member

@Codeberg-AsGithubAlternative-buhtz it doesn't look like anyone is currently working on this, but you're welcome to take it on if you'd like to - check the contributing guide for how to get started

I'm inclined to agree that it'd be an improvement to make this consistent with the dropna in value_counts

@buhtz
Copy link

buhtz commented May 19, 2022

@MarcoGorelli Please stop blaming others to not contributing. On my side this feels inpolite.

I haven't asked about who is working on this or when is it done. The Issue comments indicating that currently there is no decision on site of the maintainers about how a solution could look like. Because of that I asked for planing and roadmap. But I understand that someone could understand that terms in the context of timeline. But I meant it more in a context of design decisions.

@MarcoGorelli
Copy link
Member

Hi @Codeberg-AsGithubAlternative-buhtz

I didn't blame you, it just sounded like you were interested in working on this

Are you asking about whether there's agreement that the expected output from the original post is the one to aim for?

If so, then I agree that it would be an improvement. Seems like @dsaxton agrees too, and @TomAugspurger has welcomed a PR above.
I presume @jreback agrees too, as the issue hasn't been closed and he's added labels to it

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.