BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

rosnfeld · 2015-08-08T22:20:16Z

If one uses series.value_counts(dropna=False) it includes NaN values, but crosstab(series1, series2, dropna=False) does not include NaN values. I think of cross_tab kind of like 2-dimensional value_counts(), so this is surprising.

In [31]: x = pd.Series(['a', 'b', 'a', None, 'a'])

In [32]: y = pd.Series(['c', 'd', 'd', 'c', None])

In [33]: y.value_counts(dropna=False)
Out[33]: 
d      2
c      2
NaN    1
dtype: int64

In [34]: pd.crosstab(x, y, dropna=False)
Out[34]: 
       c  d
row_0      
a      1  1
b      0  1

I believe what crosstab(..., dropna=False) really does is just include rows/columns that would otherwise have all zeros, but that doesn't seem the same as dropna=False to me. I would have expected to see a row and column entry for NaN, something like:

       c  d  NaN
row_0      
a      1  1    1
b      0  1    0
NaN    1  0    0

Thoughts on this?

This is on current master (almost 0.17).

The text was updated successfully, but these errors were encountered:

jreback · 2015-08-11T10:43:21Z

can you show what you would expect as ouptput?

rosnfeld · 2015-08-15T19:41:16Z

@jreback I updated the original description to include the expected output.

This further example (a slightly modified version of one from the crosstab docs) shows what dropna currently does:

In [69]: a = np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'])

In [70]: b = np.array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'one', 'one', 'one'])

In [71]: c = np.array(['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'])

In [72]: pd.crosstab(a, [b, c], dropna=True)
Out[72]: 
col_0  one        two
col_1 dull shiny dull
row_0                
bar      1     2    1
foo      2     4    1

In [73]: pd.crosstab(a, [b, c], dropna=False)
Out[73]: 
       one        two      
      dull shiny dull shiny
row_0                      
bar      1     2    1     0
foo      2     4    1     0

I think we need a different name for this behavior. drop_empty? drop_all_zero? (or perhaps include_all_zero=False) Is there any naming precedent set elsewhere in the codebase?

Or perhaps the behavior I am looking for should have the new name, like drop_na_values... though really I think the connection with the existing value_counts(dropna=...) behavior is fairly strong.

rosnfeld · 2015-08-16T08:40:37Z

My R is very rusty, but it does support this feature, just for reference of how another system does it. (I actually don't know how to do in R what pandas is currently doing with dropna, where it includes all Cartesian product variants of a multi-level index.) And R even more explicitly matches the behavior between value_counts() and cross_tab() since you can use the same R function, table, for both.

> x <- c('a', 'b', 'a', NA, 'a')
> y <- c('c', 'd', 'd', 'c', NA)
> table(x)
x
a b 
3 1 
> # one way for including NA's
> table(x, exclude=NULL)
x
   a    b <NA> 
   3    1    1 
> # another way for including NA's
> table(x, useNA='ifany')
x
   a    b <NA> 
   3    1    1 
> table(x, y)
   y
x   c d
  a 1 1
  b 0 1
> table(x, y, useNA='ifany')
      y
x      c d <NA>
  a    1 1    1
  b    0 1    0
  <NA> 1 0    0

rosnfeld · 2016-01-27T13:01:32Z

I keep running into this. Hopefully I will have time to look at it soon, or it will get fixed as a result of some of the missing-value-handling issues that people are hoping to resolve with a change to pandas underlying data representations.

Here's a related issue (could perhaps be its own issue). Set up x and y as above, and then try and use crosstab with margins:

In [8]: pd.crosstab(x, y, margins=True)
Out[8]: 
col_0  c  d  All
row_0           
a      1  1    3
b      0  1    1
All    2  2    5

In [9]: # huh that's weird, the columns don't sum

In [10]: pd.crosstab(x, y, dropna=False, margins=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
   4417                 raise ValueError('The name %s occurs multiple times, use a '
-> 4418                                  'level number' % level)
   4419             level = self.names.index(level)

ValueError: The name None occurs multiple times, use a level number

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-b11080c1dcf8> in <module>()
----> 1 pd.crosstab(x, y, dropna=False, margins=True)

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, dropna)
    426         df['__dummy__'] = 0
    427         table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
--> 428                                aggfunc=len, margins=margins, dropna=dropna)
    429         return table.fillna(0).astype(np.int64)
    430     else:

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
    145         table = _add_margins(table, data, values, rows=index,
    146                              cols=columns, aggfunc=aggfunc,
--> 147                              margins_name=margins_name)
    148 
    149     # discard the top level

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in _add_margins(table, data, values, rows, cols, aggfunc, margins_name)
    175     if hasattr(table, 'columns'):
    176         for level in table.columns.names[1:]:
--> 177             if margins_name in table.columns.get_level_values(level):
    178                 raise ValueError(exception_msg)
    179 

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in get_level_values(self, level)
   4565         values : ndarray
   4566         """
-> 4567         num = self._get_level_number(level)
   4568         unique = self.levels[num]  # .values
   4569         labels = self.labels[num]

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
   4420         except ValueError:
   4421             if not isinstance(level, int):
-> 4422                 raise KeyError('Level %s not found' % str(level))
   4423             elif level < 0:
   4424                 level += self.nlevels

KeyError: 'Level None not found'

nickeubank · 2016-04-03T23:46:07Z

Agreed. @rosnfeld, if I follow right, you think the behavior should be that:

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan], 'b': [3, 3, 4]})
pd.crosstab(df.a, df.b, dropna=False)

returns:

      3  4 
a        
NaN   2  1

Not an empty dataframe?

rosnfeld · 2016-04-04T08:15:17Z

Correct. That's similar to what R would do with useNA='ifany':

> a <- c(NA, NA, NA)
> b <- c(3, 3, 4)
> table(a, b)
< table of extent 0 x 2 >
> table(a, b, useNA='ifany')
      b
a      3 4
  <NA> 2 1

nickeubank · 2016-04-04T14:51:53Z

It's also what Stata does if one uses the option ", missing"
On Mon, Apr 4, 2016 at 1:15 AM rosnfeld [email protected] wrote:

Correct. That's similar to what R would do with useNA='ifany':

a <- c(NA, NA, NA)
b <- c(3, 3, 4)
table(a, b)
< table of extent 0 x 2 >
table(a, b, useNA='ifany')
b
a 3 4
2 1

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#10772 (comment)

XikunHan · 2018-01-11T05:27:29Z

dropna : boolean, default True
Do not include columns whose entries are all NaN

Only works when the whole column is NaN, but if some values are NaN, cannot count them

oriolmirosa · 2018-02-10T01:03:55Z

This also strikes me as odd and inconsistent, and I'd love to see a fix. Meanwhile, the way I deal with it is by adding .fillna(' NaN) to the variables called in pd.crosstab(). For instance, in the example above:

pd.crosstab(df.a.fillna(' NaN), df.b.fillna(' NaN), dropna=False)

Of course, you could omit the .fillna(' NaN') from the column with no null values. I add a space before NaN so that the NaN row and/or column appear first. Otherwise, they would be ordered alphabetically and probably be in the middle of the rows/columns in the output (I guess you could prepend it by a z to put it as the last row/column, but that looks a little silly).

ceilingcat · 2018-03-26T20:18:14Z

I also would like to see this fixed.

TomAugspurger · 2018-03-27T01:49:49Z

@ceilingcat let us know if you need help starting a PR.

dsaxton · 2018-10-24T19:00:49Z

I agree this would make a nice improvement, especially since the fillna fix doesn't really work for categorical columns since the fill value needs to be an available level.

pauljohn32 · 2019-04-09T20:59:12Z

Do I have just another example of same problem? Pandas 0.24

I converted some elements in x to missings as `y``,

In [253]: x
Out[253]: 
0     1
1     2
2     3
3     4
4     5
5    55
6    22
7    44
8    12
9    55
dtype: int64

In [254]: y1
Out[254]: 
0     1.0
1     2.0
2     NaN
3     4.0
4     5.0
5    55.0
6     NaN
7    44.0
8     NaN
9    55.0
dtype: float64

In [255]: y1.value_counts(dropna=False)
Out[255]: 
NaN     3
55.0    2
44.0    1
5.0     1
4.0     1
2.0     1
1.0     1
dtype: int64

In [256]: pd.crosstab(x, y1, dropna=False)
Out[258]: 
col_0  1.0   2.0   4.0   5.0   44.0  55.0
row_0                                    
1         1     0     0     0     0     0
2         0     1     0     0     0     0
4         0     0     1     0     0     0
5         0     0     0     1     0     0
44        0     0     0     0     1     0
55        0     0     0     0     0     2

I expect to see a column labeled "NaN" with values

        1.0   2.0   4.0  5.0  44.0  55.0  NaN
1         1     0     0     0     0     0          0
3         0     0     0     0     0     0          1
2         0     1     0     0     0     0          0
4         0     0     1     0     0     0          0
5         0     0     0     1     0     0          0
12       0     0     0     0     0     0          1
22       0     0     0     0     0     0          1
44       0     0     0     0     1     0          0
55       0     0     0     0     0     2          0

kflamm · 2019-05-20T20:52:10Z

The current (0.24) Pandas documentation should say dropna: "Do not include columns OR ROWS whose entries are all NaN", because that is what the current behavior actually seems to be: when rows/columns are entirely empty, rows/columns are dropped with default dropna = True.

With dropna=False, empty rows show up as 0 value entries in columns, and empty columns show up as 0 values in rows. Previous example shows that NaN is apparently not considered to be a possible row/column value with dropna=False.

It seems reasonable to suggest that NaN be added to the set of possible values for both rows and columns in the dataframe created when crosstab has dropna=False set by a user. That would replicate behavior in other statistical software (e.g., Stata when tab used with "missing" option). More importantly, using this option necessarily means you are interested in understanding the joint distribution of NaNs in the two series.

Current behavior of crosstab('y_variable','x_variable',dropna=False) appears to be equivalent to groupby('y_variable').agg({'x_variable':'count'}) with a binary dummy x variable.

To preserve this equivalence, it may be worthwhile to instead consider a switch adding NaN to the sets of tabulated values in a crosstab rather than changing behavior with dropna=False.

russellpierce · 2019-08-07T17:04:11Z

Is there a reasonably clean way to get the desired output within the context of Pandas as it is today?

buhtz · 2022-05-18T14:03:48Z

Is there any planing or roadmap involving this Issue?

MarcoGorelli · 2022-05-18T22:52:05Z

@Codeberg-AsGithubAlternative-buhtz it doesn't look like anyone is currently working on this, but you're welcome to take it on if you'd like to - check the contributing guide for how to get started

I'm inclined to agree that it'd be an improvement to make this consistent with the dropna in value_counts

buhtz · 2022-05-19T06:51:13Z

@MarcoGorelli Please stop blaming others to not contributing. On my side this feels inpolite.

I haven't asked about who is working on this or when is it done. The Issue comments indicating that currently there is no decision on site of the maintainers about how a solution could look like. Because of that I asked for planing and roadmap. But I understand that someone could understand that terms in the context of timeline. But I meant it more in a context of design decisions.

MarcoGorelli · 2022-05-19T08:04:48Z

Hi @Codeberg-AsGithubAlternative-buhtz

I didn't blame you, it just sounded like you were interested in working on this

Are you asking about whether there's agreement that the expected output from the original post is the one to aim for?

If so, then I agree that it would be an improvement. Seems like @dsaxton agrees too, and @TomAugspurger has welcomed a PR above.
I presume @jreback agrees too, as the issue hasn't been closed and he's added labels to it

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Aug 11, 2015

jreback added this to the Next Major Release milestone Aug 11, 2015

jreback mentioned this issue Feb 12, 2016

crosstab is not respecting filtered scope when dtype=category #12298

Closed

nickeubank mentioned this issue Apr 3, 2016

Add normalization to crosstab #12578

Closed

jreback added Difficulty Intermediate labels Apr 4, 2016

sinhrks mentioned this issue Jul 4, 2016

Unexpected behavior -- crosstab with dropna=False #13556

Closed

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

mroeschke added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug and removed API Design labels Apr 18, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

mroeschke mentioned this issue May 12, 2023

BUG: crosstab(dropna=False) did not keep np.nan in result #53205

Merged

5 tasks

phofl closed this as completed in #53205 May 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

rosnfeld commented Aug 8, 2015

jreback commented Aug 11, 2015

rosnfeld commented Aug 15, 2015

rosnfeld commented Aug 16, 2015

rosnfeld commented Jan 27, 2016

nickeubank commented Apr 3, 2016

rosnfeld commented Apr 4, 2016

nickeubank commented Apr 4, 2016

XikunHan commented Jan 11, 2018

oriolmirosa commented Feb 10, 2018

ceilingcat commented Mar 26, 2018

TomAugspurger commented Mar 27, 2018

dsaxton commented Oct 24, 2018

pauljohn32 commented Apr 9, 2019 •

edited

Loading

kflamm commented May 20, 2019 •

edited

Loading

russellpierce commented Aug 7, 2019 •

edited

Loading

buhtz commented May 18, 2022

MarcoGorelli commented May 18, 2022

buhtz commented May 19, 2022

MarcoGorelli commented May 19, 2022

BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772

Comments

rosnfeld commented Aug 8, 2015

jreback commented Aug 11, 2015

rosnfeld commented Aug 15, 2015

rosnfeld commented Aug 16, 2015

rosnfeld commented Jan 27, 2016

nickeubank commented Apr 3, 2016

rosnfeld commented Apr 4, 2016

nickeubank commented Apr 4, 2016

XikunHan commented Jan 11, 2018

oriolmirosa commented Feb 10, 2018

ceilingcat commented Mar 26, 2018

TomAugspurger commented Mar 27, 2018

dsaxton commented Oct 24, 2018

pauljohn32 commented Apr 9, 2019 • edited Loading

kflamm commented May 20, 2019 • edited Loading

russellpierce commented Aug 7, 2019 • edited Loading

buhtz commented May 18, 2022

MarcoGorelli commented May 18, 2022

buhtz commented May 19, 2022

MarcoGorelli commented May 19, 2022

pauljohn32 commented Apr 9, 2019 •

edited

Loading

kflamm commented May 20, 2019 •

edited

Loading

russellpierce commented Aug 7, 2019 •

edited

Loading