-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: crosstab(dropna=False) does not work like value_counts(dropna=False) #10772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
can you show what you would expect as ouptput? |
@jreback I updated the original description to include the expected output. This further example (a slightly modified version of one from the crosstab docs) shows what In [69]: a = np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'])
In [70]: b = np.array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'one', 'one', 'one'])
In [71]: c = np.array(['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'])
In [72]: pd.crosstab(a, [b, c], dropna=True)
Out[72]:
col_0 one two
col_1 dull shiny dull
row_0
bar 1 2 1
foo 2 4 1
In [73]: pd.crosstab(a, [b, c], dropna=False)
Out[73]:
one two
dull shiny dull shiny
row_0
bar 1 2 1 0
foo 2 4 1 0 I think we need a different name for this behavior. Or perhaps the behavior I am looking for should have the new name, like |
My R is very rusty, but it does support this feature, just for reference of how another system does it. (I actually don't know how to do in R what pandas is currently doing with
|
I keep running into this. Hopefully I will have time to look at it soon, or it will get fixed as a result of some of the missing-value-handling issues that people are hoping to resolve with a change to pandas underlying data representations. Here's a related issue (could perhaps be its own issue). Set up x and y as above, and then try and use crosstab with margins: In [8]: pd.crosstab(x, y, margins=True)
Out[8]:
col_0 c d All
row_0
a 1 1 3
b 0 1 1
All 2 2 5
In [9]: # huh that's weird, the columns don't sum
In [10]: pd.crosstab(x, y, dropna=False, margins=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
4417 raise ValueError('The name %s occurs multiple times, use a '
-> 4418 'level number' % level)
4419 level = self.names.index(level)
ValueError: The name None occurs multiple times, use a level number
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-10-b11080c1dcf8> in <module>()
----> 1 pd.crosstab(x, y, dropna=False, margins=True)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, dropna)
426 df['__dummy__'] = 0
427 table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
--> 428 aggfunc=len, margins=margins, dropna=dropna)
429 return table.fillna(0).astype(np.int64)
430 else:
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
145 table = _add_margins(table, data, values, rows=index,
146 cols=columns, aggfunc=aggfunc,
--> 147 margins_name=margins_name)
148
149 # discard the top level
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in _add_margins(table, data, values, rows, cols, aggfunc, margins_name)
175 if hasattr(table, 'columns'):
176 for level in table.columns.names[1:]:
--> 177 if margins_name in table.columns.get_level_values(level):
178 raise ValueError(exception_msg)
179
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in get_level_values(self, level)
4565 values : ndarray
4566 """
-> 4567 num = self._get_level_number(level)
4568 unique = self.levels[num] # .values
4569 labels = self.labels[num]
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
4420 except ValueError:
4421 if not isinstance(level, int):
-> 4422 raise KeyError('Level %s not found' % str(level))
4423 elif level < 0:
4424 level += self.nlevels
KeyError: 'Level None not found' |
Agreed. @rosnfeld, if I follow right, you think the behavior should be that:
returns:
Not an empty dataframe? |
Correct. That's similar to what R would do with
|
It's also what Stata does if one uses the option ", missing"
|
dropna : boolean, default True Only works when the whole column is NaN, but if some values are NaN, cannot count them |
This also strikes me as odd and inconsistent, and I'd love to see a fix. Meanwhile, the way I deal with it is by adding
Of course, you could omit the |
I also would like to see this fixed. |
@ceilingcat let us know if you need help starting a PR. |
I agree this would make a nice improvement, especially since the |
Do I have just another example of same problem? Pandas 0.24 I converted some elements in
I expect to see a column labeled "NaN" with values
|
The current (0.24) Pandas documentation should say dropna: "Do not include columns OR ROWS whose entries are all NaN", because that is what the current behavior actually seems to be: when rows/columns are entirely empty, rows/columns are dropped with default dropna = True. With dropna=False, empty rows show up as 0 value entries in columns, and empty columns show up as 0 values in rows. Previous example shows that NaN is apparently not considered to be a possible row/column value with dropna=False. It seems reasonable to suggest that NaN be added to the set of possible values for both rows and columns in the dataframe created when crosstab has dropna=False set by a user. That would replicate behavior in other statistical software (e.g., Stata when tab used with "missing" option). More importantly, using this option necessarily means you are interested in understanding the joint distribution of NaNs in the two series. Current behavior of crosstab('y_variable','x_variable',dropna=False) appears to be equivalent to groupby('y_variable').agg({'x_variable':'count'}) with a binary dummy x variable. To preserve this equivalence, it may be worthwhile to instead consider a switch adding NaN to the sets of tabulated values in a crosstab rather than changing behavior with dropna=False. |
Is there a reasonably clean way to get the desired output within the context of Pandas as it is today? |
Is there any planing or roadmap involving this Issue? |
@Codeberg-AsGithubAlternative-buhtz it doesn't look like anyone is currently working on this, but you're welcome to take it on if you'd like to - check the contributing guide for how to get started I'm inclined to agree that it'd be an improvement to make this consistent with the dropna in value_counts |
@MarcoGorelli Please stop blaming others to not contributing. On my side this feels inpolite. I haven't asked about who is working on this or when is it done. The Issue comments indicating that currently there is no decision on site of the maintainers about how a solution could look like. Because of that I asked for planing and roadmap. But I understand that someone could understand that terms in the context of timeline. But I meant it more in a context of design decisions. |
Hi @Codeberg-AsGithubAlternative-buhtz I didn't blame you, it just sounded like you were interested in working on this Are you asking about whether there's agreement that the expected output from the original post is the one to aim for? If so, then I agree that it would be an improvement. Seems like @dsaxton agrees too, and @TomAugspurger has welcomed a PR above. |
If one uses
series.value_counts(dropna=False)
it includesNaN
values, butcrosstab(series1, series2, dropna=False)
does not includeNaN
values. I think ofcross_tab
kind of like 2-dimensionalvalue_counts()
, so this is surprising.I believe what
crosstab(..., dropna=False)
really does is just include rows/columns that would otherwise have all zeros, but that doesn't seem the same asdropna=False
to me. I would have expected to see a row and column entry forNaN
, something like:Thoughts on this?
This is on current master (almost 0.17).
The text was updated successfully, but these errors were encountered: