Skip to content

BUG: Crosstab bug in #18321 #19326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -509,6 +509,7 @@ Reshaping
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
- Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`)
- Bug in :func:`crosstab` where the added column is removed incorrectly (:issue:`18321`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this more informative. from a user perspective the bug was that you had an incorrectly named output index when you were crosstabbing 2 series with tuples names.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted.

-

Numeric
Expand Down
8 changes: 7 additions & 1 deletion pandas/core/reshape/pivot.py
Original file line number Diff line number Diff line change
Expand Up @@ -455,17 +455,23 @@ def crosstab(index, columns, values=None, rownames=None, colnames=None,

from pandas import DataFrame
df = DataFrame(data, index=common_idx)
common_cols_idx = df.columns

if values is None:
df['__dummy__'] = 0
kwargs = {'aggfunc': len, 'fill_value': 0}
else:
df['__dummy__'] = values
kwargs = {'aggfunc': aggfunc}

table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
table = df.pivot_table(['__dummy__'], index=rownames, columns=colnames,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing array so that pivot_table won't try to remove the extra dummy layers

margins=margins, margins_name=margins_name,
dropna=dropna, **kwargs)

if not table.empty:
added_cols_idx = list(df.columns.difference(common_cols_idx))[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here on why you are removing this column

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted.

table = table[added_cols_idx]

# Post-process
if normalize is not False:
table = _normalize(table, normalize=normalize, margins=margins,
Expand Down
31 changes: 30 additions & 1 deletion pandas/tests/reshape/test_pivot.py
Original file line number Diff line number Diff line change
Expand Up @@ -1628,7 +1628,8 @@ def test_crosstab_dup_index_names(self):
pytest.raises(ValueError, pd.crosstab, s, s)

@pytest.mark.parametrize("names", [['a', ('b', 'c')],
[('a', 'b'), 'c']])
[('a', 'b'), 'c'],
[('a', 'b'), ('c', 'd')]])
def test_crosstab_tuple_name(self, names):
s1 = pd.Series(range(3), name=names[0])
s2 = pd.Series(range(1, 4), name=names[1])
Expand All @@ -1638,3 +1639,31 @@ def test_crosstab_tuple_name(self, names):

result = pd.crosstab(s1, s2)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("names", [['a', 'b'],
[('a', 'b'), 'c'],
[('a', 'b'), ('c', 'd')],
[(1, 2, 3), ('a', 'b', 'c')]])
def test_crosstab_cols_output(self, names):
rows = [[1, 2, 3, 4], [1, 1, 2, 2], [1, 3, 1, 4]]
cols = [[1, 1, 1, 1], [3, 2, 2, 3], []]

expected_ct1 = pd.DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the expected IN the parameterize, e.g.

@pytest.mark.parametrize(
"names, expected",
[
([['a', 'b'], pd.DataFrame(......)),
(.....)
(.....)
])

Copy link
Author

@ghost ghost Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback This might not be possible as for each "name", there are 3 different expected-s. The only parameter here is "name". Am I missing anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make that a parameter as well if needed
this is just a convenient way of writing out the cases

Copy link
Author

@ghost ghost Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll do something like parametrize( "name, expected1, expected2, expected3", [...])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

""name,input,expected", [(....), (...)]

Copy link
Author

@ghost ghost Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm. Ok. Thanks for the pointer. I'm still new at pytest. Will do that then.

[1, 1, 1, 1],
index=pd.Index([1, 2, 3, 4], name=names[0]),
columns=pd.Index([1], name=names[1])
)
expected_ct2 = pd.DataFrame(
[[1, 1], [1, 1]],
index=pd.Index([1, 2], name=names[0]),
columns=pd.Index([2, 3], name=names[1])
)
expected_ct3 = pd.DataFrame([])
expected_arr = [expected_ct1, expected_ct2, expected_ct3]

for row, col, expected_data in zip(rows, cols, expected_arr):
s1 = pd.Series(row, name=names[0])
s2 = pd.Series(col, name=names[1])
result = pd.crosstab(s1, s2)
tm.assert_frame_equal(result, expected_data,
check_column_type=True)