Skip to content

BUG: Crosstab bug in #18321 #19326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
Closed

BUG: Crosstab bug in #18321 #19326

wants to merge 15 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jan 20, 2018

  • closes pd.crosstab(s1, s2) keeps dummy MultiIndex as columns if both s1 and s2 have tuple name #18321
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
    Fixing bug in pd.crosstab(s1, s2) keeps dummy MultiIndex as columns if both s1 and s2 have tuple name #18321
    The bug is because the function df.pivot_table tries to remove only the top level of the column in crosstab data frame. When the series names are tupples, the added column __dummy__ in function crosstab has several levels, meaning removing only the top layer in the resulting crosstab as done in the current pivot_table function won't be enough. The fix keeps track of the exact added column, then remove the extra levels in resulting crosstab data frame it in the same function rather than delegate the removal to df.pivot_table. Based on my investigation, df.pivot_table removes the dummy layers after all the calculation, so moving the removal of the extra layers outside won't affect the current behavior. Also, by add the columns and removing the resulting extra layers in the crosstab in the same function, the logic is easier to read.


table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
table = df.pivot_table(['__dummy__'], index=rownames, columns=colnames,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing array so that pivot_table won't try to remove the extra dummy layers

@ghost ghost changed the title Crosstab bug BUG: Crosstab bug in #18321 Jan 20, 2018
@codecov
Copy link

codecov bot commented Jan 21, 2018

Codecov Report

Merging #19326 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19326      +/-   ##
==========================================
+ Coverage   91.53%   91.54%   +0.01%     
==========================================
  Files         150      150              
  Lines       48739    48709      -30     
==========================================
- Hits        44612    44592      -20     
+ Misses       4127     4117      -10
Flag Coverage Δ
#multiple 89.91% <100%> (+0.01%) ⬆️
#single 41.71% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/pivot.py 96.4% <100%> (+0.04%) ⬆️
pandas/plotting/_converter.py 65.22% <0%> (-1.74%) ⬇️
pandas/core/indexes/timedeltas.py 90.58% <0%> (-0.02%) ⬇️
pandas/core/sparse/series.py 95.32% <0%> (-0.02%) ⬇️
pandas/core/base.py 96.76% <0%> (-0.01%) ⬇️
pandas/core/indexing.py 92.62% <0%> (-0.01%) ⬇️
pandas/core/indexes/datetimelike.py 97.1% <0%> (-0.01%) ⬇️
pandas/core/arrays/categorical.py 95.77% <0%> (-0.01%) ⬇️
pandas/core/series.py 94.6% <0%> (-0.01%) ⬇️
pandas/core/panel.py 96.83% <0%> (-0.01%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1245f06...c0f1d29. Read the comment docs.

margins=margins, margins_name=margins_name,
dropna=dropna, **kwargs)

if not table.empty:
table = table[added_column]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you don't need all of the computation of the added_column to be above, just here.

also, just use the Index operations themselves.

df.columns.difference(common_columns)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. I'll look into this operation

@@ -1638,3 +1639,6 @@ def test_crosstab_tuple_name(self, names):

result = pd.crosstab(s1, s2)
tm.assert_frame_equal(result, expected)

result_col_list = list(result.columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like a more explicit test here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Will update.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 21, 2018
@pep8speaks
Copy link

pep8speaks commented Jan 22, 2018

Hello @bluetaxi! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 24, 2018 at 10:31 Hours UTC

s1 = pd.Series(range(3), name=names[0])
s2 = pd.Series(range(1, 4), name=names[1])
result = pd.crosstab(s1, s2)
result_col_list = list(result.columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need full comparisons on the result, IOW construct the expected frame and compare. o need to just compare the columns.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. To be updated.

margins=margins, margins_name=margins_name,
dropna=dropna, **kwargs)

if not table.empty:
added_cols_idx = df.columns.difference(common_cols_idx).values[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this guaranteed to be len 1?

use

added_cols_index = list(df.columns.difference(common_cols_idx))[0]

Copy link
Author

@ghost ghost Jan 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Guaranteed length 1 since __dummy__ is an added column not in original df (if block above the pivot_table call. Will update this line as required.

@@ -509,6 +509,7 @@ Reshaping
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
- Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`)
- Bug in :func:`crosstab` where the added column is removed incorrectly (:issue:`18321`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this more informative. from a user perspective the bug was that you had an incorrectly named output index when you were crosstabbing 2 series with tuples names.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted.

margins=margins, margins_name=margins_name,
dropna=dropna, **kwargs)

if not table.empty:
added_cols_idx = list(df.columns.difference(common_cols_idx))[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here on why you are removing this column

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted.

rows = [[1, 2, 3, 4], [1, 1, 2, 2], [1, 3, 1, 4]]
cols = [[1, 1, 1, 1], [3, 2, 2, 3], []]

expected_ct1 = pd.DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the expected IN the parameterize, e.g.

@pytest.mark.parametrize(
"names, expected",
[
([['a', 'b'], pd.DataFrame(......)),
(.....)
(.....)
])

Copy link
Author

@ghost ghost Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback This might not be possible as for each "name", there are 3 different expected-s. The only parameter here is "name". Am I missing anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make that a parameter as well if needed
this is just a convenient way of writing out the cases

Copy link
Author

@ghost ghost Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll do something like parametrize( "name, expected1, expected2, expected3", [...])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

""name,input,expected", [(....), (...)]

Copy link
Author

@ghost ghost Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm. Ok. Thanks for the pointer. I'm still new at pytest. Will do that then.

@ghost
Copy link
Author

ghost commented Jan 24, 2018

@jreback please have a look at the new changes. appveyor fails to build, but it's intermittent error. If you could kick appveyor start again, it'll be good (the same code past last 3 times)

@pytest.mark.parametrize("names, input_data, expected_data_out", [
(['a', 'b'], [[1, 2, 3], [1, 1, 1]], [1, 1, 1]),
([('a', 'b'), 'c'], [[1, 2, 2], [1, 1, 1]], [1, 2]),
([('a', 'b'), ('c', 'd')], [[1, 2, 3], [1, 2, 3]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u make input_data into 2, call them row_data and col_data)

Copy link
Author

@ghost ghost Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I was hoping for this to be consistent with "names", which is consistent with the test above this test. Is there a reason to split these? In general, as long as readability isn't improved, I think consistency is important.

row_series = pd.Series(input_data[0], name=names[0])
col_series = pd.Series(input_data[1], name=names[1])
expected_crosstab = pd.DataFrame(
expected_data_out,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this expected

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a naming convention to be comply to in pandas, apart from pep8? I don't see the benefit of changing from expected_crosstab to expected. What's the reason behind this required change?

expected_crosstab = pd.DataFrame(
expected_data_out,
index=pd.Index(set(input_data[0]), name=names[0]),
columns=pd.Index(set(input_data[1]), name=names[1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason u r using set here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using set because input_data contains duplicating entries. pd.Index([1,1,1,1]) != pd.Index([1])

)
tm.assert_frame_equal(
pd.crosstab(row_series, col_series), expected_crosstab,
check_exact=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have another line

result = pd.crosstab....

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it won't improve readability that much. Is there a reason you request explicit variable here?

@ghost ghost closed this Jan 24, 2018
@ghost ghost deleted the crosstab-bug branch January 24, 2018 12:15
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.crosstab(s1, s2) keeps dummy MultiIndex as columns if both s1 and s2 have tuple name
2 participants