-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Crosstab bug in #18321 #19326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Crosstab bug in #18321 #19326
Conversation
pandas/core/reshape/pivot.py
Outdated
|
||
table = df.pivot_table('__dummy__', index=rownames, columns=colnames, | ||
table = df.pivot_table(['__dummy__'], index=rownames, columns=colnames, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing array so that pivot_table
won't try to remove the extra dummy layers
Codecov Report
@@ Coverage Diff @@
## master #19326 +/- ##
==========================================
+ Coverage 91.53% 91.54% +0.01%
==========================================
Files 150 150
Lines 48739 48709 -30
==========================================
- Hits 44612 44592 -20
+ Misses 4127 4117 -10
Continue to review full report at Codecov.
|
pandas/core/reshape/pivot.py
Outdated
margins=margins, margins_name=margins_name, | ||
dropna=dropna, **kwargs) | ||
|
||
if not table.empty: | ||
table = table[added_column] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you don't need all of the computation of the added_column to be above, just here.
also, just use the Index operations themselves.
df.columns.difference(common_columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. I'll look into this operation
pandas/tests/reshape/test_pivot.py
Outdated
@@ -1638,3 +1639,6 @@ def test_crosstab_tuple_name(self, names): | |||
|
|||
result = pd.crosstab(s1, s2) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
result_col_list = list(result.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like a more explicit test here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Will update.
Hello @bluetaxi! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 24, 2018 at 10:31 Hours UTC |
pandas/tests/reshape/test_pivot.py
Outdated
s1 = pd.Series(range(3), name=names[0]) | ||
s2 = pd.Series(range(1, 4), name=names[1]) | ||
result = pd.crosstab(s1, s2) | ||
result_col_list = list(result.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need full comparisons on the result, IOW construct the expected frame and compare. o need to just compare the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. To be updated.
pandas/core/reshape/pivot.py
Outdated
margins=margins, margins_name=margins_name, | ||
dropna=dropna, **kwargs) | ||
|
||
if not table.empty: | ||
added_cols_idx = df.columns.difference(common_cols_idx).values[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this guaranteed to be len 1?
use
added_cols_index = list(df.columns.difference(common_cols_idx))[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Guaranteed length 1 since __dummy__
is an added column not in original df (if block above the pivot_table call. Will update this line as required.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -509,6 +509,7 @@ Reshaping | |||
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`) | |||
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`) | |||
- Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`) | |||
- Bug in :func:`crosstab` where the added column is removed incorrectly (:issue:`18321`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this more informative. from a user perspective the bug was that you had an incorrectly named output index when you were crosstabbing 2 series with tuples names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted.
pandas/core/reshape/pivot.py
Outdated
margins=margins, margins_name=margins_name, | ||
dropna=dropna, **kwargs) | ||
|
||
if not table.empty: | ||
added_cols_idx = list(df.columns.difference(common_cols_idx))[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment here on why you are removing this column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted.
pandas/tests/reshape/test_pivot.py
Outdated
rows = [[1, 2, 3, 4], [1, 1, 2, 2], [1, 3, 1, 4]] | ||
cols = [[1, 1, 1, 1], [3, 2, 2, 3], []] | ||
|
||
expected_ct1 = pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the expected IN the parameterize, e.g.
@pytest.mark.parametrize(
"names, expected",
[
([['a', 'b'], pd.DataFrame(......)),
(.....)
(.....)
])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback This might not be possible as for each "name", there are 3 different expected-s. The only parameter here is "name". Am I missing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make that a parameter as well if needed
this is just a convenient way of writing out the cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'll do something like parametrize( "name, expected1, expected2, expected3", [...])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no
""name,input,expected", [(....), (...)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm. Ok. Thanks for the pointer. I'm still new at pytest. Will do that then.
@jreback please have a look at the new changes. appveyor fails to build, but it's intermittent error. If you could kick appveyor start again, it'll be good (the same code past last 3 times) |
@pytest.mark.parametrize("names, input_data, expected_data_out", [ | ||
(['a', 'b'], [[1, 2, 3], [1, 1, 1]], [1, 1, 1]), | ||
([('a', 'b'), 'c'], [[1, 2, 2], [1, 1, 1]], [1, 2]), | ||
([('a', 'b'), ('c', 'd')], [[1, 2, 3], [1, 2, 3]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u make input_data into 2, call them row_data and col_data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I was hoping for this to be consistent with "names", which is consistent with the test above this test. Is there a reason to split these? In general, as long as readability isn't improved, I think consistency is important.
row_series = pd.Series(input_data[0], name=names[0]) | ||
col_series = pd.Series(input_data[1], name=names[1]) | ||
expected_crosstab = pd.DataFrame( | ||
expected_data_out, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a naming convention to be comply to in pandas, apart from pep8? I don't see the benefit of changing from expected_crosstab to expected. What's the reason behind this required change?
expected_crosstab = pd.DataFrame( | ||
expected_data_out, | ||
index=pd.Index(set(input_data[0]), name=names[0]), | ||
columns=pd.Index(set(input_data[1]), name=names[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason u r using set here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using set because input_data contains duplicating entries. pd.Index([1,1,1,1]) != pd.Index([1])
) | ||
tm.assert_frame_equal( | ||
pd.crosstab(row_series, col_series), expected_crosstab, | ||
check_exact=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have another line
result = pd.crosstab....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it won't improve readability that much. Is there a reason you request explicit variable here?
git diff upstream/master -u -- "*.py" | flake8 --diff
Fixing bug in pd.crosstab(s1, s2) keeps dummy MultiIndex as columns if both s1 and s2 have tuple name #18321
The bug is because the function
df.pivot_table
tries to remove only the top level of the column in crosstab data frame. When the series names are tupples, the added column__dummy__
in functioncrosstab
has several levels, meaning removing only the top layer in the resulting crosstab as done in the currentpivot_table
function won't be enough. The fix keeps track of the exact added column, then remove the extra levels in resulting crosstab data frame it in the same function rather than delegate the removal todf.pivot_table
. Based on my investigation,df.pivot_table
removes the dummy layers after all the calculation, so moving the removal of the extra layers outside won't affect the current behavior. Also, by add the columns and removing the resulting extra layers in the crosstab in the same function, the logic is easier to read.