BUG: Crosstab bug in #18321 #19326

ghost · 2018-01-20T23:28:34Z

closes pd.crosstab(s1, s2) keeps dummy MultiIndex as columns if both s1 and s2 have tuple name #18321
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
Fixing bug in pd.crosstab(s1, s2) keeps dummy MultiIndex as columns if both s1 and s2 have tuple name #18321
The bug is because the function df.pivot_table tries to remove only the top level of the column in crosstab data frame. When the series names are tupples, the added column __dummy__ in function crosstab has several levels, meaning removing only the top layer in the resulting crosstab as done in the current pivot_table function won't be enough. The fix keeps track of the exact added column, then remove the extra levels in resulting crosstab data frame it in the same function rather than delegate the removal to df.pivot_table. Based on my investigation, df.pivot_table removes the dummy layers after all the calculation, so moving the removal of the extra layers outside won't affect the current behavior. Also, by add the columns and removing the resulting extra layers in the crosstab in the same function, the logic is easier to read.

ghost · 2018-01-20T23:29:26Z

pandas/core/reshape/pivot.py


-    table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
+    table = df.pivot_table(['__dummy__'], index=rownames, columns=colnames,


Passing array so that pivot_table won't try to remove the extra dummy layers

codecov · 2018-01-21T05:01:32Z

Codecov Report

Merging #19326 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19326      +/-   ##
==========================================
+ Coverage   91.53%   91.54%   +0.01%     
==========================================
  Files         150      150              
  Lines       48739    48709      -30     
==========================================
- Hits        44612    44592      -20     
+ Misses       4127     4117      -10

Flag	Coverage Δ
#multiple	`89.91% <100%> (+0.01%)`	⬆️
#single	`41.71% <0%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/pivot.py	`96.4% <100%> (+0.04%)`	⬆️
pandas/plotting/_converter.py	`65.22% <0%> (-1.74%)`	⬇️
pandas/core/indexes/timedeltas.py	`90.58% <0%> (-0.02%)`	⬇️
pandas/core/sparse/series.py	`95.32% <0%> (-0.02%)`	⬇️
pandas/core/base.py	`96.76% <0%> (-0.01%)`	⬇️
pandas/core/indexing.py	`92.62% <0%> (-0.01%)`	⬇️
pandas/core/indexes/datetimelike.py	`97.1% <0%> (-0.01%)`	⬇️
pandas/core/arrays/categorical.py	`95.77% <0%> (-0.01%)`	⬇️
pandas/core/series.py	`94.6% <0%> (-0.01%)`	⬇️
pandas/core/panel.py	`96.83% <0%> (-0.01%)`	⬇️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1245f06...c0f1d29. Read the comment docs.

jreback · 2018-01-21T15:37:31Z

pandas/core/reshape/pivot.py

                           margins=margins, margins_name=margins_name,
                           dropna=dropna, **kwargs)

+    if not table.empty:
+        table = table[added_column]


so you don't need all of the computation of the added_column to be above, just here.

also, just use the Index operations themselves.

df.columns.difference(common_columns)

Noted. I'll look into this operation

jreback · 2018-01-21T15:39:43Z

pandas/tests/reshape/test_pivot.py

@@ -1638,3 +1639,6 @@ def test_crosstab_tuple_name(self, names):

        result = pd.crosstab(s1, s2)
        tm.assert_frame_equal(result, expected)
+
+        result_col_list = list(result.columns)


I would like a more explicit test here

Thanks. Will update.

…uest

pep8speaks · 2018-01-22T10:14:40Z

Hello @bluetaxi! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 24, 2018 at 10:31 Hours UTC

jreback · 2018-01-22T11:43:32Z

pandas/tests/reshape/test_pivot.py

+        s1 = pd.Series(range(3), name=names[0])
+        s2 = pd.Series(range(1, 4), name=names[1])
+        result = pd.crosstab(s1, s2)
+        result_col_list = list(result.columns)


these need full comparisons on the result, IOW construct the expected frame and compare. o need to just compare the columns.

Noted. To be updated.

jreback · 2018-01-22T11:44:22Z

pandas/core/reshape/pivot.py

                           margins=margins, margins_name=margins_name,
                           dropna=dropna, **kwargs)

+    if not table.empty:
+        added_cols_idx = df.columns.difference(common_cols_idx).values[0]


is this guaranteed to be len 1?

use

added_cols_index = list(df.columns.difference(common_cols_idx))[0]

Yes. Guaranteed length 1 since __dummy__ is an added column not in original df (if block above the pivot_table call. Will update this line as required.

jreback · 2018-01-23T11:29:04Z

doc/source/whatsnew/v0.23.0.txt

@@ -509,6 +509,7 @@ Reshaping
 - Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
 - Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
 - Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`)
+- Bug in :func:`crosstab` where the added column is removed incorrectly (:issue:`18321`)


can you make this more informative. from a user perspective the bug was that you had an incorrectly named output index when you were crosstabbing 2 series with tuples names.

jreback · 2018-01-23T11:29:55Z

pandas/core/reshape/pivot.py

                           margins=margins, margins_name=margins_name,
                           dropna=dropna, **kwargs)

+    if not table.empty:
+        added_cols_idx = list(df.columns.difference(common_cols_idx))[0]


comment here on why you are removing this column

jreback · 2018-01-23T11:31:26Z

pandas/tests/reshape/test_pivot.py

+        rows = [[1, 2, 3, 4], [1, 1, 2, 2], [1, 3, 1, 4]]
+        cols = [[1, 1, 1, 1], [3, 2, 2, 3], []]
+
+        expected_ct1 = pd.DataFrame(


put the expected IN the parameterize, e.g.

@pytest.mark.parametrize( "names, expected", [ ([['a', 'b'], pd.DataFrame(......)), (.....) (.....) ])

@jreback This might not be possible as for each "name", there are 3 different expected-s. The only parameter here is "name". Am I missing anything?

make that a parameter as well if needed
this is just a convenient way of writing out the cases

Ok. I'll do something like parametrize( "name, expected1, expected2, expected3", [...])

no

""name,input,expected", [(....), (...)]

Uhm. Ok. Thanks for the pointer. I'm still new at pytest. Will do that then.

…ode logic

ghost · 2018-01-24T11:42:25Z

@jreback please have a look at the new changes. appveyor fails to build, but it's intermittent error. If you could kick appveyor start again, it'll be good (the same code past last 3 times)

jreback · 2018-01-24T11:45:20Z

pandas/tests/reshape/test_pivot.py

+    @pytest.mark.parametrize("names, input_data, expected_data_out", [
+        (['a', 'b'], [[1, 2, 3], [1, 1, 1]], [1, 1, 1]),
+        ([('a', 'b'), 'c'], [[1, 2, 2], [1, 1, 1]], [1, 2]),
+        ([('a', 'b'), ('c', 'd')], [[1, 2, 3], [1, 2, 3]],


can u make input_data into 2, call them row_data and col_data)

@jreback I was hoping for this to be consistent with "names", which is consistent with the test above this test. Is there a reason to split these? In general, as long as readability isn't improved, I think consistency is important.

jreback · 2018-01-24T11:45:32Z

pandas/tests/reshape/test_pivot.py

+        row_series = pd.Series(input_data[0], name=names[0])
+        col_series = pd.Series(input_data[1], name=names[1])
+        expected_crosstab = pd.DataFrame(
+            expected_data_out,


call this expected

Is there a naming convention to be comply to in pandas, apart from pep8? I don't see the benefit of changing from expected_crosstab to expected. What's the reason behind this required change?

jreback · 2018-01-24T11:45:55Z

pandas/tests/reshape/test_pivot.py

+        expected_crosstab = pd.DataFrame(
+            expected_data_out,
+            index=pd.Index(set(input_data[0]), name=names[0]),
+            columns=pd.Index(set(input_data[1]), name=names[1])


is there a reason u r using set here?

using set because input_data contains duplicating entries. pd.Index([1,1,1,1]) != pd.Index([1])

jreback · 2018-01-24T11:46:18Z

pandas/tests/reshape/test_pivot.py

+        )
+        tm.assert_frame_equal(
+            pd.crosstab(row_series, col_series), expected_crosstab,
+            check_exact=True


have another line

result = pd.crosstab....

it won't improve readability that much. Is there a reason you request explicit variable here?

blueTaxi added 4 commits January 21, 2018 08:58

Add test

6140d5a

Add logic to fix CI failure from new test

6bcb2d7

fix CI failure

c1e87e6

Added documentation for the bug fix

af90130

ghost commented Jan 20, 2018

View reviewed changes

ghost changed the title ~~Crosstab bug~~ BUG: Crosstab bug in #18321 Jan 20, 2018

jreback requested changes Jan 21, 2018

View reviewed changes

jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 21, 2018

Tuan and others added 3 commits January 22, 2018 21:01

added more test and fixed up code based on comments of first pull req…

8f4380f

…uest

Update tests

451792f

Update logic

30da7f4

fix pep8

0970999

jreback requested changes Jan 22, 2018

View reviewed changes

Tuan added 3 commits January 23, 2018 21:17

pull request 3

52072ae

merge from main and solve merge conflict

ba9006d

fix pep8

cc0b9ba

jreback requested changes Jan 23, 2018

View reviewed changes

Tuan added 4 commits January 24, 2018 21:04

pull request 4 with updated doc, parametrized tests and comments in c…

ae55326

…ode logic

fix pep8

a914b15

fix pep8 again

81dd6c2

fix test

c0f1d29

jreback requested changes Jan 24, 2018

View reviewed changes

ghost closed this Jan 24, 2018

ghost deleted the crosstab-bug branch January 24, 2018 12:15

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Crosstab bug in #18321 #19326

BUG: Crosstab bug in #18321 #19326

ghost commented Jan 20, 2018 •

edited by ghost

Loading

ghost Jan 20, 2018

codecov bot commented Jan 21, 2018 •

edited

Loading

jreback Jan 21, 2018

ghost Jan 21, 2018

jreback Jan 21, 2018

ghost Jan 21, 2018

pep8speaks commented Jan 22, 2018 •

edited

Loading

jreback Jan 22, 2018

ghost Jan 22, 2018

jreback Jan 22, 2018

ghost Jan 22, 2018 •

edited by ghost

Loading

jreback Jan 23, 2018

ghost Jan 23, 2018

jreback Jan 23, 2018

ghost Jan 23, 2018

jreback Jan 23, 2018

ghost Jan 23, 2018 •

edited by ghost

Loading

jreback Jan 23, 2018

ghost Jan 23, 2018 •

edited by ghost

Loading

jreback Jan 23, 2018

ghost Jan 23, 2018 •

edited by ghost

Loading

ghost commented Jan 24, 2018

jreback Jan 24, 2018

ghost Jan 24, 2018 •

edited by ghost

Loading

jreback Jan 24, 2018

ghost Jan 24, 2018

jreback Jan 24, 2018

ghost Jan 24, 2018

jreback Jan 24, 2018

ghost Jan 24, 2018


		table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
		table = df.pivot_table(['__dummy__'], index=rownames, columns=colnames,

BUG: Crosstab bug in #18321 #19326

BUG: Crosstab bug in #18321 #19326

Conversation

ghost commented Jan 20, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

codecov bot commented Jan 21, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pep8speaks commented Jan 22, 2018 • edited Loading

Comment last updated on January 24, 2018 at 10:31 Hours UTC

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jan 22, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jan 23, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jan 23, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jan 23, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

ghost commented Jan 24, 2018

Choose a reason for hiding this comment

ghost Jan 24, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost commented Jan 20, 2018 •

edited by ghost

Loading

codecov bot commented Jan 21, 2018 •

edited

Loading

pep8speaks commented Jan 22, 2018 •

edited

Loading

ghost Jan 22, 2018 •

edited by ghost

Loading

ghost Jan 23, 2018 •

edited by ghost

Loading

ghost Jan 23, 2018 •

edited by ghost

Loading

ghost Jan 23, 2018 •

edited by ghost

Loading

ghost Jan 24, 2018 •

edited by ghost

Loading