BUG: df.pivot_table fails when margins is True and only columns is defined #31088


Merged: 19 commits into pandas-dev:master on Jan 20, 2020

Conversation

@charlesdong1991 (Member) commented Jan 16, 2020

),
),
(
["A", "B", "C"],
Member:

Does this test anything different from the test preceding it? If not, I think it can be removed.

Member Author:

Thanks for your quick response! @WillAyd

Hmm, this one shows a different number of levels for columns; is that a bit redundant? If so, I will remove it.

Member:

I think it can be removed; the one preceding it already tests a MultiIndex.

Member Author:

OK, sure!

table_pieces.append(Series(margin[key], index=[all_key]))
# GH31016 this is to calculate the margin for each group, and assign
# the corresponding key as index
transformed_piece = DataFrame(piece.apply(aggfunc)).T
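
For context, a minimal sketch reproducing GH31016, the failure this new code path addresses. It is illustrative only: the column names are borrowed from the asv benchmark discussed later in the thread, and the data values are made up.

import pandas as pd

df = pd.DataFrame(
    {
        "key2": ["one", "one", "two", "two"],
        "key3": ["a", "b", "a", "b"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Before this PR, pivoting with only (multi-level) columns and margins=True
# raised a KeyError (the traceback later in the thread shows KeyError: 'one'
# for similar data). With the fix it should return a table that also
# includes an "All" margin column.
result = df.pivot_table(columns=["key2", "key3"], margins=True)
print(result)
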
Member:

Does this have any performance impact?

Member Author (@charlesdong1991, Jan 16, 2020):

It might; I was thinking about that as well.

How should I measure it? Do you mean create a giant mock dataset and time it? Any suggestions? I would like to test it out!

Member:

We have asvs in benchmarks/reshape.py that would be good to run here at least.

Member Author:

Thanks! @WillAyd I am not very familiar with the asv bench, but I see some tests whose names start with time_pivot_*; shall I add a new test in there?

Member:

Run the existing ones first:

cd asv_bench
asv continuous upstream/master HEAD -b reshape.PivotTable
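
As I understand asv, this command builds and benchmarks both upstream/master and the current HEAD, restricts the run to benchmarks whose names match the -b pattern (here reshape.PivotTable), and then reports whether any of them changed significantly between the two commits.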

Member Author:

[screenshot of asv output, 2020-01-16]

Just copied and pasted and tested a bit.

Member Author:

> Run the existing ones first:
>
> cd asv_bench
> asv continuous upstream/master HEAD -b reshape.PivotTable

Oops! Thanks for the tip!! I did not see it 😅 Will run.

Member Author:

Running asv on the current branch I get: BENCHMARKS NOT SIGNIFICANTLY CHANGED.

But I did find an issue during self-review; I will investigate and fix it tomorrow. Thanks for the help on asv @WillAyd

@WillAyd added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label on Jan 16, 2020
@charlesdong1991 (Member Author):

Hi @WillAyd, I think the PR is OK to review now. Please let me know how you plan to do the asv benchmark for this piece; shall I add a test in asv_bench for it? (As mentioned above, running asv locally on the current branch looks good.)
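
For illustration, the kind of case that could be added to the existing PivotTable class in asv_bench/benchmarks/reshape.py. The setup below is a simplified stand-in (the real class builds its own, larger self.df); the method name time_pivot_table_margins_only_column matches the one that appears in the asv output later in the thread.

import numpy as np
import pandas as pd


class PivotTable:
    def setup(self):
        # Simplified stand-in for the frame built by the real benchmark class.
        N = 100_000
        self.df = pd.DataFrame(
            {
                "key2": np.random.choice(["one", "two"], size=N),
                "key3": np.random.choice(["a", "b", "c"], size=N),
                "value1": np.random.randn(N),
                "value2": np.random.randn(N),
            }
        )

    def time_pivot_table_margins_only_column(self):
        # GH 31016: margins=True with only (multi-level) columns used to raise
        self.df.pivot_table(columns=["key2", "key3"], margins=True)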

@jreback added this to the 1.1 milestone on Jan 18, 2020
@jreback (Contributor) commented Jan 18, 2020:

Looks good, can you rebase? Adding an asv would be nice as well.

@pep8speaks commented Jan 18, 2020:

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-18 18:21:01 UTC

@charlesdong1991 (Member Author):

Hmm, running asv locally with the new test added, this manipulation does perform much slower than the other tests, and the second trial even failed. Any idea how to interpret why the second one failed? Is it because it is too slow and goes above some threshold? @jreback @WillAyd

I am pretty new to asv, and any feedback would be highly appreciated! Thanks!

[screenshot of asv output, 2020-01-18]

@WillAyd (Member) commented Jan 18, 2020:

@charlesdong1991 if you can post the actual shell output instead of a screenshot, that is typically better. That said, it doesn't look like things changed that much.

Not sure why the second run failed, but you can retry with the --show-stderr option.

@charlesdong1991 (Member Author) commented Jan 18, 2020:

[ 29.17%] ··· Running (reshape.PivotTable.time_pivot_table--)......
[ 50.00%] · For pandas commit c01d7142 <fix_issue_31016> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 54.17%] ··· reshape.PivotTable.time_pivot_table                                                                        25.8±0.5ms
[ 58.33%] ··· reshape.PivotTable.time_pivot_table_agg                                                                      49.5±2ms
[ 62.50%] ··· reshape.PivotTable.time_pivot_table_categorical                                                              19.5±5ms
[ 66.67%] ··· reshape.PivotTable.time_pivot_table_categorical_observed                                                   11.5±0.7ms
[ 70.83%] ··· reshape.PivotTable.time_pivot_table_margins                                                                  94.7±6ms
[ 75.00%] ··· reshape.PivotTable.time_pivot_table_margins_only_column                                                      271±40ms
[ 75.00%] · For pandas commit d170cc05 <multi_pivot^2> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
[ 79.17%] ··· reshape.PivotTable.time_pivot_table                                                                          26.3±2ms
[ 83.33%] ··· reshape.PivotTable.time_pivot_table_agg                                                                      52.3±2ms
[ 87.50%] ··· reshape.PivotTable.time_pivot_table_categorical                                                              20.7±3ms
[ 91.67%] ··· reshape.PivotTable.time_pivot_table_categorical_observed                                                   11.5±0.6ms
[ 95.83%] ··· reshape.PivotTable.time_pivot_table_margins                                                                  106±10ms
[100.00%] ··· reshape.PivotTable.time_pivot_table_margins_only_column                                                        failed

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Sorry, is this what you mean by posting the actual output? @WillAyd

Yeah, thanks mate! And it seems the first round is running the code in my current PR, and the second one is using the code on the master branch. The --show-stderr option does provide some insight: the reason is that this code does not work on current master, so it failed. See below for the result of a new run with this option.

[100.00%] ···· Traceback (most recent call last):
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 1434, in <module>
                   main()
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 1427, in main
                   commands[mode](args)
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 1166, in main_run
                   result = benchmark.do_run()
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 573, in do_run
                   return self.run(*self._current_params)
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 672, in run
                   min_run_count=self.min_run_count)
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 704, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/timeit.py", line 176, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/Users/cw1921/pandas-dev/asv_bench/benchmarks/reshape.py", line 165, in time_pivot_table_margins_only_column
                   self.df.pivot_table(columns=["key2", "key3"], margins=True)
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/pandas/core/frame.py", line 6104, in pivot_table
                   observed=observed,
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/pandas/core/reshape/pivot.py", line 167, in pivot_table
                   fill_value=fill_value,
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/pandas/core/reshape/pivot.py", line 237, in _add_margins
                   margins_name,
                 File "/Users/cw1921/pyvenv/pandas-dev/lib/python3.7/site-packages/pandas/core/reshape/pivot.py", line 353, in _generate_marginal_results
                   table_pieces.append(Series(margin[key], index=[all_key]))
               KeyError: 'one'
               asv: benchmark failed (exit status 1)

So I think we are good! @WillAyd @jreback Not sure why the performance even gets slightly better 😅 but at least it is not getting worse, haha.

@charlesdong1991 (Member Author):

Gentle ping @jreback @WillAyd, any follow-up? ^^

@jreback (Contributor) commented Jan 20, 2020:

lgtm. @WillAyd

@WillAyd merged commit f1aaf62 into pandas-dev:master on Jan 20, 2020
@WillAyd (Member) commented Jan 20, 2020:

Thanks @charlesdong1991

# GH31016 this is to calculate the margin for each group, and assign
# the corresponding key as index
transformed_piece = DataFrame(piece.apply(aggfunc)).T
transformed_piece.index = Index([all_key], name=piece.index.name)
Member:

@charlesdong1991 what if piece.index is a MultiIndex? .name here will be None. Do we need to use .names somewhere?

Member Author (@charlesdong1991, Jul 2, 2021):

You are right, it will be None if it is a MultiIndex. I need to have a deeper look to see whether it is reasonable to use .names, because if we use it, that is a FrozenList IIRC, and we could not assign it to index.name. Also, if there is a MultiIndex here, then maybe we also need to use MultiIndex instead of Index.
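
A rough, hypothetical sketch of the idea being discussed, just to make the .name versus .names distinction concrete. This is not what the merged PR does; the helper name and the assumption that all_key is a tuple with one entry per level are illustrative.

from pandas import DataFrame, Index, MultiIndex


def _margin_index(piece: DataFrame, all_key) -> Index:
    # Hypothetical helper: build the one-row index for the margin piece.
    if isinstance(piece.index, MultiIndex):
        # A MultiIndex keeps its per-level names in .names (a FrozenList),
        # so a one-row MultiIndex would preserve them; this assumes all_key
        # is a tuple with one entry per level.
        return MultiIndex.from_tuples([all_key], names=piece.index.names)
    # Plain Index: a single .name is enough, as in the current code.
    return Index([all_key], name=piece.index.name)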

Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects: None yet
Development: Successfully merging this pull request may close: BUG: pivot_table with multi-index columns only and margins=True gives wrong output or fails
6 participants