-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only #45994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @randolf-scholz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-02-15 02:23:56 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i suspect we have tests which hit this
@jreback I think one could just drop the tests if all they do is check for the behaviour that I think should be deprecated. I had a look at the first failure. The base table is:
and it computes
and
and then just checks whether I would argue that actually neither behaviour is documented nor logical. In This behaviour should get deprecated as well; the reason I restored it in my 2nd commit is because completely removing it causes that too much gets deleted |
@jreback If a user really wants the full Cartesian product, I think it would make more sense to offer a routine that extends an existing Again, because one cannot assume a-priori that the full Cartesian product of the level values makes semantic sense, the pivot table should operate conservatively and not add these. This also makes it more efficient. I had a case where the current behaviour bloated the size and runtime of the pivot table by a factor of 10. |
@jreback So the actual pivot tables that should be generated in the example are:
and
|
Also, as a side note for a potential future issue: it should be noted that the pivot operation converts the integers to float here. It would make more sense to keep the data type of the |
this is breaking a lot of tests |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge the main branch, address the review comments, and we can reopen. |
There was already another proposed fix #28540 which never concluded. I propose a more straightforward solution to simply deprecate the current behaviour, for the following reasons:
pivot_table
says:pivot_table
should return more rows than what is given by the index input, ifindex
,columns
andvalues
are all strict, pairwise disjoint subsets of the original tables columns+index-columns.If one calls
df.pivot_table(index=df.index.names, columns=..., values=..., dropna=False)
, it currently creates a table withproduct(df.index.names)
many rows. However, the documentation specifies for the inputindex
:Which of course is a bit self-contradictory, but the gist is: if
len(index)=len(values)
, then useindex
as the new index, else, use the columns looked up viadf[index]
. This documented behaviour is violated by the current, unlogical behaviour whendropna=False