Skip to content

Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only #45994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

randolf-scholz
Copy link
Contributor

There was already another proposed fix #28540 which never concluded. I propose a more straightforward solution to simply deprecate the current behaviour, for the following reasons:

If one calls df.pivot_table(index=df.index.names, columns=..., values=..., dropna=False), it currently creates a table with product(df.index.names) many rows. However, the documentation specifies for the input index:

If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values

Which of course is a bit self-contradictory, but the gist is: if len(index)=len(values), then use index as the new index, else, use the columns looked up via df[index]. This documented behaviour is violated by the current, unlogical behaviour when dropna=False

@pep8speaks
Copy link

pep8speaks commented Feb 15, 2022

Hello @randolf-scholz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-02-15 02:23:56 UTC

@jreback jreback changed the title fix #18030 Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only Feb 15, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suspect we have tests which hit this

@randolf-scholz
Copy link
Contributor Author

@jreback I think one could just drop the tests if all they do is check for the behaviour that I think should be deprecated. I had a look at the first failure. The base table is:

amount customer month product quantity
0 60000 A 201307 a 2000000
1 100000 A 201309 b 500000
2 50000 B 201308 c 1000000
3 30000 C 201310 d 1000000

and it computes pv_col = df.pivot_table("quantity", "month", ["customer", "product"], dropna=False)

customer A B C
product a b c d a b c d a b c d
month
201307 2000000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
201308 NaN NaN NaN NaN NaN NaN 1000000.0 NaN NaN NaN NaN NaN
201309 NaN 500000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
201310 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1000000.0

and pv_ind = df.pivot_table("quantity", ["customer", "product"], "month", dropna=False)

month 201307 201308 201309 201310
customer product
A a 2000000.0 NaN NaN NaN
b NaN NaN 500000.0 NaN
c NaN NaN NaN NaN
d NaN NaN NaN NaN
B a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN 1000000.0 NaN NaN
d NaN NaN NaN NaN
C a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN
d NaN NaN NaN 1000000.0

and then just checks whether pv_col.columns and pv_ind.index is equal to the Cartesian product.

I would argue that actually neither behaviour is documented nor logical. In pv_col, the MultiIndex tuples (C,a), (C,b) and (C,c) appear, which are not present in the original data and might not even make semantic sense.

This behaviour should get deprecated as well; the reason I restored it in my 2nd commit is because completely removing it causes that too much gets deleted

@randolf-scholz
Copy link
Contributor Author

randolf-scholz commented Feb 15, 2022

@jreback If a user really wants the full Cartesian product, I think it would make more sense to offer a routine that extends an existing MultiIndex to the full Cartesian product of the levels, if such a thing does not exist already.

Again, because one cannot assume a-priori that the full Cartesian product of the level values makes semantic sense, the pivot table should operate conservatively and not add these. This also makes it more efficient. I had a case where the current behaviour bloated the size and runtime of the pivot table by a factor of 10.

@randolf-scholz
Copy link
Contributor Author

@jreback So the actual pivot tables that should be generated in the example are:

customer A B C
product a b c d
month
201307 2000000.0 NaN NaN NaN
201308 NaN NaN 1000000.0 NaN
201309 NaN 500000.0 NaN NaN
201310 NaN NaN NaN 1000000.0

and

month 201307 201308 201309 201310
customer product
A a 2000000.0 NaN NaN NaN
b NaN NaN 500000.0 NaN
B c NaN 1000000.0 NaN NaN
C d NaN NaN NaN 1000000.0

@randolf-scholz
Copy link
Contributor Author

randolf-scholz commented Feb 15, 2022

Also, as a side note for a potential future issue: it should be noted that the pivot operation converts the integers to float here. It would make more sense to keep the data type of the values columns and use pandas nullable data types to add the missing values instead.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2022

this is breaking a lot of tests

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 30, 2022
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge the main branch, address the review comments, and we can reopen.

@mroeschke mroeschke closed this Apr 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants