Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only #45994

randolf-scholz · 2022-02-15T02:12:19Z

closes Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only #18030

An earlier proposed fix, #28540, was never concluded. I propose a more straightforward solution: simply deprecate the current behaviour, for the following reasons:

The current behaviour was never documented. The pivot_table documentation says:
- dropna: bool, default True: Do not include columns whose entries are all NaN.
- However, the behaviour to be deprecated is about adding rows.
In the discussion of Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only #18030, no one could explain the purpose of the Cartesian product, and it is illogical: there is no case where pivot_table should return more rows than the index input provides, as long as index, columns and values are all strict, pairwise disjoint subsets of the original table's columns and index columns.

If one calls df.pivot_table(index=df.index.names, columns=..., values=..., dropna=False), it currently creates a table with as many rows as the Cartesian product of the index levels. However, the documentation specifies for the input index:

If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values

This is of course a bit self-contradictory, but the gist is: if len(index)=len(values), then use index as the new index; otherwise, use the columns looked up via df[index]. The current, illogical behaviour when dropna=False violates this documented behaviour.

pep8speaks · 2022-02-15T02:12:21Z

Hello @randolf-scholz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-02-15 02:23:56 UTC

jreback

i suspect we have tests which hit this

randolf-scholz · 2022-02-15T03:17:23Z

@jreback I think one could just drop the tests if all they do is check for the behaviour that I think should be deprecated. I had a look at the first failure. The base table is:

	amount	customer	month	product	quantity
0	60000	A	201307	a	2000000
1	100000	A	201309	b	500000
2	50000	B	201308	c	1000000
3	30000	C	201310	d	1000000

and it computes pv_col = df.pivot_table("quantity", "month", ["customer", "product"], dropna=False)

customer	A				B				C
product	a	b	c	d	a	b	c	d	a	b	c	d
month
201307	2000000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
201308	NaN	NaN	NaN	NaN	NaN	NaN	1000000.0	NaN	NaN	NaN	NaN	NaN
201309	NaN	500000.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
201310	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000000.0

and pv_ind = df.pivot_table("quantity", ["customer", "product"], "month", dropna=False)

	month	201307	201308	201309	201310
customer	product
A	a	2000000.0	NaN	NaN	NaN
	b	NaN	NaN	500000.0	NaN
	c	NaN	NaN	NaN	NaN
	d	NaN	NaN	NaN	NaN
B	a	NaN	NaN	NaN	NaN
	b	NaN	NaN	NaN	NaN
	c	NaN	1000000.0	NaN	NaN
	d	NaN	NaN	NaN	NaN
C	a	NaN	NaN	NaN	NaN
	b	NaN	NaN	NaN	NaN
	c	NaN	NaN	NaN	NaN
	d	NaN	NaN	NaN	1000000.0

and then just checks whether pv_col.columns and pv_ind.index are equal to the Cartesian product.

I would argue that actually neither behaviour is documented nor logical. In pv_col, the MultiIndex tuples (C,a), (C,b) and (C,c) appear, which are not present in the original data and might not even make semantic sense.
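To make the point concrete, here is a minimal sketch that rebuilds the base table from the comment above and compares the (customer, product) pairs that actually occur in the data with the full Cartesian product of the level values. The variable names `observed`, `full` and `extra` are illustrative only. The exact number of columns in `pv_col` depends on the pandas version, so the sketch only reports it rather than assuming it.

```python
import itertools

import pandas as pd

# Base table from the failing test, as quoted in the comment above.
df = pd.DataFrame(
    {
        "amount": [60000, 100000, 50000, 30000],
        "customer": ["A", "A", "B", "C"],
        "month": [201307, 201309, 201308, 201310],
        "product": ["a", "b", "c", "d"],
        "quantity": [2000000, 500000, 1000000, 1000000],
    }
)

# Pairs that really occur in the data: 4 of them.
observed = set(zip(df["customer"], df["product"]))

# Full Cartesian product of the level values: 3 customers x 4 products = 12.
full = set(itertools.product(df["customer"].unique(), df["product"].unique()))

pv_col = df.pivot_table(
    values="quantity",
    index="month",
    columns=["customer", "product"],
    dropna=False,
)

# Column tuples in the pivot table that never appear in the original data.
extra = sorted(set(pv_col.columns) - observed)

print(len(observed), len(full), len(pv_col.columns))
print(extra)
```

On a pandas version that still has the behaviour discussed in this thread, `extra` lists pairs such as ("C", "a") that have no counterpart in `df`.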

This behaviour should be deprecated as well. I restored it in my second commit only because removing it completely causes too much to be deleted.

randolf-scholz · 2022-02-15T03:23:22Z

@jreback If a user really wants the full Cartesian product, I think it would make more sense to offer a routine that extends an existing MultiIndex to the full Cartesian product of the levels, if such a thing does not exist already.

Again, because one cannot assume a priori that the full Cartesian product of the level values makes semantic sense, the pivot table should operate conservatively and not add them. This also makes it more efficient. I had a case where the current behaviour bloated the size and runtime of the pivot table by a factor of 10.
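A rough sketch of what such an opt-in routine could look like, built only from existing pandas calls (`MultiIndex.from_product` and `DataFrame.reindex`). The helper name `expand_to_product` is hypothetical and not part of pandas; the example starts from the default `dropna=True` pivot, which for this data contains only the observed combinations.

```python
import pandas as pd


def expand_to_product(index: pd.MultiIndex) -> pd.MultiIndex:
    """Hypothetical helper: Cartesian product of the values seen in each level."""
    uniques = [index.get_level_values(i).unique() for i in range(index.nlevels)]
    return pd.MultiIndex.from_product(uniques, names=index.names)


df = pd.DataFrame(
    {
        "amount": [60000, 100000, 50000, 30000],
        "customer": ["A", "A", "B", "C"],
        "month": [201307, 201309, 201308, 201310],
        "product": ["a", "b", "c", "d"],
        "quantity": [2000000, 500000, 1000000, 1000000],
    }
)

# Default dropna=True: only the (customer, product) pairs present in the data.
compact = df.pivot_table(
    values="quantity", index=["customer", "product"], columns="month"
)

# Opt in to the full product explicitly; missing rows are filled with NaN.
expanded = compact.reindex(expand_to_product(compact.index))

print(compact.shape, expanded.shape)
```

Keeping the expansion as a separate, explicit step means users who need every combination can still get it, while the pivot itself stays proportional to the data.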

randolf-scholz · 2022-02-15T03:29:08Z

@jreback So the actual pivot tables that should be generated in the example are:

customer	A		B	C
product	a	b	c	d
month
201307	2000000.0	NaN	NaN	NaN
201308	NaN	NaN	1000000.0	NaN
201309	NaN	500000.0	NaN	NaN
201310	NaN	NaN	NaN	1000000.0

and

	month	201307	201308	201309	201310
customer	product
A	a	2000000.0	NaN	NaN	NaN
A	b	NaN	NaN	500000.0	NaN
B	c	NaN	1000000.0	NaN	NaN
C	d	NaN	NaN	NaN	1000000.0

randolf-scholz · 2022-02-15T03:31:44Z

Also, as a side note for a potential future issue: it should be noted that the pivot operation converts the integers to float here. It would make more sense to keep the data type of the values columns and use pandas nullable data types to add the missing values instead.
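Until something like that exists inside pivot_table, one possible workaround is to convert the result afterwards with `DataFrame.convert_dtypes`, which can turn float columns holding only whole numbers and NaN into the nullable `Int64` dtype. This is a sketch of a post-processing step, not a description of what pivot_table does internally.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "amount": [60000, 100000, 50000, 30000],
        "customer": ["A", "A", "B", "C"],
        "month": [201307, 201309, 201308, 201310],
        "product": ["a", "b", "c", "d"],
        "quantity": [2000000, 500000, 1000000, 1000000],
    }
)

# The integer "quantity" column comes back as float because of the NaN holes.
pivoted = df.pivot_table(
    values="quantity", index=["customer", "product"], columns="month"
)

# convert_dtypes picks nullable dtypes; missing cells become <NA>.
restored = pivoted.convert_dtypes()

print(pivoted.dtypes)
print(restored.dtypes)
```

The conversion only succeeds in recovering integers when every non-missing value is a whole number, so aggregations such as a mean over several rows may legitimately stay floating point.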

jreback · 2022-02-27T20:13:16Z

this is breaking a lot of tests

github-actions · 2022-03-30T00:04:52Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2022-04-24T02:59:51Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge the main branch, address the review comments, and we can reopen.

fix pandas-dev#18030

b63e452

jreback changed the title ~~fix #18030~~ Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only Feb 15, 2022

restored behaviour for COLUMNS

b016463

jreback requested changes Feb 15, 2022


github-actions bot added the Stale label Mar 30, 2022

mroeschke closed this Apr 24, 2022
