
Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only #18030

Open
jlandercy opened this issue Oct 30, 2017 · 13 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jlandercy

jlandercy commented Oct 30, 2017

Minimal Verifiable Working Example

Below is a Minimal Verifiable Working Example (MVWE) that reproduces the behaviour described in this issue:

import pandas as pd
# JSON dump for the MVWE:
txt = """[{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null}]"""
# Load Data:
df = pd.read_json(txt)
# Filling NaN with a string works as expected but downcasts the column types:
cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
# Pivoting with a MultiIndex while keeping all-NaN columns produces all combinations of modalities:
cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)

The trial input (df) looks like:

channelid measurandkey networkkey sitekey timestamp userfloatvalue
0 5069 Cu HMT 01MEU1 2018-01-01 NaN
1 5069 Cu HMT 01MEU1 2018-01-02 NaN
2 5069 Cu HMT 01MEU1 2018-01-03 NaN
3 5069 Cu HMT 01MEU1 2018-01-04 NaN
4 5119 Cu HMT 01AND3 2018-01-01 NaN
5 5119 Cu HMT 01AND3 2018-01-02 NaN
6 5119 Cu HMT 01AND3 2018-01-03 NaN
7 5119 Cu HMT 01AND3 2018-01-04 NaN
8 5120 Pb HMT 01MEU1 2018-01-01 NaN
9 5120 Pb HMT 01MEU1 2018-01-02 NaN
10 5120 Pb HMT 01MEU1 2018-01-03 NaN
11 5120 Pb HMT 01MEU1 2018-01-04 NaN
12 5233 Pb HMT 01AND3 2018-01-01 NaN
13 5233 Pb HMT 01AND3 2018-01-02 NaN
14 5233 Pb HMT 01AND3 2018-01-03 NaN
15 5233 Pb HMT 01AND3 2018-01-04 NaN

The misbehaving output (cross3) looks like:

channelid 5069 5119 5120 5233
networkkey HMT HMT HMT HMT
sitekey 01AND3 01MEU1 01AND3 01MEU1 01AND3 01MEU1 01AND3 01MEU1
measurandkey Cu Pb Cu Pb Cu Pb Cu Pb Cu Pb Cu Pb Cu Pb Cu Pb
timestamp
2018-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The expected output is similar to cross2 but with NaN values instead of strings:

channelid 5069 5119 5120 5233
networkkey HMT HMT HMT HMT
sitekey 01MEU1 01AND3 01MEU1 01AND3
measurandkey Cu Cu Pb Pb
timestamp
2018-01-01 nodata nodata nodata nodata
2018-01-02 nodata nodata nodata nodata
2018-01-03 nodata nodata nodata nodata
2018-01-04 nodata nodata nodata nodata

Problem description

I need:

  • to use a MultiIndex for the columns even if it is overdetermined (that is, the index would still be unique with fewer levels) and;
  • to keep columns full of NaN, because an all-NaN column means the channel lacks all its data.

The problem seems to be the creation of all combinations of level modalities (instead of keeping only the existing ones), which drastically and needlessly increases memory usage, since those combinations are not present in the original data.

Maybe it is a bug, maybe it is the designed behaviour. I just wanted to report it because it surprised me, and I am now looking for a clean way to work around this behaviour.

How I found it:

I first hit a MemoryError with small queries (about 1000 rows and 25 channels), then reduced the number of rows and columns, and finally dumped the data to JSON to produce the MVWE above.

Expected Output

To my understanding, the following command:

cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)

should return the same as:

cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
cross2bis = cross2.replace('nodata', float('nan'))

That is, a small DataFrame with no extra columns and NaN values not dropped; it should look like:

channelid 5069 5119 5120 5233
networkkey HMT HMT HMT HMT
sitekey 01MEU1 01AND3 01MEU1 01AND3
measurandkey Cu Cu Pb Pb
timestamp
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN

without generating combinations of level modalities that do not exist in the input data.
This would also prevent a MemoryError for reasonable amounts of data.
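The scale of the blow-up can be quantified with public MultiIndex constructors (a sketch of the effect on the MVWE's four channels, not pandas' internal code path):

```python
import pandas as pd

# The four (channelid, networkkey, sitekey, measurandkey) combinations
# actually present in the MVWE data:
observed = pd.MultiIndex.from_tuples(
    [(5069, "HMT", "01MEU1", "Cu"), (5119, "HMT", "01AND3", "Cu"),
     (5120, "HMT", "01MEU1", "Pb"), (5233, "HMT", "01AND3", "Pb")],
    names=["channelid", "networkkey", "sitekey", "measurandkey"],
)

# What the dropna=False branch reindexes to: the product of each
# level's distinct modalities, i.e. 4 * 1 * 2 * 2 = 16 combinations:
full = pd.MultiIndex.from_product(observed.levels, names=observed.names)

print(len(observed), len(full))   # 4 16
```

With 25 channels and more modalities per level, the product quickly dwarfs the observed combinations, which is consistent with the MemoryError reported above.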

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
@jlandercy
Author

jlandercy commented Oct 30, 2017

Diving into the pandas source code, I found why it behaves as described above.

In the DataFrame definition we find a method that binds to the pivot_table function:

pandas/pandas/core/frame.py

Lines 4430 to 4437 in 52fe6bc

def pivot_table(self, values=None, index=None, columns=None,
                aggfunc='mean', fill_value=None, margins=False,
                dropna=True, margins_name='All'):
    from pandas.core.reshape.pivot import pivot_table
    return pivot_table(self, values=values, index=index, columns=columns,
                       aggfunc=aggfunc, fill_value=fill_value,
                       margins=margins, dropna=dropna,
                       margins_name=margins_name)

In the function definition:

def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
                fill_value=None, margins=False, dropna=True,
                margins_name='All'):

We can search for the dropna switch and see that a Cartesian product of the level modalities is built:

if not dropna:
    from pandas import MultiIndex
    try:
        m = MultiIndex.from_arrays(cartesian_product(table.index.levels),
                                   names=table.index.names)
        table = table.reindex(m, axis=0)
    except AttributeError:
        pass  # it's a single level
    try:
        m = MultiIndex.from_arrays(cartesian_product(table.columns.levels),
                                   names=table.columns.names)
        table = table.reindex(m, axis=1)
    except AttributeError:
        pass  # it's a single level or a series

And at the end, all-NaN columns are purged:

# GH 15193 Make sure empty columns are removed if dropna=True
if isinstance(table, ABCDataFrame) and dropna:
    table = table.dropna(how='all', axis=1)
return table

After reading these portions of code, I can tell:

  • it is the intended behaviour;
  • with the dropna switch there is no way to keep NaN columns without increasing the cardinality of the MultiIndex, because of the Cartesian product part.

Questions:

  • why does this behaviour exist, and how is it used?
  • how can I perform my task without filling and then replacing?

Fix proposal:

What I actually need is to inhibit the final dropna without triggering the Cartesian product; then I guess it is done.

Maybe a finer condition (or an extra switch) in the Cartesian product part would do the trick.
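In the meantime, the proposed behaviour can be approximated with public API only: pivot with the default dropna=True, then reindex the columns to the combinations actually observed in the data. A minimal sketch with hypothetical data (column names are illustrative, not the MVWE frame):

```python
import pandas as pd

# Hypothetical data: channel 1 has values, channel 2 lacks all its data.
df = pd.DataFrame({
    "channelid": [1, 1, 2, 2],
    "sitekey": ["A", "A", "B", "B"],
    "measurandkey": ["Cu", "Cu", "Pb", "Pb"],
    "timestamp": [0, 1, 0, 1],
    "value": [1.0, 2.0, float("nan"), float("nan")],
})

cols = ["channelid", "sitekey", "measurandkey"]

# Key combinations that actually occur in the data, in first-seen order:
observed = pd.MultiIndex.from_frame(df[cols].drop_duplicates())

# The default dropna=True drops the all-NaN channel-2 column; restoring
# the observed combinations brings it back without a Cartesian product.
table = df.pivot_table(index="timestamp", columns=cols,
                       values="value", aggfunc="first")
table = table.reindex(columns=observed)

print(table.shape)   # (2, 2): two timestamps, two observed channels
```

This keeps the all-NaN channel as a column while only ever materialising the combinations present in the input.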

@jorisvandenbossche
Member

@jlandercy thanks for diving in!
My first reflex was to say that this is due to limitations of unstack dealing with existing NaNs vs introduced NaNs due to unstacking (you can search for 'unstack dropna' in the issues to see some related discussions).

But actually, it seems you can get your desired result (I think) with the underlying groupby + unstack:

In [83]: df.groupby(["timestamp", "channelid", "networkkey", "sitekey", "measurandkey"]).agg('first').unstack([1,2,3,4])
Out[83]: 
             userfloatvalue                     
channelid              5069   5119   5120   5233
networkkey              HMT    HMT    HMT    HMT
sitekey              01MEU1 01AND3 01MEU1 01AND3
measurandkey             Cu     Cu     Pb     Pb
timestamp                                       
2018-01-01              NaN    NaN    NaN    NaN
2018-01-02              NaN    NaN    NaN    NaN
2018-01-03              NaN    NaN    NaN    NaN
2018-01-04              NaN    NaN    NaN    NaN

Is it correct that this is what you want?
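As a usage note on the snippet above (with a toy frame, not the MVWE data): selecting the value column before aggregating yields a Series, so the unstacked result has no extra userfloatvalue level on the columns.

```python
import pandas as pd
import numpy as np

# Toy frame with two channels and all-NaN values:
df = pd.DataFrame({
    "timestamp": ["d1", "d1", "d2", "d2"],
    "channelid": [1, 2, 1, 2],
    "sitekey": ["A", "B", "A", "B"],
    "userfloatvalue": [np.nan] * 4,
})

# Column selection yields a Series, so unstack produces only the
# observed (channelid, sitekey) combinations, with no outer level:
out = (df.groupby(["timestamp", "channelid", "sitekey"])["userfloatvalue"]
         .first()
         .unstack([1, 2]))

print(out.shape)   # (2, 2)
```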

@jlandercy
Author

jlandercy commented Oct 30, 2017

Dear @jorisvandenbossche,

Yes, this is the desired output. Thank you for pointing out this alternative to fill/replace.

Unfortunately, this will not fit into my project as is, simply because I use pivot_table extensively with many parametrizations. Your solution would introduce a new dataflow that I must handle in parallel with pivot_table, instead of properly setting up its parameters. Or maybe I will have to redesign my code to cope with this issue, but I feel odd about that for the moment. I do like the simplicity of the pivot_table method.

What I do not understand is why there is a Cartesian product of the level modalities of the index and columns. Do you have any idea? What could be the reason for, or usage of, this feature? I am really intrigued by this, and I find it a little counter-intuitive for now.

Best regards,

@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Oct 30, 2017
@Sarnath

Sarnath commented Jan 24, 2019

Yes, I hit this problem today. Boy -- So annoying.
We use pivot_table everywhere and this behavior is very limiting.
Is there any ETA for this fix?
Good job by OP in creating a reproducible case... Great job, mate!

@jreback
Contributor

jreback commented Jan 24, 2019

ETA is when someone submits a patch, how about it @Sarnath

@Sarnath

Sarnath commented Jan 24, 2019

Challenge accepted. I have no idea about Pandas development. Let me know how to go about it. I will get started! Cheers!

@Sarnath

Sarnath commented Jan 24, 2019

https://pandas.pydata.org/pandas-docs/stable/contributing.html
I will go with this. Let me know if something else is expected.

@TomAugspurger
Contributor

TomAugspurger commented Jan 24, 2019 via email

@Sarnath

Sarnath commented Jan 25, 2019

Sure, thanks! I will start in a week. I hope I can find a place where I can sync with other developers. I will first read the guide. Thanks for your time!

@marchezinixd

Any news on this?

@TomAugspurger
Contributor

Still open, and will be closed when it's fixed. @marchezinixd are you interested in working on it?

@seden

seden commented Jan 23, 2020

Is there any update on this? Can the cartesian product be turned into a generator so that the memory allocation won't fail? Also, is the reindexing at that point necessary when dropna is False?

@ekerstein

I'm also having this issue. I resorted to replacing missing column keys with "" so that they're included in the pivot_table.
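That workaround can be sketched as follows (a hypothetical minimal example; the frame, column names, and the "" sentinel are illustrative):

```python
import pandas as pd

# Hypothetical frame where one row lacks its column key entirely:
df = pd.DataFrame({
    "site": ["A", "B", None],
    "day": [1, 1, 1],
    "value": [10.0, 20.0, 30.0],
})

# pivot_table's groupby silently drops rows whose key is NaN, so
# substitute an empty-string sentinel before pivoting:
out = df.assign(site=df["site"].fillna("")).pivot_table(
    index="day", columns="site", values="value", aggfunc="first")

print(list(out.columns))   # ['', 'A', 'B']
```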
