pivot_table over Categorical columns #15193

Kevin-McIsaac · 2017-01-22T23:24:09Z

Code Sample, a copy-pastable example if possible

stations = ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                 'Town Hall Station', 'Central Station', 'Circular Quay Station', 
                 'Martin Place Station', 'Museum Station', 'St James Station', 
                 'Bondi Junction Station', 'North Sydney Station']

df = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                               'Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                               'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                   'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017',
                                             '3/1/2017', '3/1/2017', '3/1/2017',]),
                   'Exit': range(0, 9)})

df.Station = df.Station.astype('category', ordered=True, categories=stations)
df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)

Problem description

When the column is a Categorical the output of pivot_table

Includes columns that are all NaN which should not be the case as dropna=True
Includes columns for categories that aren't in the input Data Frame, which is strange.
Is not the same as when the output before the column is converted to a categorical

This was not the behaviour in earlier versions (18?)

Expected Output

Should be the same as the output when the column is not a categorical

Station Kings Cross Station Newtown Station Parramatta Station
Date
2017-01-01 0 1 2
2017-02-01 3 4 5
2017-03-01 6 7 8

Output of `pd.show_versions()`

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.4.41-36.55.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-01-23T00:57:15Z

we had this exact discussion here: #12298

with a categorical

In[1]: df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)
Out[1]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station  Town Hall Station  Central Station  Circular Quay Station  Martin Place Station  Museum Station  St James Station  \
Date                                                                                                                                                                                      
2017-01-01                  0.0              1.0                 2.0                NaN              NaN                    NaN                   NaN             NaN               NaN   
2017-02-01                  3.0              4.0                 5.0                NaN              NaN                    NaN                   NaN             NaN               NaN   
2017-03-01                  6.0              7.0                 8.0                NaN              NaN                    NaN                   NaN             NaN               NaN   

Station     Bondi Junction Station  North Sydney Station  
Date                                                      
2017-01-01                     NaN                   NaN  
2017-02-01                     NaN                   NaN  
2017-03-01                     NaN                   NaN

as object

In [4]: df.assign(Station = df.Station.astype(object)).pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)
Out[4]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                    0                1                   2
2017-02-01                    3                4                   5
2017-03-01                    6                7                   8

so responding to your points

Includes columns that are all NaN which should not be the case as dropna=True

I suppose this is a bug as this is the point of dropna , so this appears not to be working, e.g.
is the expected result

In [14]: df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit').dropna(how='all', axis=1)
Out[14]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                  0.0              1.0                 2.0
2017-02-01                  3.0              4.0                 5.0
2017-03-01                  6.0              7.0                 8.0

Includes columns for categories that aren't in the input Data Frame, which is strange.

this is by definition and the point of a categorical, you get all of the categories. furthermore these are specifically ordered.

Is not the same as when the output before the column is converted to a categorical

not sure what you mean here

jreback · 2017-01-23T01:00:36Z

So I think that we can mark this as a bug w.r.t. point 1 above, IOW, dropna is not doing anything here as its applied to drop the all-na columns.

@jorisvandenbossche

verhalenn · 2017-02-23T19:54:04Z

Would you be opposed to a simple

if isinstance(table, DataFrame) and dropna: 
    table = table.dropna(how='all', axis=1)

added to the end of the function?

It passes all current tests.

jreback · 2017-02-23T19:59:43Z

@verhalenn that might work. of course need new tests as well.

closes pandas-dev#15193 Author: Nicholas Ver Halen <[email protected]> Closes pandas-dev#15511 from verhalenn/issue15193 and squashes the following commits: bf0fdeb [Nicholas Ver Halen] Added description to code change. adf8616 [Nicholas Ver Halen] Added whatsnew for issue 15193 a643267 [Nicholas Ver Halen] Added test for issue 15193 d605251 [Nicholas Ver Halen] Made sure pivot_table propped na columns

jreback added Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 23, 2017

jreback added Bug Difficulty Intermediate Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 23, 2017

jreback added this to the 0.20.0 milestone Jan 23, 2017

verhalenn mentioned this issue Feb 26, 2017

BUG: pivot_table over Categorical Columns #15511

Closed

3 tasks

jreback closed this as completed in ed2a2e4 Mar 4, 2017

peisenha mentioned this issue May 16, 2017

pd.crosstab, categorical data and missing instances #16367

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pivot_table over Categorical columns #15193

pivot_table over Categorical columns #15193

Kevin-McIsaac commented Jan 22, 2017

jreback commented Jan 23, 2017 •

edited

Loading

jreback commented Jan 23, 2017

verhalenn commented Feb 23, 2017

jreback commented Feb 23, 2017

pivot_table over Categorical columns #15193

pivot_table over Categorical columns #15193

Comments

Kevin-McIsaac commented Jan 22, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

jreback commented Jan 23, 2017 • edited Loading

jreback commented Jan 23, 2017

verhalenn commented Feb 23, 2017

jreback commented Feb 23, 2017

Output of `pd.show_versions()`

jreback commented Jan 23, 2017 •

edited

Loading