Skip to content

pivot_table over Categorical columns #15193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Kevin-McIsaac opened this issue Jan 22, 2017 · 4 comments
Closed

pivot_table over Categorical columns #15193

Kevin-McIsaac opened this issue Jan 22, 2017 · 4 comments
Labels
Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@Kevin-McIsaac
Copy link

Code Sample, a copy-pastable example if possible

stations = ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                 'Town Hall Station', 'Central Station', 'Circular Quay Station', 
                 'Martin Place Station', 'Museum Station', 'St James Station', 
                 'Bondi Junction Station', 'North Sydney Station']

df = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                               'Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                               'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                   'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017',
                                             '3/1/2017', '3/1/2017', '3/1/2017',]),
                   'Exit': range(0, 9)})

df.Station = df.Station.astype('category', ordered=True, categories=stations)
df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)

Problem description

When the column is a Categorical the output of pivot_table

  1. Includes columns that are all NaN which should not be the case as dropna=True
  2. Includes columns for categories that aren't in the input Data Frame, which is strange.
  3. Is not the same as when the output before the column is converted to a categorical

This was not the behaviour in earlier versions (18?)

Expected Output

Should be the same as the output when the column is not a categorical

Station Kings Cross Station Newtown Station Parramatta Station
Date
2017-01-01 0 1 2
2017-02-01 3 4 5
2017-03-01 6 7 8

Output of pd.show_versions()

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.4.41-36.55.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Jan 23, 2017

we had this exact discussion here: #12298

with a categorical

In[1]: df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)
Out[1]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station  Town Hall Station  Central Station  Circular Quay Station  Martin Place Station  Museum Station  St James Station  \
Date                                                                                                                                                                                      
2017-01-01                  0.0              1.0                 2.0                NaN              NaN                    NaN                   NaN             NaN               NaN   
2017-02-01                  3.0              4.0                 5.0                NaN              NaN                    NaN                   NaN             NaN               NaN   
2017-03-01                  6.0              7.0                 8.0                NaN              NaN                    NaN                   NaN             NaN               NaN   

Station     Bondi Junction Station  North Sydney Station  
Date                                                      
2017-01-01                     NaN                   NaN  
2017-02-01                     NaN                   NaN  
2017-03-01                     NaN                   NaN  

as object

In [4]: df.assign(Station = df.Station.astype(object)).pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)
Out[4]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                    0                1                   2
2017-02-01                    3                4                   5
2017-03-01                    6                7                   8

so responding to your points

  1. Includes columns that are all NaN which should not be the case as dropna=True

I suppose this is a bug as this is the point of dropna , so this appears not to be working, e.g.
is the expected result

In [14]: df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit').dropna(how='all', axis=1)
Out[14]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                  0.0              1.0                 2.0
2017-02-01                  3.0              4.0                 5.0
2017-03-01                  6.0              7.0                 8.0
  1. Includes columns for categories that aren't in the input Data Frame, which is strange.

this is by definition and the point of a categorical, you get all of the categories. furthermore these are specifically ordered.

  1. Is not the same as when the output before the column is converted to a categorical

not sure what you mean here

@jreback jreback added Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 23, 2017
@jreback
Copy link
Contributor

jreback commented Jan 23, 2017

So I think that we can mark this as a bug w.r.t. point 1 above, IOW, dropna is not doing anything here as its applied to drop the all-na columns.

@jorisvandenbossche

@jreback jreback added Bug Difficulty Intermediate Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 23, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 23, 2017
@verhalenn
Copy link

Would you be opposed to a simple

if isinstance(table, DataFrame) and dropna: 
    table = table.dropna(how='all', axis=1)

added to the end of the function?

It passes all current tests.

@jreback
Copy link
Contributor

jreback commented Feb 23, 2017

@verhalenn that might work. of course need new tests as well.

@jreback jreback closed this as completed in ed2a2e4 Mar 4, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#15193

Author: Nicholas Ver Halen <[email protected]>

Closes pandas-dev#15511 from verhalenn/issue15193 and squashes the following commits:

bf0fdeb [Nicholas Ver Halen] Added description to code change.
adf8616 [Nicholas Ver Halen] Added whatsnew for issue 15193
a643267 [Nicholas Ver Halen] Added test for issue 15193
d605251 [Nicholas Ver Halen] Made sure pivot_table propped na columns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants