BUG: melt should preserve Categorical id_vars #15853

Closed
wavexx opened this issue Mar 31, 2017 · 7 comments · Fixed by #29510
Labels
Categorical (Categorical Data Type), good first issue, Needs Tests (Unit test(s) needed to prevent regressions)

Comments

@wavexx

wavexx commented Mar 31, 2017

When using melt, I'd expect the type of the columns specified as id_vars to be preserved.
Categorical types seem to be lost in the process:

import pandas as pd
data = pd.DataFrame({'A': [1,2], 'B': pd.Categorical(['X', 'Y'])})
print(data)
print(data.info())
melted = pd.melt(data, ['B'], ['A'])
print(melted)
print(melted.info())

shows:

None
   B variable  value
0  X        A      1
1  Y        A      2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
B           2 non-null object
variable    2 non-null object
value       2 non-null int64
dtypes: int64(1), object(2)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-2-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: None
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.8.0.dev0+c906881
xarray: None
IPython: 5.1.0
sphinx: 1.4.9
patsy: 0.4.1+dev
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

wavexx changed the title from "melt should preserve Categorical value_vars" to "melt should preserve Categorical id_vars" on Mar 31, 2017
@jreback
Contributor

jreback commented Mar 31, 2017

xref #15785: this has the same cause (and the same lack of complete testing). Pull requests are welcome.

jreback added the Bug, Categorical (Categorical Data Type), Difficulty Intermediate, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Mar 31, 2017
jreback added this to the Next Major Release milestone on Mar 31, 2017
jreback changed the title from "melt should preserve Categorical id_vars" to "BUG: melt should preserve Categorical id_vars" on Mar 31, 2017
@gregorylivschitz

@jreback

I want to try my hand at this bug. It looks like the categorical dtype is lost when np.tile is called:

mdata[col] = np.tile(frame.pop(col).values, K)

I can check what the type of the col is and if it's categorical I can just cast it back. Do you think that's a valid solution?
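To make the proposal above concrete, here is a minimal sketch of the "tile, then cast back" idea. It is not the code that was merged into pandas; the variable names are illustrative, and it assumes the original categories and ordering should be carried over unchanged.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("aabca"), dtype="category")

# Tiling the materialized values yields a plain NumPy array, so the
# categorical dtype is gone at this point.
tiled = np.tile(np.asarray(s), 3)

# Cast back, reusing the original categories and ordering so that
# nothing about the dtype changes (illustrative approach only).
restored = pd.Series(
    pd.Categorical(tiled, categories=s.cat.categories, ordered=s.cat.ordered)
)
```

Passing `categories` and `ordered` explicitly matters here: inferring them from the tiled values would drop unused categories and any ordering.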

@jreback
Contributor

jreback commented Apr 1, 2017

so you have this

In [12]: s = Series(list('aabca'), dtype='category')

In [13]: s
Out[13]: 
0    a
1    a
2    b
3    c
4    a
dtype: category
Categories (3, object): [a, b, c]

and this is what's produced

In [14]: np.tile(s.values, 3)
Out[14]: 
array(['a', 'a', 'b', 'c', 'a', 'a', 'a', 'b', 'c', 'a', 'a', 'a', 'b',
       'c', 'a'], dtype=object)

FYI when you fix this bug it will also fix #15785

However, I think we should actually define Series.tile (very similar to Series.repeat), and also define it on Categorical, which Series.tile would simply call. Categorical.tile would then do this:

In [19]: pd.Categorical.from_codes(np.tile(c.codes, 3), categories=c.categories, ordered=c.ordered)
Out[19]: 
[a, a, b, c, a, ..., a, a, b, c, a]
Length: 15
Categories (3, object): [a, b, c]

In [20]: Series(pd.Categorical.from_codes(np.tile(c.codes, 3), categories=c.categories, ordered=c.ordered))
Out[20]: 
0     a
1     a
2     b
3     c
4     a
5     a
6     a
7     b
8     c
9     a
10    a
11    a
12    b
13    c
14    a
dtype: category
Categories (3, object): [a, b, c]

Let me know when you need more guidance. This is sort of 'trivial' to fix, but the right way of doing this is as I outlined above. It's a bit more code, but it puts things in the correct places. It also needs tests (for the new .tile, on both Series and Categorical) :>
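The codes-based approach outlined in this comment can be sketched as a standalone function. `tile_categorical` is a hypothetical helper name used only for illustration; it is not a pandas API, and this thread does not confirm that a `Series.tile` or `Categorical.tile` method was ever added.

```python
import numpy as np
import pandas as pd


def tile_categorical(cat, reps):
    """Hypothetical helper (not a pandas API): tile a Categorical via its codes."""
    # Only the integer codes are tiled; categories and ordering are
    # passed through untouched, so the dtype is preserved exactly.
    return pd.Categorical.from_codes(
        np.tile(cat.codes, reps),
        categories=cat.categories,
        ordered=cat.ordered,
    )


# An ordered Categorical with an unused category ("d") to show that
# both properties survive the tiling.
c = pd.Categorical(list("aabca"), categories=list("abcd"), ordered=True)
tiled = tile_categorical(c, 3)
```

Working on the codes avoids materializing the values as objects and rebuilding the categories, which is why it keeps unused categories and the `ordered` flag for free.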

dsm054 added a commit to dsm054/pandas that referenced this issue Nov 13, 2018
Add support for tile and not simply repeat.
@dsm054
Contributor

dsm054 commented Nov 13, 2018

@jreback: is that the sort of thing you had in mind, before I go writing any tests? It doesn't seem to affect #15785, though, which seems to not-crash for me already.

@jreback
Contributor

jreback commented Nov 13, 2018

It might be much better, or already fixed, as Categorical has seen a lot of work recently.

If it is fixed, just adding some tests would be great (otherwise a fix would be great too).

dsm054 added a commit to dsm054/pandas that referenced this issue Nov 13, 2018
Add support for tile and not simply repeat.
@dsm054
Contributor

dsm054 commented Nov 13, 2018

Well, I'll put up a PR for this one and leave #15785 alone for now.

dsm054 added a commit to dsm054/pandas that referenced this issue Nov 13, 2018
Also add support for tile and not simply repeat.
@mroeschke
Member

This looks fixed on master. Could use a test.

In [183]: melted = pd.melt(data, ['B'], ['A'])
     ...:

In [184]: melted.dtypes
Out[184]:
B           category
variable      object
value          int64
dtype: object

In [185]: pd.__version__
Out[185]: '0.26.0.dev0+555.gf7d162b18'
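A regression test along the lines requested here might look like the sketch below. It is not the test that was eventually merged; the function name and the extra column `C` are made up for illustration, and it assumes a pandas version in which the fix is present.

```python
import pandas as pd


def check_melt_preserves_categorical_id_vars():
    # Illustrative frame: an ordered Categorical id column with an
    # unused category, plus two value columns so that melt has to
    # repeat the id column.
    data = pd.DataFrame(
        {
            "A": [1, 2],
            "C": [3, 4],
            "B": pd.Categorical(
                ["X", "Y"], categories=["X", "Y", "Z"], ordered=True
            ),
        }
    )
    melted = pd.melt(data, id_vars=["B"], value_vars=["A", "C"])

    # The id column should keep the exact same categorical dtype
    # (categories and ordering), not fall back to object.
    assert melted["B"].dtype == data["B"].dtype
    assert melted["B"].tolist() == ["X", "Y", "X", "Y"]
    return melted


melted = check_melt_preserves_categorical_id_vars()
```

Using more than one value column exercises the repetition of the id column, which is the step where the dtype was originally lost.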

mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug, Categorical (Categorical Data Type), Difficulty Intermediate, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Oct 14, 2019
6 participants