Categorical dtype doesn't survive groupby of first, max, min, value_counts etc.: unwanted coercion to object #18502

Closed
dcolascione opened this issue Nov 26, 2017 · 1 comment · Fixed by #27071
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Enhancement Groupby

Comments

@dcolascione

Code Sample, a copy-pastable example if possible

    In [1]: df=pd.DataFrame(dict(payload=[-1,-2,-1,-2], col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)));df
    Out[1]: 
       col  payload
    0  foo       -1
    1  bar       -2
    2  bar       -1
    3  qux       -2

    In [2]: df.groupby("payload").first().col.dtype
    Out[2]: dtype('O')

Problem description

Grouping shouldn't coerce a categorical into object. Categorical dtypes should be preserved as long as possible for efficiency and correctness.

Expected Output

The result dtype should be CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True), just like it is here:

    In [6]: df.groupby("payload").head().col.dtype
    Out[6]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
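
Until the dtype is preserved by the aggregation itself, one possible workaround is to cast the aggregated column back to the original dtype. This is a minimal sketch and not part of the original report; the names `original_dtype` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    dict(
        payload=[-1, -2, -1, -2],
        col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
    )
)

# Remember the categorical dtype (categories and ordering) before grouping.
original_dtype = df["col"].dtype

result = df.groupby("payload").first()

# On affected versions "col" comes back as object, so cast it back.
# Where the dtype already survives, this cast changes nothing.
result["col"] = result["col"].astype(original_dtype)

print(result["col"].dtype)
```

The cast only restores the dtype after the fact; the intermediate object column is still materialized, so it does not address the efficiency concern raised above.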

#### Output of ``pd.show_versions()``

<details>
    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.4.3.final.0
    python-bits: 64
    OS: Linux
    OS-release: 4.4.0-98-generic
    machine: x86_64
    processor: x86_64
    byteorder: little
    LC_ALL: None
    LANG: en_US.UTF-8
    LOCALE: en_US.UTF-8

    pandas: 0.21.0
    pytest: 3.2.5
    pip: 9.0.1
    setuptools: 36.5.0
    Cython: 0.20.1post0
    numpy: 1.13.3
    scipy: 0.13.3
    pyarrow: None
    xarray: 0.9.6
    IPython: 6.2.0
    sphinx: None
    patsy: None
    dateutil: 2.6.1
    pytz: 2017.3
    blosc: None
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.6.4
    feather: None
    matplotlib: 1.3.1
    openpyxl: None
    xlrd: None
    xlwt: None
    xlsxwriter: None
    lxml: None
    bs4: 4.2.1
    html5lib: 0.999
    sqlalchemy: 0.8.4
    pymysql: None
    psycopg2: None
    jinja2: 2.7.2
    s3fs: 0.1.2
    fastparquet: None
    pandas_gbq: None
    pandas_datareader: None

</details>
@jreback
Contributor

jreback commented Nov 26, 2017

I suppose. This would require a fair amount of work on this type of preservation, as these routines are in Cython. It would only be able to work for the non-numerical filters (first, last, min, max). If you want to submit a PR, it would be accepted.

Note that Python 3.4 is no longer supported, FYI.
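
One way to picture dtype preservation for first, last, min and max is to aggregate the integer codes of the categorical and rebuild the categorical afterwards. The sketch below is an assumption about how this could look in user code, not pandas' actual Cython implementation, and it ignores missing values (which are stored as code -1).

```python
import pandas as pd

cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
keys = pd.Series([-1, -2, -1, -2], name="payload")

# Work on the integer codes rather than on the category labels.
codes = pd.Series(cat.codes, name="col")


def rebuild(agg_codes):
    """Turn aggregated codes back into a categorical Series with the original dtype."""
    values = pd.Categorical.from_codes(agg_codes.to_numpy(), dtype=cat.dtype)
    return pd.Series(values, index=agg_codes.index, name="col")


first = rebuild(codes.groupby(keys).first())
# For an ordered categorical, the largest code is the largest category.
largest = rebuild(codes.groupby(keys).max())

print(first.dtype)
print(largest.tolist())
```

Because min and max compare codes, this only makes sense for ordered categoricals; first and last do not depend on the ordering.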

@jreback jreback added Categorical Categorical Data Type Difficulty Intermediate Dtype Conversions Unexpected or buggy dtype conversions Enhancement Groupby labels Nov 26, 2017
@jreback jreback added this to the Next Major Release milestone Nov 26, 2017
@jreback jreback changed the title Categorical dtype doesn't survive groupby of first, max, min, etc.: unwanted coercion to object Categorical dtype doesn't survive groupby of first, max, min, value_counts etc.: unwanted coercion to object Jan 27, 2018
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 May 28, 2019
jreback added a commit to jreback/pandas that referenced this issue May 29, 2019
preserve dtypes when applying a ufunc to a sparse dtype

closes pandas-dev#18502
jreback added a commit to jreback/pandas that referenced this issue May 29, 2019
preserve dtypes when applying a ufunc to a sparse dtype

closes pandas-dev#18502
closes pandas-dev#23743
jreback added a commit to jreback/pandas that referenced this issue Jun 27, 2019
2 participants