BUG: astype(CategoricalDtype) has no effect #15078

Closed
arita37 opened this issue Jan 7, 2017 · 9 comments
Labels: Bug, Categorical
@arita37

arita37 commented Jan 7, 2017

The following fails silently to convert the dtype:

In [14]: s = pd.Series(['a', 'b', 'c'])

In [15]: s
Out[15]: 
0    a
1    b
2    c
dtype: object

In [16]: s.astype(pd.types.dtypes.CategoricalDtype)
Out[16]: 
0    a
1    b
2    c
dtype: object

I would expect this either to work or to raise an error saying that the dtype is not understood.


Code Sample

When a dictionary of dtypes is passed to astype, the DataFrame is not converted:
the types are not modified, whereas df['col'] = df['col'].astype(type1) works.

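import numpy as np
import pandas

# Note: 'dept' is given the CategoricalDtype class itself, not an instance.
dtype0 = {'brand': np.dtype('int64'),
          'category': np.dtype('int64'),
          'chain': np.dtype('int64'),
          'company': np.dtype('int64'),
          'date': np.dtype('O'),
          'dept': pandas.types.dtypes.CategoricalDtype,
          'id': np.dtype('int64')}
df = df.astype(dtype0)  # df is an existing DataFrame with these columns
df.dtypes  # 'dept' is not converted to category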

Problem description

When a dictionary of dtypes is passed to astype, the DataFrame is not converted:
the types are not modified, whereas df['col'] = df['col'].astype(type1) works.

Expected Output

Columns are converted to the desired types.

Output of pd.show_versions()

Pandas 0.19.2

@jorisvandenbossche
Member

Can you show a reproducible example that shows the problem?

For me this works with a simple example:

In [1]: pd.__version__
Out[1]: '0.19.2'

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':['a','b', 'c']})

In [3]: df.dtypes
Out[3]: 
a     int64
b    object
dtype: object

In [4]:  df.astype({'a': 'float64', 'b': 'category'}).dtypes
Out[4]: 
a     float64
b    category
dtype: object

@jorisvandenbossche added the "Needs Info" label Jan 7, 2017
@arita37
Author

arita37 commented Jan 7, 2017

Your solution works.
It seems pandas.types.dtypes.CategoricalDtype is not recognized when casting,
so it is better to use 'category'.

@arita37 closed this as completed Jan 7, 2017
@jorisvandenbossche changed the title from "Bug: dtypes conversion issue" to "BUG: astype(CategoricalDtype) has no effect" Jan 7, 2017
@jorisvandenbossche
Member

Hmm, I think that should actually work, or otherwise raise an error. I reopened the issue and updated the top post with an example.

@jorisvandenbossche added the "Bug" and "Categorical" labels and removed the "Needs Info" label Jan 7, 2017
@jreback
Contributor

jreback commented Jan 7, 2017

Maybe, though this is an internal type (and not directly exposed to the user).

@jorisvandenbossche
Member

Yeah, and it should not necessarily work for me, but if we don't accept it as a dtype, then it should raise an error IMO.

@jreback
Contributor

jreback commented Jan 7, 2017

Is this with our new default, errors='raise'?

@jorisvandenbossche
Member

It does not depend on that (and the default was already to raise, the keyword name changed but not the default value).

The reason is that the provided dtype is converted to a numpy dtype:

In [24]: np.dtype(list)
Out[24]: dtype('O')

In [25]: np.dtype(pd.types.dtypes.CategoricalDtype)
Out[25]: dtype('O')

For the same reason, something like s.astype(list) also runs without error but does nothing (for an object Series).
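The fallback described above can be checked directly with NumPy. This is only a minimal sketch: `Plain` is a hypothetical class introduced for illustration, standing in for any Python class that NumPy does not recognize as a dtype.

```python
import numpy as np


class Plain:
    """Hypothetical class with no dtype meaning, used only for illustration."""


# NumPy maps Python classes it does not recognize to the generic object dtype,
# which is how a dtype *class* passed to astype can end up as a silent no-op
# on an object Series.
list_dtype = np.dtype(list)
plain_dtype = np.dtype(Plain)

print(list_dtype, plain_dtype)
```

Because the result is a valid NumPy dtype, nothing downstream has a reason to raise.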

It can just be a bit confusing to users, I think, as an instantiated CategoricalDtype actually works:

In [26]: s.astype(pd.types.dtypes.CategoricalDtype())
Out[26]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

In [27]: s.astype(pd.types.dtypes.CategoricalDtype)
Out[27]: 
0    a
1    b
2    c
dtype: object

which comes down to this difference:

In [28]: pd.types.common.is_categorical_dtype(pd.types.dtypes.CategoricalDtype)
Out[28]: False

In [29]: pd.types.common.is_categorical_dtype(pd.types.dtypes.CategoricalDtype())
Out[29]: True
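The class-versus-instance distinction can also be sketched with a plain isinstance check. This assumes a pandas version that exposes the dtype as `pd.CategoricalDtype` (rather than the `pd.types.dtypes` path used in this thread), and it uses `isinstance` instead of the `is_categorical_dtype` helper shown above, so it illustrates the idea rather than the exact internal code path.

```python
import pandas as pd

# The class object itself is not an instance of CategoricalDtype ...
class_is_instance = isinstance(pd.CategoricalDtype, pd.CategoricalDtype)

# ... but an instantiated dtype is, so only the latter passes
# instance-based checks.
instance_is_instance = isinstance(pd.CategoricalDtype(), pd.CategoricalDtype)

s = pd.Series(['a', 'b', 'c'])
# Passing an instance converts the Series; categories are inferred from the data.
converted = s.astype(pd.CategoricalDtype())

print(class_is_instance, instance_is_instance, converted.dtype)
```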

@arita37
Author

arita37 commented Jan 8, 2017 via email

@TomAugspurger
Contributor

> We spent some time trying to find the actual object behind 'category', as
> this is not really mentioned in the documentation.

At the moment, the CategoricalDtype isn't part of the public API. There are plans to refactor it a bit before exposing it.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 25, 2017
We extended the CategoricalDtype to accept optional categories and ordered
arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are known ahead of time in other places, e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
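As a rough illustration of what the commit message describes, the following sketch passes a dtype with explicit categories to astype. It assumes a pandas version where `pd.CategoricalDtype` is public and accepted by `Series.astype`; the sample data is made up for the example.

```python
import pandas as pd

# Declare the categories and their ordering ahead of time.
dtype = pd.CategoricalDtype(categories=['a', 'b'], ordered=True)

s = pd.Series(['a', 'b', 'c'])
converted = s.astype(dtype)

# The resulting Series carries exactly the declared categories and ordering.
print(converted.cat.categories.tolist(), converted.cat.ordered)
print(converted.isna().tolist())
```

Values outside the declared categories, such as 'c' here, become missing rather than raising an error.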
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 30, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 31, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 6, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 10, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 17, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 20, 2017
@jreback added this to the 0.21.0 milestone Sep 23, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017
jreback pushed a commit that referenced this issue Sep 23, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017