Regression: "invalid type <class 'pandas.core.dtypes.dtypes.CategoricalDtype'> for astype" #17780

Closed
toobaz opened this issue Oct 4, 2017 · 10 comments
Labels
Categorical (Categorical Data Type), Error Reporting (Incorrect or improved errors from pandas)

@toobaz
Member

toobaz commented Oct 4, 2017

Sorry in advance if this is actually desired behaviour. The whatsnew/v0.21.0.txt only states

"passing categories or ordered kwargs to :func:Series.astype is deprecated, in favor of passing a :ref:CategoricalDtype <whatsnew_0210.enhancements.categorical_dtype> (:issue:#17636)"

... so I assume it is not.

Code Sample, a copy-pastable example if possible

Until recently (way more recently than this particular git commit):

In [2]: pd.__version__
Out[2]: '0.19.0+git14-ga40e185'

In [3]: pd.Series(['a', 'b', 'c']).astype(pd.types.dtypes.CategoricalDtype)
Out[3]: 
0    a
1    b
2    c
dtype: object

Now:

In [2]: pd.Series(['a', 'b', 'c']).astype(pd.core.dtypes.dtypes.CategoricalDtype)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-512aeb10c4b0> in <module>()
----> 1 pd.Series(['a', 'b', 'c']).astype(pd.core.dtypes.dtypes.CategoricalDtype)

/home/pietro/nobackup/repo/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    115                 else:
    116                     kwargs[new_arg_name] = new_arg_value
--> 117             return func(*args, **kwargs)
    118         return wrapper
    119     return _deprecate_kwarg

/home/pietro/nobackup/repo/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   3898         # else, only a single dtype is given
   3899         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 3900                                      **kwargs)
   3901         return self._constructor(new_data).__finalize__(self)
   3902 

/home/pietro/nobackup/repo/pandas/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3404 
   3405     def astype(self, dtype, **kwargs):
-> 3406         return self.apply('astype', dtype=dtype, **kwargs)
   3407 
   3408     def convert(self, **kwargs):

/home/pietro/nobackup/repo/pandas/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3271 
   3272             kwargs['mgr'] = self
-> 3273             applied = getattr(b, f)(**kwargs)
   3274             result_blocks = _extend_blocks(applied, result_blocks)
   3275 

/home/pietro/nobackup/repo/pandas/pandas/core/internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    530     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    531         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 532                             **kwargs)
    533 
    534     def _astype(self, dtype, copy=False, errors='raise', values=None,

/home/pietro/nobackup/repo/pandas/pandas/core/internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, raise_on_error, **kwargs)
    548         # may need to convert to categorical
    549         # this is only called for non-categoricals
--> 550         if self.is_categorical_astype(dtype):
    551 
    552             # deprecated 17636

/home/pietro/nobackup/repo/pandas/pandas/core/internals.py in is_categorical_astype(self, dtype)
    143             # this is a pd.Categorical, but is not
    144             # a valid type for astypeing
--> 145             raise TypeError("invalid type {0} for astype".format(dtype))
    146 
    147         elif is_categorical_dtype(dtype):

TypeError: invalid type <class 'pandas.core.dtypes.dtypes.CategoricalDtype'> for astype
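A version-tolerant reproduction of the report might look like the sketch below. It assumes `pandas.api.types.CategoricalDtype` is importable (the public path from 0.21 on) and simply records whichever outcome the installed pandas produces; the variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c'])

try:
    # Pass the class itself, not an instance, as in the report above.
    result = s.astype(CategoricalDtype)
    outcome = "cast to dtype {0}".format(result.dtype)
except TypeError as exc:
    outcome = "TypeError: {0}".format(exc)

print(outcome)
```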

Problem description

Again, I'm not really sure this is undesired... maybe it would be enough to clarify the whatsnew note a bit.

Expected Output

Previous behaviour.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.21.0.dev+572.g8e89cb3e1
pytest: 3.0.6
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: 0.4.1+dev
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

TomAugspurger commented Oct 4, 2017

It needs to be an instance of CategoricalDtype:

.astype(pd.api.types.CategoricalDtype(['a', 'b'])). We can probably catch this and raise a better error message.

And you can use CategoricalDtype() to infer the categories, or just use the string 'category'.
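The three spellings mentioned above can be compared side by side. This is a minimal sketch, assuming a pandas version that exposes `pandas.api.types.CategoricalDtype`; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c'])

# Explicit categories: values not listed ('c' here) become missing.
explicit = s.astype(CategoricalDtype(['a', 'b']))

# No arguments: the categories are inferred from the data.
inferred = s.astype(CategoricalDtype())

# The string alias behaves like the no-argument instance.
by_name = s.astype('category')

print(list(explicit.cat.categories))
print(list(inferred.cat.categories))
print(inferred.dtype == by_name.dtype)
```

In each case `astype` receives either an instance or a string, never the class itself.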

TomAugspurger added this to the 0.21.0 milestone Oct 4, 2017
TomAugspurger added the Categorical and Error Reporting labels Oct 4, 2017
@jreback
Contributor

jreback commented Oct 4, 2017

This should just raise, as passing the class itself (an instance of type, rather than an instance of the dtype) is not valid; if it worked before, that was accidental. Though you could show a deprecation warning (and have it work) if you want.

@toobaz
Member Author

toobaz commented Oct 4, 2017

If it was never supposed to work, then my opinion is that we can avoid the DeprecationWarning.

I just used to find it cleaner to do .astype(something) with something being actually a constant of type type, or np.dtype (not str, and not instantiated by me). But maybe I was just paranoid, in which case I think we can close this issue.

@jreback
Contributor

jreback commented Oct 4, 2017

I think this should work; it's similar to:

In [5]: Series([1,2, 3]).astype(np.float)
Out[5]: 
0    1.0
1    2.0
2    3.0
dtype: float64

where np.float is a type itself (like CDT). That said, numpy doesn't support parameterized types in a first-class way anyhow, so it doesn't come up on the numpy side.

@jreback
Contributor

jreback commented Oct 4, 2017

@toobaz would you do a PR for this?

@toobaz
Member Author

toobaz commented Oct 5, 2017

> that said numpy doesn't support parameterized types in a first class way anyhow

Aren't strings an example?

In [2]: np.array([1, 2, 3]).astype(np.str_)
Out[2]: 
array(['1', '2', '3'], 
      dtype='<U21')
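To make the parameter visible: in the sketch below, `np.str_` carries no length, so `astype` has to choose one, whereas an explicit `np.dtype('U5')` instance fixes it. The inferred width depends on the platform's default integer size, so it is printed rather than assumed.

```python
import numpy as np

# np.str_ carries no length, so astype picks one wide enough for the data.
converted = np.array([1, 2, 3]).astype(np.str_)
print(converted.dtype)

# An explicit dtype instance fixes the parameter (here, 5 characters).
fixed = np.array([1, 2, 3]).astype(np.dtype('U5'))
print(fixed.dtype)
```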

> @toobaz would you do a PR for this?

I can try (but not very soon)

@jorisvandenbossche
Member

> I think this should work, its similar to:

I first thought this as well (so putting me in favor of allowing the case from the initial post), but it is not fully equivalent.
In numpy, np.float is not the dtype; it is a callable that creates a scalar of the floating type ('the array scalar type'). An actual dtype is created with np.dtype(...), so it is also an instance, not a class.
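The distinction can be checked directly. This sketch uses `np.float64` instead of `np.float`, since the `np.float` alias has been removed from recent NumPy releases; the variable names are illustrative.

```python
import numpy as np

scalar_type = np.float64                # a class: the array scalar type
dtype_instance = np.dtype(np.float64)   # an instance of np.dtype

print(isinstance(scalar_type, type))
print(isinstance(dtype_instance, np.dtype))

# astype accepts either one and ends up with the same dtype.
arr = np.array([1, 2, 3])
same = arr.astype(scalar_type).dtype == arr.astype(dtype_instance).dtype
print(same)
```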

That said, I don't really think that equivalence (or non-equivalence) matters that much in deciding whether we want to allow .astype(CategoricalDtype) :-)

Whatever we do, it would be nice to deal with the s.astype(pd.api.types.DatetimeTZDtype) case as well. Currently this casts to object (as CategoricalDtype did before as well), but wouldn't it be more informative to raise an error?

@TomAugspurger
Contributor

> That said, I don't really think that equivalence (or non-equivalence) matters that much in deciding whether we want to allow .astype(CategoricalDtype) :-)

I didn't realize that the previous behavior was to convert to object, and not Categorical.

I don't feel strongly about this at all, but I would prefer to raise with a nice error message.

Pro of error message: Catch cases where the user meant to instantiate the CDT with categories, but forgot

Con: Have to type an extra (), which I'm OK with.

I'll make a PR.

@TomAugspurger
Contributor

Should we catch all the other extension types as well? I can't imagine a case where people want .astype(pd.api.types.IntervalDtype) to cast to object.
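One way such a check could cover every extension type at once is sketched below. `require_dtype_instance` is a hypothetical helper, not pandas' actual implementation, and it assumes a pandas version that exposes `pandas.api.extensions.ExtensionDtype` as the common base class of these dtypes.

```python
import inspect

from pandas.api.extensions import ExtensionDtype
from pandas.api.types import CategoricalDtype, DatetimeTZDtype, IntervalDtype


def require_dtype_instance(dtype):
    """Hypothetical guard: reject an extension dtype class passed instead of an instance."""
    if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
        raise TypeError(
            "Expected an instance of {0}, but got the class itself; "
            "try {0}(...) instead".format(dtype.__name__)
        )
    return dtype


# Each bare class is rejected with an informative message.
for candidate in (CategoricalDtype, DatetimeTZDtype, IntervalDtype):
    try:
        require_dtype_instance(candidate)
    except TypeError as exc:
        print(exc)

# Instances and string aliases pass through unchanged.
require_dtype_instance(CategoricalDtype(['a', 'b']))
require_dtype_instance('category')
```

Checking against the shared base class avoids listing each dtype by hand, at the cost of also rejecting any third-party extension dtype class.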

@toobaz
Member Author

toobaz commented Oct 5, 2017

> I didn't realize that the previous behavior was to convert to object, and not Categorical.

Aha, I had missed that too.
