issubdtype(<categorical>, np.bool_) raises error #9581

jankatins · 2015-03-03T19:57:28Z

In[14]: import pandas as pd
Backend Qt4Agg is interactive backend. Turning interactive mode on.
In[15]: import numpy as np
In[16]: s = pd.Series([1,2,3,1,2,3]).astype("category")
In[17]: s
Out[17]: 
0    1
1    2
2    3
3    1
4    2
5    3
dtype: category
Categories (3, int64): [1 < 2 < 3]
In[18]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
  File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-607a91e2a828>", line 1, in <module>
    np.issubdtype(s.dtype, np.bool_)
  File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood

This is a problem in pydata/patsy#47

Not sure if there is an easy way to get numpy to understand this (I've absolutely no numpy foo :-/ ). If not, this means that every patsy/statsmodels method which does dtype magic has to guard against the category dtype :-/

jreback · 2015-03-03T20:01:34Z

numpy doesn't understand this

you could do

isinstance(s.dtype, np.dtype) and np.issubdtype(s.dtype, np.bool_)

jankatins · 2015-03-03T20:32:55Z

Just understood that this is again the dtype("category") problem :-(

jreback · 2015-03-03T20:37:28Z

yep, you can also do com.is_categorical_dtype(....) which is safe.

jankatins · 2015-03-03T20:57:01Z

The problem is more that every other dtype check is prone to raise :-(

This is the line which raises (https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L185)

# fastpath to avoid doing an item-by-item iteration over boolean
# arrays, as requested by #44
if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_): # <= !!! this line !!!
    self._level_set = set([True, False])
    return True

-> if you want to work with pandas categorical data, you now have to look out for any (arbitrary) method that errors out when given a non-official dtype :-/

jankatins · 2015-03-03T21:00:51Z

IMO the most user friendly solution would be to monkey-patch this numpy function to check for categorical first...

jreback · 2015-03-03T21:01:09Z

you need to dispatch on categorical before this and treat it separately. Code that deals with multiple dtypes needs to be aware. You can simply call np.asarray on everything if you want.

You cannot patch a c-level function.
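A minimal sketch of the dispatch suggested above, assuming a recent pandas in which pd.CategoricalDtype is available at the top level. The helper name is_boolean_data is hypothetical and is not part of patsy or pandas; it handles categoricals first and only then asks numpy about the dtype.

```python
import numpy as np
import pandas as pd


def is_boolean_data(data):
    """Return True only for data backed by a numpy boolean dtype.

    Hypothetical helper: categoricals are dispatched on first, so
    np.issubdtype never sees a dtype that numpy cannot interpret.
    """
    dtype = getattr(data, "dtype", None)
    if dtype is None:
        # Plain Python containers carry no dtype at all.
        return False
    if isinstance(dtype, pd.CategoricalDtype):
        # Categoricals are treated separately and never take the fast path.
        return False
    # Any other non-numpy (extension) dtype is rejected by the isinstance guard.
    return isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.bool_)


bool_series = pd.Series([True, False])
cat_series = pd.Series([1, 2, 3, 1, 2, 3]).astype("category")
print(is_boolean_data(bool_series), is_boolean_data(cat_series))
```

Because data.dtype is always a dtype instance here (never a Python type such as int), the isinstance guard does not reject valid numpy input in this particular helper.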

jankatins · 2015-03-03T21:29:11Z

np.issubdtype is not a c level function :-)

https://github.com/numpy/numpy/blob/master/numpy/core/numerictypes.py#L736

def issubdtype(arg1, arg2):
    if issubclass_(arg2, generic):
        return issubclass(dtype(arg1).type, arg2)
    mro = dtype(arg2).type.mro()
    if len(mro) > 1:
        val = mro[1]
    else:
        val = mro[0]
    return issubclass(dtype(arg1).type, val)

jankatins · 2015-03-03T21:33:48Z

The problem is that (this and probably other) code worked great before pandas 0.15 and works great as long as no categorical comes along but then breaks if you pass in a categorical. So more or less the categorical introduction breaks unrelated code. :-(

jreback · 2015-03-03T21:37:42Z

not sure what to tell you
the numpy functions are just not safe
you might have better luck doing
np.asarray(...) then checking that dtype
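As a rough illustration of the np.asarray route, assuming the same integer-valued categorical Series as in the original report: converting first yields a plain ndarray whose dtype numpy can inspect without raising.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 1, 2, 3]).astype("category")

# np.asarray materialises the categorical as a plain ndarray, so the
# resulting dtype is a real numpy dtype that np.issubdtype understands.
values = np.asarray(s)

is_bool = np.issubdtype(values.dtype, np.bool_)
is_int = np.issubdtype(values.dtype, np.integer)
print(values.dtype, is_bool, is_int)
```

The trade-off is that the conversion copies the data and discards the categorical information, so it only suits checks that do not care whether the input was categorical.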

jankatins · 2015-03-03T22:07:35Z

So, push it to numpy-land?

jorisvandenbossche · 2015-03-03T22:15:37Z

Couldn't this be solved by returning the CategoricalDtype object instead of a string for s.dtype?

In [86]: catdt = pd.core.categorical.CategoricalDtype

In [87]: np.issubdtype(catdt, np.bool_)
Out[87]: False

EDIT: s.dtype is an instance of that object rather than a string, so I mean returning the class instead of an instance for s.dtype

jreback · 2015-03-03T22:29:50Z

@JanSchulz I would explicitly check for a Categorical in patsy. You need to handle them specially anyhow.

jreback · 2015-03-03T22:30:15Z

@jorisvandenbossche
It's already a dtype, just not in the numpy hierarchy (and they are not careful about the check)

In [1]: s = Series(list('abc')).astype('category')

In [2]: s.dtype
Out[2]: category

In [3]: type(s.dtype)
Out[3]: pandas.core.common.CategoricalDtype

jankatins · 2015-03-03T22:38:59Z

I'm confused:

In[25]: type(s.dtype)
Out[25]: pandas.core.common.CategoricalDtype
In[26]: catdt = pd.core.categorical.CategoricalDtype
In[27]:  np.issubdtype(catdt, np.bool_)
Out[27]: False
In[28]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
  File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-607a91e2a828>", line 1, in <module>
    np.issubdtype(s.dtype, np.bool_)
  File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood
In[29]: type(s.dtype) is type(catdt)
Out[29]: False
In[30]: type(s.dtype), type(catdt)
Out[30]: (pandas.core.common.CategoricalDtype, type)
In[31]: type(np.bool_)
Out[31]: type

jankatins · 2015-03-03T22:42:17Z

So if we would not use an object but the class it would work?
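One likely reading of the session above, sketched with only numpy: np.dtype maps any Python class it does not recognise to the object dtype, whereas an instance of such a class cannot be interpreted at all. The class NotANumpyDtype below is a hypothetical stand-in, not the pandas CategoricalDtype.

```python
import numpy as np


class NotANumpyDtype:
    """Hypothetical stand-in for a dtype-like class numpy knows nothing about."""


# Passing the class: numpy falls back to the generic object dtype.
class_as_dtype = np.dtype(NotANumpyDtype)
class_result = np.issubdtype(NotANumpyDtype, np.bool_)

# Passing an instance: numpy cannot interpret it and raises TypeError.
try:
    np.issubdtype(NotANumpyDtype(), np.bool_)
    instance_raised = False
except TypeError:
    instance_raised = True

print(class_as_dtype, class_result, instance_raised)
```

If that reading is right, the False returned for the class says "this is an arbitrary Python object" rather than anything specific about categoricals.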

njsmith · 2015-03-04T05:20:32Z

Pandas did just go ahead and totally break tons of unrelated code here :-( It's pretty ugly to say that even simple checks that have nothing to do with categoricals (like "is this a boolean?"), and which have been written in the same way for as long as numpy existed, suddenly must always check for categoricals explicitly.

njsmith · 2015-03-04T05:22:23Z

E.g. patsy has 12 calls to issubdtype and every one of them now needs to be audited to make sure they don't start raising random errors.

njsmith · 2015-03-04T05:29:42Z

Also, sorry for comment spam, but just noticed this while writing a safe issubdtype wrapper: the suggestion of replacing

np.issubdtype(foo, bar)

with

isinstance(foo, np.dtype) and np.issubdtype(foo, bar)

is not correct in general -- e.g. issubdtype(int, np.integer) is True, but the above expression will return False.
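A minimal sketch of what such a wrapper could look like, assuming that "numpy cannot interpret this as a dtype" should simply count as "not a subdtype". The name safe_issubdtype is hypothetical here and is not claimed to match the wrapper that ended up in patsy.

```python
import numpy as np
import pandas as pd


def safe_issubdtype(dt, parent):
    """np.issubdtype that returns False for dtypes numpy does not understand."""
    try:
        # Normalise first, so Python types such as int keep working.
        dt = np.dtype(dt)
    except TypeError:
        # e.g. a pandas CategoricalDtype instance
        return False
    return np.issubdtype(dt, parent)


cat_dtype = pd.Series([1, 2, 3]).astype("category").dtype
print(safe_issubdtype(int, np.integer))      # Python type is still accepted
print(safe_issubdtype(cat_dtype, np.bool_))  # no exception for categoricals
```

Unlike the isinstance guard, this keeps issubdtype(int, np.integer) working, at the cost of silently treating any uninterpretable first argument as a non-match.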

jorisvandenbossche · 2015-03-04T09:57:12Z

@njsmith Do you see other possible solutions to this problem, apart from making CategoricalDtype a real numpy dtype in C?

jankatins · 2015-03-04T12:34:31Z

Is it actually possible to extend (by writing c code) numpy dtypes from outside numpy?

The other way I see is to really take every numpy dtype function, test whether it handles the new dtype correctly, and if not, write a wrapper and monkey-patch numpy when pandas is imported...

Or write a real categorical numpy array... :-/

https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/doc/newdtype_example/floatint.c
See also here: https://github.com/scipy/scipy/wiki/GSoC-project-ideas

jreback · 2015-03-04T12:41:20Z

you need to write something like

isinstance(obj, np.ndarray) and np.issubdtype(obj.dtype, np.bool_)

in your code

np.issubdtype is prob ok, though it should prob handle array-likes in a safer manner
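For illustration, the ndarray guard above can be wrapped in a small function; is_bool_ndarray is a hypothetical name. One consequence worth knowing, assuming a pandas version that provides Series.to_numpy, is that a boolean Series is not an np.ndarray, so the guard skips it as well.

```python
import numpy as np
import pandas as pd


def is_bool_ndarray(obj):
    """True only for actual numpy arrays with a boolean dtype."""
    return isinstance(obj, np.ndarray) and np.issubdtype(obj.dtype, np.bool_)


bool_array = np.array([True, False])
bool_series = pd.Series([True, False])
cat_series = pd.Series([1, 2, 3]).astype("category")

print(is_bool_ndarray(bool_array))              # a real boolean ndarray
print(is_bool_ndarray(bool_series))             # a Series is not an ndarray
print(is_bool_ndarray(cat_series))              # no exception for categoricals
print(is_bool_ndarray(bool_series.to_numpy()))  # explicit conversion first
```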

shoyer · 2015-03-09T08:09:49Z

I suppose the numpy friendly way to have handled this would have been to use dtype=object for Categorical, and add some custom attribute like pandas_dtype. But that's pretty damn awkward, too... and not really something I want to expose in our API.

cequencer · 2018-06-01T18:31:00Z

I am not sure whether the issue I am running into below is identical to this one. Here are the steps that raise TypeError: data type not understood when numpy encounters the class 'pandas.core.dtypes.dtypes.CategoricalDtype':

In[1]: pd.show_versions(as_json=False)
Out[1]: 
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.10.3.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.6.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

In[2]: dframe = pd.DataFrame({"A":["a","b","c","a"]})

In[3]: dframe["B"] = dframe["A"].astype('category')

In[4]: dframe.dtypes
Out[4]: 
A      object
B    category
dtype: object

In[5]: dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
Out[5]: 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-67-78973e46a59c> in <module>()
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))

/lib/python2.7/site-packages/pandas/core/series.pyc in map(self, arg, na_action)
   2352         else:
   2353             # arg is a function
-> 2354             new_values = map_f(values, arg)
   2355 
   2356         return self._constructor(new_values,

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-67-78973e46a59c> in <lambda>(x)
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))

/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
    724     """
    725     if not issubclass_(arg1, generic):
--> 726         arg1 = dtype(arg1).type
    727     if not issubclass_(arg2, generic):
    728         arg2_orig = arg2

TypeError: data type not understood

In[6]: 
for i, v in dframe.dtypes.iteritems():
    print i
    print v
    t=type(v)
    print t
    if np.issubdtype(v, np.datetime64):
        print i
    print "----------------------"
Out[6]: 
A
object
<type 'numpy.dtype'>
----------------------
B
category
<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-71-24466a088749> in <module>()
      4     t=type(v)
      5     print t
----> 6     if np.issubdtype(v, np.datetime64):
      7         print i
      8     print "----------------------"

/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
    724     """
    725     if not issubclass_(arg1, generic):
--> 726         arg1 = dtype(arg1).type
    727     if not issubclass_(arg2, generic):
    728         arg2_orig = arg2

TypeError: data type not understood

The TypeError raised in step 5 did not make the cause clear to me until I iterated through all the data types in step 6 and saw that it was choking on the 'CategoricalDtype'.

I am not sure if this is the correct forum for such information, and I apologize in advance if it is not. If there is a better, more Pythonic way to do what I need, please let me know as well. Thank you!

jreback · 2018-06-01T23:10:45Z

this is fundamentally a numpy issue. IIRC opened one a few years ago about this

in any event, pandas has lots of methods to introspect dtypes via the pandas.api.types namespace, so you shouldn't need to do this
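A short sketch of that approach for the datetime check in the report above, using pandas.api.types.is_datetime64_any_dtype. The extra column "C" is added here only so that there is something for the check to find; it is not part of the original example.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

dframe = pd.DataFrame({"A": ["a", "b", "c", "a"]})
dframe["B"] = dframe["A"].astype("category")
dframe["C"] = pd.to_datetime(["2018-06-01"] * 4)  # illustrative datetime column

# The pandas helper accepts numpy dtypes and pandas extension dtypes alike,
# so the categorical column no longer raises.
datetime_columns = [
    column
    for column, dtype in dframe.dtypes.items()
    if is_datetime64_any_dtype(dtype)
]
print(datetime_columns)
```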

toobaz · 2020-12-05T09:22:53Z

My understanding is that the way to check for dtypes (including custom ones) that works for both numpy and pandas and never raises is looking at their type attribute:

In [2]: issubclass(np.array([1, 2, 3]).dtype.type, np.integer)                                 
Out[2]: True

In [3]: issubclass(np.array('abc').dtype.type, str)                                            
Out[3]: True

... except that:

In [4]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.CategoricalDtype)         
Out[4]: False

In [5]: pd.Series('abc', dtype='category').dtype.type == pd.CategoricalDtype                   
Out[5]: False

because

In [6]: pd.Series('abc', dtype='category').dtype.type.__mro__                                  
Out[6]: (pandas.core.dtypes.dtypes.CategoricalDtypeType, type, object)

and indeed:

In [7]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.core.dtypes.dtypes.CategoricalDtypeType)                                                                         
Out[7]: True

Maybe that's something we can easily fix on the pandas side, so that the check doesn't require digging beyond the pandas API?

What I mean is:

In [7]: np.array('abc', dtype=np.dtype('<U').type)                                                                                                                                               
Out[7]: array('abc', dtype='<U3')

while:

In [8]:  pd.Series('abc', dtype=pd.CategoricalDtype.type)                                                                           
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-b3184e65f353> in <module>
----> 1  pd.Series('abc', dtype=pd.CategoricalDtype.type)

~/nobackup/repo/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    237                 data = {}
    238             if dtype is not None:
--> 239                 dtype = self._validate_dtype(dtype)
    240 
    241             if isinstance(data, MultiIndex):

~/nobackup/repo/pandas/pandas/core/generic.py in _validate_dtype(self, dtype)
    262 
    263         if dtype is not None:
--> 264             dtype = pandas_dtype(dtype)
    265 
    266             # a compound dtype

~/nobackup/repo/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1886         return npdtype
   1887     elif npdtype.kind == "O":
-> 1888         raise TypeError(f"dtype '{dtype}' not understood")
   1889 
   1890     return npdtype

TypeError: dtype '<class 'pandas.core.dtypes.dtypes.CategoricalDtypeType'>' not understood

... that is, our distinction between a dtype and its type is something that numpy avoids, and maybe we can avoid too? (That's a real question, as I don't know the code well)

I would go as far as saying that if we can fix this, and consider the above as the recommended way for type checks, then we could maybe:

provide some syntactic sugar for it (e.g. pd.Series.is_type(pd.CategoricalDtype))
sooner or later, deprecate the pd.Series.dtype == 'category' comparison, which is incompatible with numpy string aliases (as it breaks the rule that if a == 'repr' and b == 'repr', then a == b)
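To make the comparison in this comment concrete, here is a small sketch that assumes a pandas version exposing pd.CategoricalDtype at the top level. It contrasts the .type-based check, which does not raise for either kind of dtype, with an isinstance check on the dtype object itself, which stays within the public pandas API.

```python
import numpy as np
import pandas as pd

int_dtype = np.array([1, 2, 3]).dtype
cat_dtype = pd.Series(["abc"], dtype="category").dtype

# issubclass on the .type attribute works for numpy and pandas dtypes alike.
int_is_integer = issubclass(int_dtype.type, np.integer)
cat_is_integer = issubclass(cat_dtype.type, np.integer)

# For the categorical itself, checking the dtype instance avoids reaching
# into pandas.core for CategoricalDtypeType.
cat_is_categorical = isinstance(cat_dtype, pd.CategoricalDtype)

print(int_is_integer, cat_is_integer, cat_is_categorical)
```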

toobaz · 2020-12-05T09:26:48Z

Related: #8814

That settled the question for dataframe.dtypes == 'category', which however behaves differently from dataframe.dtypes.iloc[0] == 'category' if the first column is for instance a bool (and we know that won't change on the numpy side), which is sad.

jreback added the Categorical Categorical Data Type label Mar 3, 2015

njsmith mentioned this issue Mar 4, 2015

Pandas 0.15 categorical handling pydata/patsy#59

Merged

mroeschke added Bug Compat pandas objects compatability with Numpy or Python functions labels Jun 28, 2020

simonjayhawkins mentioned this issue Jul 19, 2020

ENH: Select numeric ExtensionDtypes with DataFrame.select_dtypes #35341

Closed

5 tasks
