
issubdtype(<categorical>, np.bool_) raises error #9581

Open
jankatins opened this issue Mar 3, 2015 · 26 comments
Labels
Bug Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions

Comments

@jankatins
Contributor

In[14]: import pandas as pd
Backend Qt4Agg is interactive backend. Turning interactive mode on.
In[15]: import numpy as np
In[16]: s = pd.Series([1,2,3,1,2,3]).astype("category")
In[17]: s
Out[17]: 
0    1
1    2
2    3
3    1
4    2
5    3
dtype: category
Categories (3, int64): [1 < 2 < 3]
In[18]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
  File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-607a91e2a828>", line 1, in <module>
    np.issubdtype(s.dtype, np.bool_)
  File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood

This is a problem in pydata/patsy#47

Not sure if there is an easy way to get numpy to understand this (I've absolutely no numpy foo :-/ ). If not this means that every patsy/statsmodels method which does dtype magic has to guard against the category dtype :-/

@jreback
Contributor

jreback commented Mar 3, 2015

numpy doesn't understand this

you could do

isinstance(s.dtype, np.dtype) and np.issubdtype(s.dtype, np.bool_)

@jankatins
Contributor Author

Just understood that this is again the dtype("category") problem :-(

@jreback
Contributor

jreback commented Mar 3, 2015

yep, you can also do com.is_categorical_dtype(....) which is safe.

@jreback added the Categorical (Categorical Data Type) label Mar 3, 2015
@jankatins
Contributor Author

The problem is more that every other dtype check is prone to raise :-(

This is the line which raises (https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L185)

        # fastpath to avoid doing an item-by-item iteration over boolean
        # arrays, as requested by #44
        if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_):  # <= !!! this line !!!
            self._level_set = set([True, False])
            return True

-> if you want to work with pandas categorical data, you now have to look out for any (arbitrary) method, which errors out when getting a non-official dtype :-/
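
For illustration, the guard suggested earlier in the thread could be applied to a fastpath like this one roughly as follows. This is only a sketch: `has_bool_dtype` is a hypothetical helper name, not something patsy or pandas provides, and it assumes that treating any non-numpy dtype as "not boolean" is acceptable for the caller.

```python
import numpy as np


def has_bool_dtype(data):
    """Hypothetical guard: True only if `data` carries a real numpy bool dtype."""
    dtype = getattr(data, "dtype", None)
    # Only hand genuine numpy dtypes to np.issubdtype; anything else
    # (for example a pandas categorical dtype) is treated as "not boolean"
    # instead of raising TypeError.
    return isinstance(dtype, np.dtype) and bool(np.issubdtype(dtype, np.bool_))


print(has_bool_dtype(np.array([True, False])))  # boolean ndarray
print(has_bool_dtype(np.array([1, 2, 3])))      # integer ndarray
```

Because the check is made on `data.dtype` rather than on an arbitrary type specifier, the `isinstance` test does not lose cases such as plain Python `int` being passed in directly.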

@jankatins
Contributor Author

IMO the most user friendly solution would be to monkey-patch this numpy function to check for categorical first...

@jreback
Contributor

jreback commented Mar 3, 2015

you need to dispatch on categorical before this and treat it separately. Code that deals with multiple dtypes needs to be aware of this. You can simply call np.asarray on everything if you want.

You cannot patch a c-level function.

@jankatins
Contributor Author

np.issubdtype is not a c level function :-)

https://github.com/numpy/numpy/blob/master/numpy/core/numerictypes.py#L736

def issubdtype(arg1, arg2):
    if issubclass_(arg2, generic):
        return issubclass(dtype(arg1).type, arg2)
    mro = dtype(arg2).type.mro()
    if len(mro) > 1:
        val = mro[1]
    else:
        val = mro[0]
    return issubclass(dtype(arg1).type, val)

@jankatins
Contributor Author

The problem is that (this and probably other) code worked great before pandas 0.15 and works great as long as no categorical comes along but then breaks if you pass in a categorical. So more or less the categorical introduction breaks unrelated code. :-(

@jreback
Contributor

jreback commented Mar 3, 2015

not sure what to tell you
the numpy functions are just not safe
you might have better luck doing
np.asarray(...) then checking that dtype
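
A minimal sketch of the `np.asarray` approach, assuming it is acceptable to materialize the data as a plain ndarray first (which may copy, and for a categorical yields an array of its values). The name `is_bool_like` is made up for this example.

```python
import numpy as np


def is_bool_like(data):
    """Hypothetical helper: convert to an ndarray, then check the numpy dtype."""
    # np.asarray always returns an ndarray, so .dtype is a genuine numpy dtype
    # and np.issubdtype can interpret it.
    arr = np.asarray(data)
    return bool(np.issubdtype(arr.dtype, np.bool_))


print(is_bool_like([True, False, True]))
print(is_bool_like([1, 2, 3]))
```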

@jankatins
Contributor Author

So, push it to numpy-land?

@jorisvandenbossche
Member

Couldn't this be solved by returning the CategoricalDType object instead of a string for s.dtype?

In [86]: catdt = pd.core.categorical.CategoricalDtype

In [87]: np.issubdtype(catdt, np.bool_)
Out[87]: False

EDIT: to clarify, s.dtype is already an instance of the object rather than a string, so what I mean is returning the class instead of an instance for s.dtype

@jreback
Contributor

jreback commented Mar 3, 2015

@JanSchulz I would explicitly check for a Categorical in patsy. You need to handle them specially anyhow.

@jreback
Contributor

jreback commented Mar 3, 2015

@jorisvandenbossche
It's already a dtype, just not in the numpy hierarchy (and they are not careful about the check)

In [1]: s = Series(list('abc')).astype('category')

In [2]: s.dtype
Out[2]: category

In [3]: type(s.dtype)
Out[3]: pandas.core.common.CategoricalDtype

@jankatins
Contributor Author

I'm confused:

In[25]: type(s.dtype)
Out[25]: pandas.core.common.CategoricalDtype
In[26]: catdt = pd.core.categorical.CategoricalDtype
In[27]:  np.issubdtype(catdt, np.bool_)
Out[27]: False
In[28]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
  File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-607a91e2a828>", line 1, in <module>
    np.issubdtype(s.dtype, np.bool_)
  File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood
In[29]: type(s.dtype) is type(catdt)
Out[29]: False
In[30]: type(s.dtype), type(catdt)
Out[30]: (pandas.core.common.CategoricalDtype, type)
In[31]: type(np.bool_)
Out[31]: type

@jankatins
Contributor Author

So if we would not use an object but the class it would work?
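
The class-versus-instance difference seen above can be reproduced without pandas. The sketch below uses `FakeDtype`, a hypothetical stand-in for a dtype class that numpy knows nothing about. As far as I can tell, numpy maps an arbitrary Python class to the generic object dtype, while an arbitrary instance cannot be interpreted as a dtype at all.

```python
import numpy as np


class FakeDtype(object):
    """Hypothetical stand-in for a dtype class unknown to numpy."""


# Passing the class: numpy falls back to the object dtype, so the check
# completes and simply reports "not a bool".
print(np.dtype(FakeDtype))
class_result = np.issubdtype(FakeDtype, np.bool_)
print(class_result)

# Passing an instance: numpy cannot interpret it as a dtype and raises.
try:
    np.issubdtype(FakeDtype(), np.bool_)
    instance_raised = False
except TypeError as exc:
    instance_raised = True
    print("TypeError:", exc)
```

So the check on the class "works" only in the sense that the class is treated like any other Python object type; it says nothing specific about categoricals.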

@njsmith

njsmith commented Mar 4, 2015

Pandas did just go ahead and totally break tons of unrelated code here :-( It's pretty ugly to say that even simple checks that have nothing to do with categoricals (like "is this a boolean?"), and which have been written in the same way for as long as numpy has existed, suddenly must always check for categoricals explicitly.

@njsmith

njsmith commented Mar 4, 2015

E.g. patsy has 12 calls to issubdtype and every one of them now needs to be audited to make sure they don't start raising random errors.

@njsmith

njsmith commented Mar 4, 2015

Also, sorry for comment spam, but just noticed this while writing a safe issubdtype wrapper: the suggestion of replacing

np.issubdtype(foo, bar)

with

isinstance(foo, np.dtype) and np.issubdtype(foo, bar)

is not correct in general -- e.g. issubdtype(int, np.integer) is True, but the above expression will return False.
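
A safe wrapper of the kind mentioned above might look like the sketch below. `safe_issubdtype` is a hypothetical name, and the choice to return False for anything numpy cannot interpret is an assumption about what callers want, not an established convention.

```python
import numpy as np


def safe_issubdtype(dt1, dt2):
    """Hypothetical wrapper: like np.issubdtype, but False instead of TypeError."""
    try:
        return bool(np.issubdtype(dt1, dt2))
    except TypeError:
        # numpy could not interpret one of the arguments as a dtype
        # (e.g. a dtype object defined outside the numpy hierarchy).
        return False


# Plain Python types still work, unlike with the isinstance-based guard:
print(safe_issubdtype(int, np.integer))
print(isinstance(int, np.dtype) and np.issubdtype(int, np.integer))

# An object numpy does not understand yields False rather than raising:
print(safe_issubdtype(object(), np.bool_))
```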

@jorisvandenbossche
Member

@njsmith Do you see other possible solutions to this problem, apart from making CategoricalDType a real numpy dtype in C?

@jankatins
Contributor Author

Is it actually possible to extend (by writing c code) numpy dtypes from outside numpy?

The other way I see is really to take every numpy dtype function, test whether it handles the new dtype correctly, and if not, write a wrapper and monkey-patch numpy when pandas is imported...

Or write a real categorical numpy array... :-/

https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/doc/newdtype_example/floatint.c
See also here: https://github.com/scipy/scipy/wiki/GSoC-project-ideas

@jreback
Contributor

jreback commented Mar 4, 2015

you need to write something like

isinstance(obj, np.ndarray) and issubdtype(obj.dtype,np.bool_)

in your code

np.issubdtype is prob ok, though it should prob handle array-likes in a safer manner

@shoyer
Member

shoyer commented Mar 9, 2015

I suppose the numpy friendly way to have handled this would have been to use dtype=object for Categorical, and add some custom attribute like pandas_dtype. But that's pretty damn awkward, too... and not really something I want to expose in our API.

@cequencer

I am not sure whether the issue I am running into below is identical to this one. Here are the steps that raise TypeError: data type not understood when the class 'pandas.core.dtypes.dtypes.CategoricalDtype' is encountered:

In[1]: pd.show_versions(as_json=False)
Out[1]: 
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.10.3.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.6.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

In[2]: dframe = pd.DataFrame({"A":["a","b","c","a"]})

In[3]: dframe["B"] = dframe["A"].astype('category')

In[4]: dframe.dtypes
Out[4]: 
A      object
B    category
dtype: object

In[5]: dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
Out[5]: 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-67-78973e46a59c> in <module>()
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))

/lib/python2.7/site-packages/pandas/core/series.pyc in map(self, arg, na_action)
   2352         else:
   2353             # arg is a function
-> 2354             new_values = map_f(values, arg)
   2355 
   2356         return self._constructor(new_values,

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-67-78973e46a59c> in <lambda>(x)
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))

/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
    724     """
    725     if not issubclass_(arg1, generic):
--> 726         arg1 = dtype(arg1).type
    727     if not issubclass_(arg2, generic):
    728         arg2_orig = arg2

TypeError: data type not understood

In[6]: 
for i, v in dframe.dtypes.iteritems():
    print i
    print v
    t=type(v)
    print t
    if np.issubdtype(v, np.datetime64):
        print i
    print "----------------------"
Out[6]: 
A
object
<type 'numpy.dtype'>
----------------------
B
category
<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-71-24466a088749> in <module>()
      4     t=type(v)
      5     print t
----> 6     if np.issubdtype(v, np.datetime64):
      7         print i
      8     print "----------------------"

/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
    724     """
    725     if not issubclass_(arg1, generic):
--> 726         arg1 = dtype(arg1).type
    727     if not issubclass_(arg2, generic):
    728         arg2_orig = arg2

TypeError: data type not understood

The cause of the TypeError raised in step 5 was not clear to me until I iterated through all the data types in step 6 and saw that it was choking on the 'CategoricalDtype'.

I am not sure if this is the correct forum for this information, and I apologize in advance if it is not. If there is a better, more Pythonic way to do what I need to do, please let me know as well. Thank you!

@jreback
Contributor

jreback commented Jun 1, 2018

This is fundamentally a numpy issue; IIRC one was opened a few years ago about this.

In any event, pandas has lots of methods to introspect dtypes via the pandas.api.types namespace, so you shouldn't need to do this.
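
As a sketch of what that could look like for the DataFrame above: the example assumes a pandas version in which `pandas.api.types` exposes `is_datetime64_any_dtype` and `CategoricalDtype`, and the extra datetime column `C` is added here only to make the result non-trivial.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, is_datetime64_any_dtype

dframe = pd.DataFrame({"A": ["a", "b", "c", "a"]})
dframe["B"] = dframe["A"].astype("category")
dframe["C"] = pd.to_datetime(["2018-01-01"] * 4)  # illustrative extra column

# These helpers accept pandas-specific dtypes without raising.
is_datetime = dframe.dtypes.map(is_datetime64_any_dtype)
is_categorical = dframe.dtypes.map(lambda dt: isinstance(dt, CategoricalDtype))

print(is_datetime)
print(is_categorical)
```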

@mroeschke added the Bug and Compat (pandas objects compatability with Numpy or Python functions) labels Jun 28, 2020
@toobaz
Member

toobaz commented Dec 5, 2020

My understanding is that the way to check for dtypes (including custom ones) that works for both numpy and pandas and never raises is looking at their type attribute:

In [2]: issubclass(np.array([1, 2, 3]).dtype.type, np.integer)                                 
Out[2]: True

In [3]: issubclass(np.array('abc').dtype.type, str)                                            
Out[3]: True

... except that:

In [4]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.CategoricalDtype)         
Out[4]: False

In [5]: pd.Series('abc', dtype='category').dtype.type == pd.CategoricalDtype                   
Out[5]: False

because

In [6]: pd.Series('abc', dtype='category').dtype.type.__mro__                                  
Out[6]: (pandas.core.dtypes.dtypes.CategoricalDtypeType, type, object)

and indeed:

In [7]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.core.dtypes.dtypes.CategoricalDtypeType)                                                                         
Out[7]: True

Maybe that's something we can easily fix on the pandas side, so that the check doesn't require digging beyond the pandas API?

What I mean is:

In [7]: np.array('abc', dtype=np.dtype('<U').type)                                                                                                                                               
Out[7]: array('abc', dtype='<U3')

while:

In [8]:  pd.Series('abc', dtype=pd.CategoricalDtype.type)                                                                           
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-b3184e65f353> in <module>
----> 1  pd.Series('abc', dtype=pd.CategoricalDtype.type)

~/nobackup/repo/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    237                 data = {}
    238             if dtype is not None:
--> 239                 dtype = self._validate_dtype(dtype)
    240 
    241             if isinstance(data, MultiIndex):

~/nobackup/repo/pandas/pandas/core/generic.py in _validate_dtype(self, dtype)
    262 
    263         if dtype is not None:
--> 264             dtype = pandas_dtype(dtype)
    265 
    266             # a compound dtype

~/nobackup/repo/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1886         return npdtype
   1887     elif npdtype.kind == "O":
-> 1888         raise TypeError(f"dtype '{dtype}' not understood")
   1889 
   1890     return npdtype

TypeError: dtype '<class 'pandas.core.dtypes.dtypes.CategoricalDtypeType'>' not understood

... that is, our distinction between a dtype and its type is something that numpy avoids, and maybe we can avoid too? (That's a real question, as I don't know the code well)

I would go as far as saying that if we can fix this, and consider the above as the recommended way for type checks, then we could maybe:

  • provide some syntactic sugar for it (e.g. pd.Series.is_type(pd.CategoricalDtype))
  • sooner or later, deprecate the pd.Series.dtype == 'category' comparison, which is incompatible with numpy string aliases (as it breaks the rule that if a == 'repr' and b == 'repr', then a == b)
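
For illustration, the `.type`-based check described above could be wrapped as in the sketch below. `dtype_type_is` is a hypothetical helper, not an existing pandas or numpy function, and the example only shows that the check does not raise for a categorical dtype. It does not address the `CategoricalDtypeType` mismatch discussed above.

```python
import numpy as np
import pandas as pd


def dtype_type_is(dtype, scalar_type):
    """Hypothetical helper: compare a dtype's scalar type without calling np.dtype()."""
    return issubclass(dtype.type, scalar_type)


int_dtype = np.array([1, 2, 3]).dtype
cat_dtype = pd.Series(["abc"], dtype="category").dtype

print(dtype_type_is(int_dtype, np.integer))     # numpy dtype, integer check
print(dtype_type_is(cat_dtype, np.bool_))       # categorical, no TypeError
print(dtype_type_is(cat_dtype, np.datetime64))  # categorical, no TypeError
```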

@toobaz
Member

toobaz commented Dec 5, 2020

Related: #8814

That settled the question for dataframe.dtypes == 'category', which however behaves differently from dataframe.dtypes.iloc[0] == 'category' if the first column is for instance a bool (and we know that won't change on the numpy side), which is sad.


8 participants