issubdtype(<categorical>, np.bool_) raises error #9581
numpy doesn't understand this; you could do

Just understood that this is again the

yep, you can also do
The problem is more that every other dtype check is prone to raise :-( This is the line which raises (https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L185):

```python
# fastpath to avoid doing an item-by-item iteration over boolean
# arrays, as requested by #44
if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_):  # <= !!! this line !!!
    self._level_set = set([True, False])
    return True
```

If you want to work with pandas categorical data, you now have to look out for any (arbitrary) method which errors out when getting a non-official dtype :-/
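For context, the failure is easy to reproduce with any categorical Series (a minimal sketch, not code from the thread): `np.issubdtype` coerces its first argument via `np.dtype(...)`, which does not understand pandas' extension dtype and surfaces the failure as a `TypeError`.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b"], dtype="category")

# np.issubdtype coerces s.dtype with np.dtype(...), which fails for
# pandas' CategoricalDtype and raises TypeError.
try:
    np.issubdtype(s.dtype, np.bool_)
except TypeError as exc:
    print(f"raises: {exc}")
```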
IMO the most user-friendly solution would be to monkey-patch this numpy function to check for categorical first...
You need to dispatch on categorical before this and treat it separately; code that deals with multiple dtypes needs to be aware. You cannot patch a C-level function.
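A minimal sketch of that dispatch (my illustration, not code from the thread): test for the categorical dtype with `isinstance` first, so `np.issubdtype` is only ever called on dtypes numpy understands.

```python
import numpy as np
import pandas as pd

def is_boolean_data(data):
    """Return True if `data` is a boolean array/Series.

    Dispatches on pandas' CategoricalDtype before the numpy check,
    because np.issubdtype raises TypeError on dtypes numpy cannot
    interpret.
    """
    if not hasattr(data, "dtype"):
        return False
    if isinstance(data.dtype, pd.CategoricalDtype):
        return False  # categorical data is handled separately
    return bool(np.issubdtype(data.dtype, np.bool_))

print(is_boolean_data(np.array([True, False])))             # True
print(is_boolean_data(pd.Series(["a"], dtype="category")))  # False
```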
https://github.com/numpy/numpy/blob/master/numpy/core/numerictypes.py#L736
The problem is that (this and probably other) code worked great before pandas 0.15 and works great as long as no categorical comes along, but then breaks if you pass in a categorical. So more or less the categorical introduction breaks unrelated code. :-(
not sure what to tell you
So, push it to numpy-land?
Couldn't this be solved by returning the class?

EDIT: I mean the class of the object instead of a string -> so returning the class instead of an instance.
@JanSchulz I would explicitly check for a categorical dtype first.
@jorisvandenbossche
I'm confused:

```python
In [25]: type(s.dtype)
Out[25]: pandas.core.common.CategoricalDtype

In [26]: catdt = pd.core.categorical.CategoricalDtype

In [27]: np.issubdtype(catdt, np.bool_)
Out[27]: False

In [28]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
  File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-607a91e2a828>", line 1, in <module>
    np.issubdtype(s.dtype, np.bool_)
  File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood

In [29]: type(s.dtype) is type(catdt)
Out[29]: False

In [30]: type(s.dtype), type(catdt)
Out[30]: (pandas.core.common.CategoricalDtype, type)

In [31]: type(np.bool_)
Out[31]: type
```
So if we did not use an object but the class, it would work?
Pandas did just go ahead and totally break tons of unrelated code here :-( It's pretty ugly to say that even simple checks that have nothing to do with categoricals (like "is this a boolean?"), and which have been written in the same way for as long as numpy has existed, suddenly must always check for categoricals explicitly.
E.g. patsy has 12 calls to `np.issubdtype`.
Also, sorry for comment spam, but just noticed this while writing a safe issubdtype wrapper: the suggestion of replacing the one check with the other is not correct in general.
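For what it's worth, a safe wrapper in that spirit (a sketch, under the assumption that swallowing the `TypeError` is acceptable) keeps `np.issubdtype`'s exact semantics for real numpy dtypes and only adds a fallback for dtypes numpy cannot interpret:

```python
import numpy as np

def safe_issubdtype(dtype, supertype):
    """np.issubdtype, but returning False instead of raising TypeError
    for dtypes numpy does not understand (e.g. pandas' CategoricalDtype)."""
    try:
        return bool(np.issubdtype(dtype, supertype))
    except TypeError:
        return False

class FakeExtensionDtype:
    """Stand-in for a non-numpy extension dtype."""

print(safe_issubdtype(np.dtype(bool), np.bool_))          # True
print(safe_issubdtype(np.dtype("float64"), np.floating))  # True
print(safe_issubdtype(FakeExtensionDtype(), np.bool_))    # False
```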
@njsmith Do you see other possible solutions to this problem, apart from making numpy aware of the categorical dtype?
Is it actually possible to extend numpy dtypes (by writing C code) from outside numpy? The other way I see is to really take every numpy dtype function, test if it handles the new dtype correctly, and if not, write a wrapper and monkey-patch numpy when pandas is imported... Or write a real categorical numpy array... :-/ https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/doc/newdtype_example/floatint.c
you need to write something like

in your code
I suppose the numpy-friendly way to have handled this would have been to use
I am not sure if the issue I am running into below is identical to this one or not. Here are the steps that raise the `TypeError: data type not understood` when it encounters the class 'pandas.core.dtypes.dtypes.CategoricalDtype':
The TypeError raised in step 5 did not make it clear to me what the issue was, until I iterated through all the data types in step 6 and saw that it was choking on the 'CategoricalDtype'. I am not sure if this is the correct forum to provide such information, and I apologize in advance if it is not. If there is a better Pythonic way to do what I need to do, please let me know as well. Thank you!
This is fundamentally a numpy issue; IIRC I opened one a few years ago about this. In any event, pandas has lots of methods to introspect dtypes via the pandas.api.types namespace, so you shouldn't need to do this.
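Concretely, the `pandas.api.types` helpers look like this in recent pandas (note that `is_categorical_dtype` is deprecated in newer releases in favor of an `isinstance` check against `pd.CategoricalDtype`):

```python
import pandas as pd
from pandas.api import types as ptypes

cat = pd.Series(["a", "b", "a"], dtype="category")
boo = pd.Series([True, False])

# These helpers accept Series, arrays, and dtypes, and do not raise
# on extension dtypes the way np.issubdtype does.
print(isinstance(cat.dtype, pd.CategoricalDtype))  # True
print(ptypes.is_bool_dtype(cat))                   # False
print(ptypes.is_bool_dtype(boo))                   # True
```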
My understanding is that the way to check for dtypes (including custom ones) that works for both numpy and pandas and never raises is looking at their `type`:

```python
In [2]: issubclass(np.array([1, 2, 3]).dtype.type, np.integer)
Out[2]: True

In [3]: issubclass(np.array('abc').dtype.type, str)
Out[3]: True
```

... except that:

```python
In [4]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.CategoricalDtype)
Out[4]: False

In [5]: pd.Series('abc', dtype='category').dtype.type == pd.CategoricalDtype
Out[5]: False
```

because

```python
In [6]: pd.Series('abc', dtype='category').dtype.type.__mro__
Out[6]: (pandas.core.dtypes.dtypes.CategoricalDtypeType, type, object)
```

and indeed:

```python
In [7]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.core.dtypes.dtypes.CategoricalDtypeType)
Out[7]: True
```

Maybe that's something we can easily fix on the pandas side, so that the check doesn't require digging beyond the pandas API? What I mean is:

```python
In [7]: np.array('abc', dtype=np.dtype('<U').type)
Out[7]: array('abc', dtype='<U3')
```

while:

```python
In [8]: pd.Series('abc', dtype=pd.CategoricalDtype.type)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-b3184e65f353> in <module>
----> 1 pd.Series('abc', dtype=pd.CategoricalDtype.type)

~/nobackup/repo/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    237             data = {}
    238         if dtype is not None:
--> 239             dtype = self._validate_dtype(dtype)
    240
    241         if isinstance(data, MultiIndex):

~/nobackup/repo/pandas/pandas/core/generic.py in _validate_dtype(self, dtype)
    262
    263         if dtype is not None:
--> 264             dtype = pandas_dtype(dtype)
    265
    266         # a compound dtype

~/nobackup/repo/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1886         return npdtype
   1887     elif npdtype.kind == "O":
-> 1888         raise TypeError(f"dtype '{dtype}' not understood")
   1889
   1890     return npdtype

TypeError: dtype '<class 'pandas.core.dtypes.dtypes.CategoricalDtypeType'>' not understood
```

... that is, our distinction between a dtype and its type is something that numpy avoids, and maybe we can avoid too? (That's a real question, as I don't know the code well.) I would go as far as saying that if we can fix this, and consider the above as the recommended way for type checks, then we could maybe:
Related: #8814, which settled the question for that case.
This is a problem in pydata/patsy#47
Not sure if there is an easy way to get numpy to understand this (I've absolutely no numpy-fu :-/ ). If not, this means that every patsy/statsmodels method which does dtype magic has to guard against the category dtype :-/