BUG: astyping Categorical to nullable integer dtype #39616

arw2019 · 2021-02-05T17:49:36Z

This astype operation works for int64 but raises for Int64. It should work the same for both:

In [16]: import numpy as np
    ...: import pandas as pd
    ...: 
    ...: dtype = pd.CategoricalDtype([str(i) for i in range(5)])
    ...: arr = pd.Categorical.from_codes(np.random.randint(5, size=20), dtype=dtype)
    ...: arr
Out[16]: 
['4', '1', '0', '1', '4', ..., '3', '0', '1', '1', '0']
Length: 20
Categories (5, object): ['0', '1', '2', '3', '4']

In [17]: arr.astype("int64")
Out[17]: array([4, 1, 0, 1, 4, 0, 1, 0, 3, 0, 3, 0, 1, 4, 2, 3, 0, 1, 1, 0])

In [18]: arr.astype("Int64")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-440524b22c6b> in <module>
----> 1 arr.astype("Int64")

~/repos/pandas/pandas/core/arrays/categorical.py in astype(self, dtype, copy)
    456         # TODO: consolidate with ndarray case?
    457         elif is_extension_array_dtype(dtype):
--> 458             result = array(self, dtype=dtype, copy=copy)
    459 
    460         elif is_integer_dtype(dtype) and self.isna().any():

~/repos/pandas/pandas/core/construction.py in array(data, dtype, copy)
    292     if is_extension_array_dtype(dtype):
    293         cls = cast(ExtensionDtype, dtype).construct_array_type()
--> 294         return cls._from_sequence(data, dtype=dtype, copy=copy)
    295 
    296     if dtype is None:

~/repos/pandas/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    306         cls, scalars, *, dtype: Optional[Dtype] = None, copy: bool = False
    307     ) -> IntegerArray:
--> 308         values, mask = coerce_to_array(scalars, dtype=dtype, copy=copy)
    309         return IntegerArray(values, mask)
    310 

~/repos/pandas/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    170             "mixed-integer-float",
    171         ]:
--> 172             raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    173 
    174     elif is_bool_dtype(values) and is_integer_dtype(dtype):

TypeError: object cannot be converted to an IntegerDtype


jorisvandenbossche · 2021-02-10T14:31:09Z

Yes, that's a bug. It could be fixed in IntegerArray._from_sequence for now (although ideally would be fixed by a more general casting mechanism, cfr #22384)

OlehKSS · 2021-02-11T23:19:56Z

Hi, I have made a small investigation of this bug. I am not familiar with your type system, so I will need some suggestions from you on how I should proceed to solve it.
The problem here seems two-fold:

1. On this line I have found that pandas_dtype(dtype) returns numpy.dtype('int64') for dtype='int64' and pandas.core.arrays.integer.Int64Dtype for dtype='Int64'. I suppose it was intended. Although, due to this difference we have two different paths for the Categorical.astype method execution.
2. If the returned type is dtype('int64'), values will be converted to an integer numpy array.
3. In the other case, dtype=Int64Dtype, the IntegerArray._from_sequence method will be called. The above TypeError is raised because the integer.coerce_to_array function cannot convert a list of string values into a list of integers.
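The two paths described above can be seen with a minimal sketch, assuming only the public helpers in pandas.api.types (the variable names are illustrative, not taken from the pandas source):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype, pandas_dtype

# Lowercase "int64" resolves to a plain numpy dtype.
numpy_dtype = pandas_dtype("int64")

# Capitalised "Int64" resolves to the nullable extension dtype.
nullable_dtype = pandas_dtype("Int64")

# Only the nullable dtype takes the extension-array branch in astype.
print(repr(numpy_dtype), is_extension_array_dtype(numpy_dtype))
print(repr(nullable_dtype), is_extension_array_dtype(nullable_dtype))
```

Since is_extension_array_dtype is true only for the second dtype, only that call ends up in IntegerArray._from_sequence.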

How should it be solved?

1. Should the dtype parameter become type-insensitive, i.e. treat Int64 and int64 as the same type?
2. Should IntegerArray._from_sequence and integer.coerce_to_array be changed? E.g. On this line replace values = np.array(values, copy=copy) with values = np.array(values, dtype=dtype, copy=copy)? In that case we'll get an instance of IntegerArray type as the output value of the astype method.
3. Should the lib.infer_dtype be able to convert a list of the numeric strings as integers/floats?

jorisvandenbossche · 2021-02-12T16:29:05Z

Thanks for taking a look!

1. On this line I have found that pandas_dtype(dtype) returns numpy.dtype('int64') for dtype='int64' and pandas.core.arrays.integer.Int64Dtype for dtype='Int64'. I suppose it was intended. Although, due to this difference we have two different paths for the Categorical.astype method execution.

Yes, that is expected (see https://pandas.pydata.org/docs/dev/user_guide/integer_na.html for some explanation about this, the case sensitivity is intended to differentiate between numpy's dtype and our nullable dtype)

3. The above TypeError is raised because the integer.coerce_to_array function cannot convert a list of string values into a list of integers.

Ah, I didn't notice above that they were integer-like strings. So in that case, the issue can actually be simplified to the following case that also fails, i.e. creating an integer array from strings:

>>> pd.array(np.array(['1', '2'], dtype=object), dtype="Int64")
...
~/scipy/pandas/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    170             "mixed-integer-float",
    171         ]:
--> 172             raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    173 
    174     elif is_bool_dtype(values) and is_integer_dtype(dtype):

TypeError: object cannot be converted to an IntegerDtype

>>> pd.array(np.array(['1', '2']), dtype="Int64")
...
~/scipy/pandas/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    176 
    177     elif not (is_integer_dtype(values) or is_float_dtype(values)):
--> 178         raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    179 
    180     if mask is None:

TypeError: <U1 cannot be converted to an IntegerDtype

So I think the first question we need to decide on is if we want to support converting strings to integers like that in general. I think the answer is yes (numpy supports it, and so we already support it for the non-nullable dtypes as well, eg pd.Series(['1', '2'], dtype="int64") works)
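For reference, the existing non-nullable behaviour mentioned here can be checked with a short sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

strings = ["1", "2"]

# numpy parses integer-like strings when casting a string array to an integer dtype.
np_result = np.array(strings).astype("int64")

# The non-nullable pandas path accepts the same input.
ser = pd.Series(strings, dtype="int64")

print(np_result)
print(ser)
```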

Should IntegerArray._from_sequence and integer.coerce_to_array be changed?

This can be updated in coerce_to_array

Should the lib.infer_dtype be able to convert a list of the numeric strings as integers/floats?

infer_dtype itself shouldn't infer that, I think (it should infer that they are strings). But in the case that we infer that the values are strings (or mixed), we can attempt to convert them to ints (eg using numpy's astype(int)).
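A rough sketch of that idea, outside of pandas internals, might look as follows. The helper strings_to_nullable_int is hypothetical (it is not part of pandas and not the change that was eventually merged); it only uses the public infer_dtype, pd.isna and pd.arrays.IntegerArray:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def strings_to_nullable_int(values):
    """Hypothetical helper (not pandas API): integer-like strings -> Int64 array."""
    arr = np.asarray(values, dtype=object)
    mask = pd.isna(arr)  # True where the value is missing (None, NaN, pd.NA)

    # infer_dtype only reports that these are strings; converting is a separate step.
    if infer_dtype(arr, skipna=True) != "string":
        raise TypeError("expected a sequence of strings")

    data = np.zeros(len(arr), dtype="int64")
    # Let numpy parse the non-missing strings; this raises ValueError for e.g. "abc".
    data[~mask] = arr[~mask].astype(str).astype("int64")
    return pd.arrays.IntegerArray(data, mask)


result = strings_to_nullable_int(["1", None, "3"])
print(result)
```

Missing entries are kept out of the numpy cast and tracked in the mask, so they come back as pd.NA in the result.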

OlehKSS · 2021-02-12T22:47:23Z

Thanks for the prompt reply. I have found out that besides IntegerArray._from_sequence there is IntegerArray._from_sequence_of_strings as well. Would it be better to call this function here? Something like

if is_extension_array_dtype(dtype):
    cls = cast(ExtensionDtype, dtype).construct_array_type()
    inferred_dtype = lib.infer_dtype(data, skipna=True)
    if inferred_dtype == "string":
        return cls._from_sequence_of_strings(data, dtype=dtype, copy=copy)
    else:
        return cls._from_sequence(data, dtype=dtype, copy=copy)

The problem with this particular case is that we need to find out the underlying data type of the CategoricalDtype, e.g. to infer that the categorical entries are strings. Also, not all extension classes have the ._from_sequence_of_strings method implemented, e.g. IntervalArray.

Otherwise, it is possible to change integer.coerce_to_array to support strings, though it seems to duplicate IntegerArray._from_sequence_of_strings.
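To illustrate the point about the categories, here is a user-level sketch, assuming only public API (it is a workaround, not a proposed patch, and the variable names are illustrative). It inspects the categories to learn that they are strings, converts only the categories to integers, and then rebuilds a nullable integer array from the codes:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

dtype = pd.CategoricalDtype([str(i) for i in range(5)])
# A code of -1 marks a missing value in a Categorical.
cat = pd.Categorical.from_codes([4, 1, 0, -1, 3], dtype=dtype)

# The element type lives on the categories, not on the Categorical itself.
categories_kind = infer_dtype(cat.categories, skipna=True)

# Convert the (few) categories once, then look up each code.
int_categories = np.asarray(pd.to_numeric(cat.categories), dtype="int64")
mask = cat.codes == -1
data = int_categories[cat.codes]  # the -1 positions hold a placeholder hidden by the mask
converted = pd.arrays.IntegerArray(data, mask)

print(categories_kind)
print(converted)
```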

arw2019 added the Bug and Needs Triage labels Feb 5, 2021

jorisvandenbossche added the ExtensionArray and NA - MaskedArrays labels and removed the Needs Triage label Feb 10, 2021

RagBlufThim mentioned this issue Apr 20, 2021: BUG: Conversion of Series dtype from object to Int16 etc. fails #41060 (Open)

jbrockmendel mentioned this issue Jan 2, 2022: TST: tests for nullable issues #45167 (Merged)

jreback added this to the 1.4 milestone Jan 3, 2022

jreback closed this as completed in #45167 Jan 3, 2022
