BUG: astyping Categorical to nullable integer dtype #39616

Closed
arw2019 opened this issue Feb 5, 2021 · 4 comments · Fixed by #45167
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays

@arw2019
Member

arw2019 commented Feb 5, 2021

This astype operation works for int64 but raises for Int64. It should work the same for both.

In [16]: import numpy as np
    ...: import pandas as pd
    ...: 
    ...: dtype = pd.CategoricalDtype([str(i) for i in range(5)])
    ...: arr = pd.Categorical.from_codes(np.random.randint(5, size=20), dtype=dtype)
    ...: arr
Out[16]: 
['4', '1', '0', '1', '4', ..., '3', '0', '1', '1', '0']
Length: 20
Categories (5, object): ['0', '1', '2', '3', '4']

In [17]: arr.astype("int64")
Out[17]: array([4, 1, 0, 1, 4, 0, 1, 0, 3, 0, 3, 0, 1, 4, 2, 3, 0, 1, 1, 0])

In [18]: arr.astype("Int64")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-440524b22c6b> in <module>
----> 1 arr.astype("Int64")

~/repos/pandas/pandas/core/arrays/categorical.py in astype(self, dtype, copy)
    456         # TODO: consolidate with ndarray case?
    457         elif is_extension_array_dtype(dtype):
--> 458             result = array(self, dtype=dtype, copy=copy)
    459 
    460         elif is_integer_dtype(dtype) and self.isna().any():

~/repos/pandas/pandas/core/construction.py in array(data, dtype, copy)
    292     if is_extension_array_dtype(dtype):
    293         cls = cast(ExtensionDtype, dtype).construct_array_type()
--> 294         return cls._from_sequence(data, dtype=dtype, copy=copy)
    295 
    296     if dtype is None:

~/repos/pandas/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    306         cls, scalars, *, dtype: Optional[Dtype] = None, copy: bool = False
    307     ) -> IntegerArray:
--> 308         values, mask = coerce_to_array(scalars, dtype=dtype, copy=copy)
    309         return IntegerArray(values, mask)
    310 

~/repos/pandas/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    170             "mixed-integer-float",
    171         ]:
--> 172             raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    173 
    174     elif is_bool_dtype(values) and is_integer_dtype(dtype):

TypeError: object cannot be converted to an IntegerDtype
@arw2019 arw2019 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 5, 2021
@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 10, 2021
@jorisvandenbossche
Member

Yes, that's a bug. It could be fixed in IntegerArray._from_sequence for now (although ideally it would be fixed by a more general casting mechanism, cf. #22384).

@OlehKSS
Contributor

OlehKSS commented Feb 11, 2021

Hi, I have done a small investigation of this bug. I am not familiar with your type system, so I will need some suggestions from you on how to proceed.
The problem seems to come down to the following:

  1. On this line I have found that pandas_dtype(dtype) returns numpy.dtype('int64') for dtype='int64' and pandas.core.arrays.integer.Int64Dtype for dtype='Int64'. I suppose it was intended. Although, due to this difference we have two different paths for the Categorical.astype method execution.
  2. If the returned type is dtype('int64') , values will be converted to an integer numpy array.
  3. In the other case, dtype=Int64Dtype, the IntegerArray._from_sequence method will be called. The above TypeError is raised because the integer.coerce_to_array function cannot convert a list of string values into a list of integers.

How should it be solved?

  • Should the dtype parameter become type-insensitive, i.e. treat Int64 and int64 as the same type?
  • Should IntegerArray._from_sequence and integer.coerce_to_array be changed? E.g. On this line replace values = np.array(values, copy=copy) with values = np.array(values, dtype=dtype, copy=copy)? In that case we'll get an instance of IntegerArray type as the output value of the astype method.
  • Should the lib.infer_dtype be able to convert a list of the numeric strings as integers/floats?

@jorisvandenbossche
Member

Thanks for taking a look!

1. On this line I have found that pandas_dtype(dtype) returns numpy.dtype('int64') for dtype='int64' and pandas.core.arrays.integer.Int64Dtype for dtype='Int64'. I suppose it was intended. Although, due to this difference we have two different paths for the Categorical.astype method execution.

Yes, that is expected (see https://pandas.pydata.org/docs/dev/user_guide/integer_na.html for some explanation about this, the case sensitivity is intended to differentiate between numpy's dtype and our nullable dtype)
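
The case sensitivity can be checked directly. The following is a minimal sketch using the public pandas.api.types.pandas_dtype helper; the variable names are only illustrative and are not taken from the pandas code base.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

# Lowercase "int64" resolves to a plain NumPy dtype.
numpy_dtype = pandas_dtype("int64")
# Capitalized "Int64" resolves to the pandas nullable extension dtype.
nullable_dtype = pandas_dtype("Int64")

is_numpy = isinstance(numpy_dtype, np.dtype)
is_nullable = isinstance(nullable_dtype, pd.Int64Dtype)
print(type(numpy_dtype), type(nullable_dtype))
```

Because the two strings resolve to different kinds of dtype objects, Categorical.astype dispatches to different code paths for them.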

3. The above TypeError is raised because the integer.coerce_to_array function cannot convert a list of string values into a list of integers.

Ah, I didn't notice above that these were integer-like strings. So in that case, the issue can actually be simplified to the following case that also fails, i.e. creating an integer array from strings:

>>> pd.array(np.array(['1', '2'], dtype=object), dtype="Int64")
...
~/scipy/pandas/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    170             "mixed-integer-float",
    171         ]:
--> 172             raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    173 
    174     elif is_bool_dtype(values) and is_integer_dtype(dtype):

TypeError: object cannot be converted to an IntegerDtype

>>> pd.array(np.array(['1', '2']), dtype="Int64")
...
~/scipy/pandas/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    176 
    177     elif not (is_integer_dtype(values) or is_float_dtype(values)):
--> 178         raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
    179 
    180     if mask is None:

TypeError: <U1 cannot be converted to an IntegerDtype

So I think the first question we need to decide on is if we want to support converting strings to integers like that in general. I think the answer is yes (numpy supports it, and so we already support it for the non-nullable dtypes as well, eg pd.Series(['1', '2'], dtype="int64") works)
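
For comparison, here is a small sketch of the string-to-integer conversion that already works for the non-nullable dtype. It uses astype rather than the Series constructor mentioned above, purely as an illustration; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# NumPy converts a fixed-width string array to integers directly.
strings = np.array(["1", "2"])
as_ints = strings.astype("int64")

# pandas does the same for the non-nullable dtype.
ser = pd.Series(["1", "2"]).astype("int64")

print(as_ints, ser.dtype)
```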

Should IntegerArray._from_sequence and integer.coerce_to_array be changed?

This can be updated in coerce_to_array

Should the lib.infer_dtype be able to convert a list of the numeric strings as integers/floats?

infer_dtype itself shouldn't infer that, I think (it should infer that they are strings). But in the case that we infer that the values are strings (or mixed), we can attempt to convert them to ints (e.g. using numpy's astype(int)).
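
A rough sketch of that idea, written outside pandas with public API only. The helper name coerce_strings_to_int64 is hypothetical, and this is not the actual change to coerce_to_array; it only illustrates inferring "string" and then attempting a NumPy conversion while tracking missing values in a mask.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def coerce_strings_to_int64(values):
    """Hypothetical helper: build an Int64 array from integer-like strings."""
    arr = np.asarray(values, dtype=object)
    mask = pd.isna(arr)  # True where the value is missing

    # Only attempt the conversion when the non-missing values are all strings.
    if infer_dtype(arr, skipna=True) != "string":
        raise TypeError("expected a sequence of strings")

    data = np.zeros(len(arr), dtype="int64")
    # Go through a NumPy string dtype first, then parse to integers.
    # Non-numeric strings raise ValueError here.
    data[~mask] = arr[~mask].astype(str).astype("int64")
    return pd.arrays.IntegerArray(data, mask)


result = coerce_strings_to_int64(["1", None, "3"])
print(result)
```

Missing entries stay masked, so they come back as pd.NA rather than as a placeholder integer.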

@OlehKSS
Contributor

OlehKSS commented Feb 12, 2021

Thanks for the prompt reply. I have found out that besides IntegerArray._from_sequence there is IntegerArray._from_sequence_of_strings as well. Would it be better to call this function here? Something like

if is_extension_array_dtype(dtype):
    cls = cast(ExtensionDtype, dtype).construct_array_type()
    inferred_dtype = lib.infer_dtype(data, skipna=True)
    if inferred_dtype == "string":
        return cls._from_sequence_of_strings(data, dtype=dtype, copy=copy)
    else:
        return cls._from_sequence(data, dtype=dtype, copy=copy)

The problem with this particular case is that we need to find out the underlying data type of the CategoricalDtype, e.g. to infer that the categorical entries are strings. Also, not all extension classes have the ._from_sequence_of_strings method implemented, e.g. IntervalArray.
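
One possible way to inspect the underlying type of the categories, sketched with public API only. Whether this is the right place to hook into is an open question; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Same kind of Categorical as in the original report, with fixed codes.
dtype = pd.CategoricalDtype([str(i) for i in range(5)])
cat = pd.Categorical.from_codes([4, 1, 0, 3], dtype=dtype)

# The categories are stored as an Index, so their kind can be inferred
# without materializing the full array of values.
categories_kind = infer_dtype(cat.categories, skipna=True)
print(categories_kind)
```

For string categories like these, the inferred kind should be "string", which is the condition the snippet above branches on.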

Otherwise, it is possible to change integer.coerce_to_array to support strings, though it seems to duplicate IntegerArray._from_sequence_of_strings.
