ENH: Add support for dtype string aliases to Series#astype #556

Merged
5 commits merged into pandas-dev:main on Mar 1, 2023


Contributor

@skatsuta skatsuta commented Mar 1, 2023

  • Closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added: Please use assert_type() to assert the type of any return value

Problem

Series#astype was improved in #519 so that the result type is inferred from the value of the dtype argument.
I really appreciate the hard work that went into it.

However, that PR did not cover the string aliases that can be used to specify a dtype.
As a result, starting with v1.5.3.230227, type checkers report errors for code like the following:

Sample code

import pandas as pd
s = pd.Series([1, 2, 3])
s_nullable_int = s.astype("Int64")
s_nullable_float = s.astype("Float64")

Errors reported by Pyright

error: Argument of type "Literal['Int64']" cannot be assigned to parameter "dtype" of type "ExtensionDtype" in function "astype"
    "Literal['Int64']" is incompatible with "ExtensionDtype" (reportGeneralTypeIssues)
error: Argument of type "Literal['Float64']" cannot be assigned to parameter "dtype" of type "ExtensionDtype" in function "astype"
    "Literal['Float64']" is incompatible with "ExtensionDtype" (reportGeneralTypeIssues)

Specifying a dtype as a string is valid in pandas, so the stubs need to be fixed so that code like the above no longer triggers these errors.

Solution

Add support for various dtype string aliases to (Boolean|Int|Str|Float|Complex)DtypeArg types, and add tests for passing each of those aliases as arguments to Series#astype.

Along with this, I have also refactored the stubs and tests to group code for the same dtype together so that it is easier to see what kind of dtype arguments are supported.
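The idea described in the Solution, adding string aliases to a DtypeArg type so that overloads on astype accept them, can be sketched roughly as follows. This is a minimal illustration only: `ToySeries` is a hypothetical stand-in rather than the pandas-stubs class, and the alias members shown are a small assumed subset (only "Int64", "Float64", "int", "int32" and "float" appear in this thread).

```python
from typing import Any, Generic, Literal, TypeVar, Union, overload

T = TypeVar("T")

# Hypothetical, trimmed-down aliases; the real stubs list many more members.
IntDtypeArg = Union[type[int], Literal["int", "int32", "Int64"]]
FloatDtypeArg = Union[type[float], Literal["float", "Float64"]]


class ToySeries(Generic[T]):
    """Hypothetical stand-in for a Series stub, not the pandas-stubs API."""

    @overload
    def astype(self, dtype: IntDtypeArg) -> "ToySeries[int]": ...
    @overload
    def astype(self, dtype: FloatDtypeArg) -> "ToySeries[float]": ...
    def astype(self, dtype: Any) -> "ToySeries[Any]":
        # Runtime body is a placeholder; only the overloads matter to a checker.
        return ToySeries()


s: ToySeries[int] = ToySeries()
nullable_int = s.astype("Int64")      # matches the first overload
nullable_float = s.astype("Float64")  # matches the second overload
```

Because each string alias sits in the same alias as the corresponding type object, a type checker can pick the overload, and therefore the result type, from the literal string alone.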

@skatsuta changed the title from "ENH: Add support for dtype string aliases and missing types to Series#astype" to "ENH: Add support for dtype string aliases to Series#astype" on Mar 1, 2023
Collaborator

@Dr-Irv Dr-Irv left a comment


Thanks @skatsuta. I had a feeling that someone would pick up many of the possibilities that were missed when we put this together.

@Dr-Irv Dr-Irv merged commit d2faa7f into pandas-dev:main Mar 1, 2023
@skatsuta
Contributor Author

skatsuta commented Mar 1, 2023

@Dr-Irv Thanks for merging quickly! Your hunch was right on the money ;)

Will this fix be released in a few days?
Since this is effectively a regression and is likely to affect existing projects, including the one I'm working on, it might be good to release this patch sooner than usual.

@skatsuta skatsuta deleted the fix/series-astype-nullable-dtypes branch March 1, 2023 12:48
@Dr-Irv
Collaborator

Dr-Irv commented Mar 1, 2023

I should be able to do a release by Friday 3/3, if not earlier.

@zmoon
Contributor

zmoon commented Mar 1, 2023

Looks like 'category' is still missing?

@Dr-Irv
Collaborator

Dr-Irv commented Mar 1, 2023

> Looks like 'category' is still missing?

No, it's there.

CategoryDtypeArg: TypeAlias = Literal["category"]

@zmoon
Contributor

zmoon commented Mar 1, 2023

Weird. Using v1.5.3.230227, which includes that alias, I got this message:

error: Argument 1 to "astype" of "DataFrame" has incompatible type "Literal['category']"; expected "Union[Union[Union[Type[bool], Type[bool_], BooleanDtype, Literal['bool']], Union[Type[int], Int8Dtype, Int16Dtype, Int32Dtype, Int64Dtype, Type[signedinteger[_8Bit]], Type[signedinteger[_16Bit]], Type[signedinteger[_32Bit]], Type[signedinteger[_64Bit]], Type[unsignedinteger[_8Bit]], Type[unsignedinteger[_16Bit]], Type[unsignedinteger[_32Bit]], Type[unsignedinteger[_64Bit]], Type[signedinteger[Any]], Type[unsignedinteger[Any]], Type[signedinteger[Any]], Type[unsignedinteger[Any]], Literal['int', 'int32']], Union[Type[str], StringDtype, Literal['str']], Type[bytes], Union[Float32Dtype, Float64Dtype, Type[floating[_16Bit]], Type[floating[_32Bit]], Type[floating[_64Bit]], Type[float], Literal['float']], Union[Type[complexfloating[_32Bit, _32Bit]], Type[complexfloating[_64Bit, _64Bit]], Type[complex], Literal['complex']], CategoricalDtype, ExtensionDtype, Literal['timedelta64[ns]', 'datetime64[ns]']], Mapping[Any, Union[ExtensionDtype, Union[str, dtype[generic], Type[str], Type[complex], Type[bool], Type[object]]]], Series[Any]]" [arg-type]

I guess because CategoryDtypeArg is missing from AstypeArg.

AstypeArg: TypeAlias = (
    BooleanDtypeArg
    | IntDtypeArg
    | StrDtypeArg
    | BytesDtypeArg
    | FloatDtypeArg
    | ComplexDtypeArg
    | TimedeltaDtypeArg
    | TimestampDtypeArg
    | CategoricalDtype
    | ExtensionDtype
)
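To illustrate why an alias that exists but is not part of the union is still rejected, here is a small sketch using only the standard `typing` module. The aliases are trimmed-down, hypothetical versions of the ones quoted above, `accepted_strings` is a helper invented for this example, and the "after" union is just one possible fix, not necessarily the change that was eventually made.

```python
from typing import Literal, Union, get_args, get_origin

# Trimmed-down, hypothetical versions of the aliases quoted above.
BooleanDtypeArg = Literal["bool"]
StrDtypeArg = Literal["str"]
CategoryDtypeArg = Literal["category"]

# As in the thread: CategoryDtypeArg is defined but not part of the union.
AstypeArgBefore = Union[BooleanDtypeArg, StrDtypeArg]
# One possible fix (an assumption): add the missing alias to the union.
AstypeArgAfter = Union[AstypeArgBefore, CategoryDtypeArg]


def accepted_strings(arg_type: object) -> set[str]:
    """Recursively collect the string literals that an alias accepts."""
    if get_origin(arg_type) is Literal:
        return {a for a in get_args(arg_type) if isinstance(a, str)}
    found: set[str] = set()
    for member in get_args(arg_type):
        found |= accepted_strings(member)
    return found


print("category" in accepted_strings(AstypeArgBefore))
print("category" in accepted_strings(AstypeArgAfter))
```

A type checker performs the equivalent membership check statically, which is why defining `CategoryDtypeArg` alone does not make `astype("category")` type-check on a method whose parameter is annotated with the union.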

@Dr-Irv
Collaborator

Dr-Irv commented Mar 1, 2023

> I guess because CategoryDtypeArg is missing from AstypeArg.

Yes, you are correct. Can you open up a new issue for DataFrame.astype("category") and maybe follow up with a PR?

twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023
…dev#556)

* Add support for nullable integer data types to Series#astype

* Add support for nullable float data types to Series#astype

* Add support for nullable boolean data type to Series#astype

* Add support for nullable string data type to Series#astype

* Refactor dtype arg type aliases and add missing dtype aliases
3 participants