Skip to content

Support an ExtensionDtype and ExtensionArray #554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 27, 2023

Conversation

Dr-Irv
Copy link
Collaborator

@Dr-Irv Dr-Irv commented Feb 25, 2023

While working on #519 , I wanted to test that we could pass a user-created ExtensionDtype and ExtensionArray . So I copied the testing code from pandas that creates a DecimalArray, and it revealed a number of issues with our type declarations.

So this PR addresses that. Not everything is really tested in terms of all the ExtensionArray declarations, but the goal here is to see that a user-defined ExtensionArray will pass typing checks.

@Dr-Irv Dr-Irv requested a review from twoertwein February 25, 2023 21:40
@property
def type(self) -> type_t: ...
type: type_t
na_value: Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

na_value seems to be a property (but the return value can be Any for some classes)

Copy link
Collaborator Author

@Dr-Irv Dr-Irv Feb 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out while pandas implements them as properties, from a typing perspective, they are really class variables, so I annotated them that way in the next commit.

Throughout pandas, we just set type = xxx and name = yyy etc. in the subclasses, so changing them to ClassVar makes that work for typing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be surpsrised if they are used inconsistently within pandas :( in some places as properties, in some as normal attributes and in some as class variables. I just grepped through pandas/core/dtypes, I didn't find a single line that assigns a value to a class variable na_value or type. It seems consistently to be a property.

If they are class variables, can you add a quick isinstance(ClassName.na_value, ...) test.

If the pandas tests set it for simplicity/laziness to a class varaible (which is fine with me), we can just add an ignore to them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be surpsrised if they are used inconsistently within pandas :( in some places as properties, in some as normal attributes and in some as class variables. I just grepped through pandas/core/dtypes, I didn't find a single line that assigns a value to a class variable na_value or type. It seems consistently to be a property.

Take a look at pandas/core/arrays/integer.py. You'll see assignments of type = for the various Integer dtypes.

I believe it's declared as a property in pandas/core/dtypes/base.py so that if someone doesn't define it, it raises AbstractMethodError. But it is used internally as a ClassVar.

For na_value, it is inconsistent, but based on the example for DecimalDtype, it was treated as a class variable there as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left name and type as ClassVar, but changed na_value to be a property, because it has a default value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at pandas/core/arrays/integer.py. You'll see assignments of type = for the various Integer dtypes.

I wounder why mypy/pyright didn't catch that. Maybe because a parent class re-declares type as Any or because of the decorators that mypy doesn't understand. Seems like something that should be fixed in pandas.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wounder why mypy/pyright didn't catch that. Maybe because a parent class re-declares type as Any or because of the decorators that mypy doesn't understand. Seems like something that should be fixed in pandas.

I was wondering the same thing.

In reality, type and name are abstract class properties of ExtensionDtype, but the type checkers aren't supporting that concept. See long discussions here: microsoft/pyright#2601 and python/mypy#8993

@twoertwein twoertwein merged commit 76f7bc4 into pandas-dev:main Feb 27, 2023
@twoertwein
Copy link
Member

Thanks @Dr-Irv !

Comment on lines -169 to +170
Axis: TypeAlias = str | int
AxisInt: TypeAlias = Literal[0, 1]
Axis: TypeAlias = AxisInt | Literal["index", "columns", "rows"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein @Dr-Irv While I understand the intention, this one causes a problem with the following types from Pandas:

pandas.core.frame.DataFrame.reorder_levels(self, order: Sequence[Axis], axis: Axis = 0)

The order parameter expects a sequence of ints or strings - with this change this is forbidden.

From my code:

reportquery.py:612: error: Argument 1 to
"reorder_levels" of "DataFrame" has incompatible type "Sequence[str]"; expected
"Sequence[Union[Literal[0, 1], Literal['index', 'columns', 'rows']]]"
[arg-type]
            df = df.reorder_levels(column_names, axis=1)

PS: Sorry for not creating an issue (yet). I am short on time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: Sorry for not creating an issue (yet). I am short on time.

When you do have the time, please create an issue with a full code sample.

This has picked up that reorder_levels() has an incorrect type for order. It should be Sequence[Hashable]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we go: #560

twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023
* Support an ExtensionDtype and ExtensionArray

* update TypeGuard usage.  Test astype for Decimal

* use ClassVar.  Change is_float to accept numpy.  Remove self: Self

* remove declarations of private funcs. make na_value a property
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants