-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TYP: ExtensionArray.take accept np.ndarray #43418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
twoertwein
commented
Sep 5, 2021
•
edited
Loading
edited
- closes TYP: ensure_platform_int probably needs a dtype annotation #43391 (np.ndarray is never a Sequence)
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
Using the |
in the base class the docstring for take is |
does |
from https://pandas.pydata.org/pandas-docs/dev/development/contributing_docstring.html#parameter-types
having different rules for the docstring parameter types and the type annotations is bound to create confusion but IIUC we want to keep the docstring rules for consistency with the ecosystem. There is no definition of the term "sequence" in the docstring guide. np.ndarray is not an instance of abc.Sequence or type compatible with the typing protocol. so if we want to annotate using Sequence we need to Union with np.ndarray (and Series, Index, EA too in most cases) see #28770 for related issue. |
I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion. see also #32773, #32773 (comment), #32773 (comment) and #32033 for discussions relating to avoiding the numpydoc type definitions for internal docstrings to therefore avoid the discrepancies (not relevant here as the base class docstrings are our api definitions for 3rd party EA developers) if we could sometime in the future use the standard type notation in the public dosctrings that would be great (and redundant) |
Think we have two options then:
|
no need to close. accepting np.ndarray for take as Jeff says is sort of a given since numpy.take accepts array-like which can be any dimension. It only takes in 1 dimension but will populate a multidimensional array if the indices is mutli-dimensional. EAs a predominantly 1d (causes issues with __array_function__ protocol extending the EA interface to be 2d is another topic) The goal is to add types such that we don't add code in the main codebase that does not pass mypy silently when we break the EA contract with 3rd party EA developers (we effectively broke the contract in #41258 by adding so here okay to add np.ndarray, but not okay to add slice, 3rd party EAs aren't expecting a slice. If we add to the type annotations and then pass a slice from the main code, mypy won't report any issues. |
In general, type annotations should be as permissive as possible. we can't do that when typing the EA methods as we can't allow wider types than the 3rd party EA developers are expecting. |
I would question whether So maybe |
narrowing the types would not break 3rd party EAs but could break user code. We're quite constrained here without changing the EA api. to be fair though EA.take is not part of the public api for users. Index.take and Series.take could continue to be permissive and the user passed arg could be cast before dispatching to the EA. |
I think the question is what defines the EA API? Is it the signature in the code? Is it what is in the docs as the signature? Is it what is in the docs that is in the docstring? the docs don't define the types for the argument If the signature in the code defines the API, are we expecting people to read those signatures by looking directly at pandas source? Shouldn't the docs be the definition of the API? Finally, I don't think changing the signature would break user code - it might just break it if they did type checking on their code. |
none of the above. only jesting but probably the EA base tests should be in this list.
it's defined in the parameter types section of the docstring. Type annotations across the codebase are incomplete and are not yet part of the official api. (IIRC we had a discussion in the past about removing the type annotations from the published docs)
sure. but once we change the type annotations that's what mypy checks the code against for consistency. The point of inline type annotations is to prevent type related errors being introduced and latent bugs to be discovered. Adding type annotations as a sort of second level of documentation of what we would like the types to be renders the typing exercise pointless IMHO. So if we change the annotations it won't immediately break user code (if no code changes are required to get mypy to green). given. but subsequent code changes could and we won't have mypy to stop us. |
I think, except for maybe finding a better name for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @twoertwein lgtm, suggested change to alias
numpy definitions are currently
@overload
def take( # type: ignore[misc]
self: ndarray[Any, dtype[_ScalarType]],
indices: _IntLike_co,
axis: Optional[SupportsIndex] = ...,
out: None = ...,
mode: _ModeKind = ...,
) -> _ScalarType: ...
@overload
def take( # type: ignore[misc]
self,
indices: _ArrayLikeInt_co,
axis: Optional[SupportsIndex] = ...,
out: None = ...,
mode: _ModeKind = ...,
) -> ndarray[Any, _DType_co]: ...
@overload
def take(
self,
indices: _ArrayLikeInt_co,
axis: Optional[SupportsIndex] = ...,
out: _NdArraySubClass = ...,
mode: _ModeKind = ...,
) -> _NdArraySubClass: ...
_ArrayLikeInt_co = _ArrayLike[
"dtype[Union[bool_, integer[Any]]]",
Union[bool, int],
]
_ArrayLike = Union[
_NestedSequence[_SupportsArray[_DType]],
_NestedSequence[_T],
]
# TODO: Wait for support for recursive types
_NestedSequence = Union[
_T,
Sequence[_T],
Sequence[Sequence[_T]],
Sequence[Sequence[Sequence[_T]]],
Sequence[Sequence[Sequence[Sequence[_T]]]],
]
# The `_SupportsArray` protocol only cares about the default dtype
# (i.e. `dtype=None` or no `dtype` parameter at all) of the to-be returned
# array.
# Concrete implementations of the protocol are responsible for adding
# any and all remaining overloads
class _SupportsArray(Protocol[_DType_co]):
def __array__(self) -> ndarray[Any, _DType_co]: ...
_IntLike_co = Union[_BoolLike_co, int, np.integer]
_BoolLike_co = Union[bool, np.bool_]
except for maybe finding a better name for
TakeIndexer = Union[Sequence[int], np.ndarray]
so the numpy take indexer is Union[_IntLike_co, _ArrayLikeInt_co] and i'm not sure if numpy have a specific alias for this.
in time (with an update to the EA api), we may want to make EA.take more aligned with numpy.take, but for now we don't accept _Intlike_co?
so that just leaves the _ArrayLikeInt_co bit and we don't allow 2d (or nested) and don't allow bool in the take indexer either.
we already have a ArrayLike alias that is just np.array and ExtensionArray and an AnyArrayLike alias that also includes Index and Series but not Sequence.
So we probably can't use something like _ArrayLikeInt_co yet without extra confusion. xref #41807
IMO TakeIndexer is okay naming for now. (with the expectation that we will align our interface and naming conventions with NumPy over time)
Co-authored-by: Simon Hawkins <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @twoertwein lgtm.
(I think I saw some imports from pandas._typing inside TYPE_CHECKING blocks slip through in some recent EA typing PRs. So maybe some other places where the type definitions are not properly guarded)