TYP: ExtensionArray.take accept np.ndarray #43418

twoertwein · 2021-09-05T15:08:33Z

closes TYP: ensure_platform_int probably needs a dtype annotation #43391 (np.ndarray is never a Sequence)
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

twoertwein · 2021-09-10T15:14:51Z

Using the SequenceIndexer from #41258. This means that Sequence[int] is not any more allowed (only List[int], slice, np.ndarray). I think any sequence should work for take, as all of them end up in np.asarray. Happy to allow Sequence[int] for take (but it would also be great to be consistent across methods: not allowing any sequence). I favor consistency.

@Dr-Irv

simonjayhawkins · 2021-09-11T13:32:51Z

in the base class the docstring for take is indices : sequence of int. This is our contract with 3rd party EA developers.

pandas/_typing.py

jreback · 2021-09-12T15:10:34Z

in the base class the docstring for take is indices : sequence of int. This is our contract with 3rd party EA developers.

does Sequence[int] include a suitable ndarray, e.g. ndarray[intp] (prob no?) if so we should extend .take to do this as its quite natural and IMHO expected.

simonjayhawkins · 2021-09-12T15:27:05Z

from https://pandas.pydata.org/pandas-docs/dev/development/contributing_docstring.html#parameter-types

If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified. If Any type that can be iterated is accepted, iterable can be used:

array-like
iterable

having different rules for the docstring parameter types and the type annotations is bound to create confusion but IIUC we want to keep the docstring rules for consistency with the ecosystem.

There is no definition of the term "sequence" in the docstring guide.

np.ndarray is not an instance of abc.Sequence or type compatible with the typing protocol.

so if we want to annotate using Sequence we need to Union with np.ndarray (and Series, Index, EA too in most cases)

see #28770 for related issue.

simonjayhawkins · 2021-09-12T16:02:50Z

I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion.

see also #32773, #32773 (comment), #32773 (comment) and #32033 for discussions relating to avoiding the numpydoc type definitions for internal docstrings to therefore avoid the discrepancies (not relevant here as the base class docstrings are our api definitions for 3rd party EA developers)

if we could sometime in the future use the standard type notation in the public dosctrings that would be great (and redundant)

twoertwein · 2021-09-12T17:27:25Z

I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion.

Think we have two options then:

Use Sequence[int] | np.ndarray and in the future add shape information to the np.ndarray when that becomes standardized in numpy
Keep the current state (close this PR)

simonjayhawkins · 2021-09-12T17:43:01Z

no need to close. accepting np.ndarray for take as Jeff says is sort of a given since take is effectively a numpy function.

numpy.take accepts array-like which can be any dimension. It only takes in 1 dimension but will populate a multidimensional array if the indices is mutli-dimensional. EAs a predominantly 1d (causes issues with __array_function__ protocol extending the EA interface to be 2d is another topic)

The goal is to add types such that we don't add code in the main codebase that does not pass mypy silently when we break the EA contract with 3rd party EA developers (we effectively broke the contract in #41258 by adding list[int]: A list of int to the accepted types for __getitem__, 3rd party code could break)

so here okay to add np.ndarray, but not okay to add slice, 3rd party EAs aren't expecting a slice. If we add to the type annotations and then pass a slice from the main code, mypy won't report any issues.

simonjayhawkins · 2021-09-12T17:49:31Z

In general, type annotations should be as permissive as possible. we can't do that when typing the EA methods as we can't allow wider types than the 3rd party EA developers are expecting.

Dr-Irv · 2021-09-12T17:53:38Z

I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion.

I would question whether take should accept Sequence[int] or just List[int], because I believe that Sequence[int] would include a tuple of ints, and if we want to be consistent between what .iloc and .loc do, those don't accept tuples.

So maybe take should only accept in its signature List[int] | np.ndarray[int] ??

simonjayhawkins · 2021-09-12T18:00:22Z

So maybe take should only accept in its signature List[int] | np.ndarray[int] ??

narrowing the types would not break 3rd party EAs but could break user code. We're quite constrained here without changing the EA api.

to be fair though EA.take is not part of the public api for users.

Index.take and Series.take could continue to be permissive and the user passed arg could be cast before dispatching to the EA.

Dr-Irv · 2021-09-12T18:17:53Z

narrowing the types would not break 3rd party EAs but could break user code. We're quite constrained here without changing the EA api.

I think the question is what defines the EA API? Is it the signature in the code? Is it what is in the docs as the signature? Is it what is in the docs that is in the docstring?

If you go to https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.take.html?highlight=take#pandas.api.extensions.ExtensionArray.take

the docs don't define the types for the argument indices, even though the signature in the source says Sequence[int] .

If the signature in the code defines the API, are we expecting people to read those signatures by looking directly at pandas source? Shouldn't the docs be the definition of the API?

Finally, I don't think changing the signature would break user code - it might just break it if they did type checking on their code.

simonjayhawkins · 2021-09-12T18:55:00Z

I think the question is what defines the EA API? Is it the signature in the code? Is it what is in the docs as the signature? Is it what is in the docs that is in the docstring?

none of the above. only jesting but probably the EA base tests should be in this list.

If you go to https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.take.html?highlight=take#pandas.api.extensions.ExtensionArray.take

the docs don't define the types for the argument indices

it's defined in the parameter types section of the docstring. Type annotations across the codebase are incomplete and are not yet part of the official api. (IIRC we had a discussion in the past about removing the type annotations from the published docs)

Finally, I don't think changing the signature would break user code - it might just break it if they did type checking on their code.

sure. but once we change the type annotations that's what mypy checks the code against for consistency. The point of inline type annotations is to prevent type related errors being introduced and latent bugs to be discovered. Adding type annotations as a sort of second level of documentation of what we would like the types to be renders the typing exercise pointless IMHO.

So if we change the annotations it won't immediately break user code (if no code changes are required to get mypy to green). given. but subsequent code changes could and we won't have mypy to stop us.

twoertwein · 2021-09-15T00:22:11Z

I think, except for maybe finding a better name for TakeIndexer = Union[Sequence[int], np.ndarray], this PR should be ready. I also checked all the affected take docs strings, they all define the type as sequence of int.

simonjayhawkins

Thanks @twoertwein lgtm, suggested change to alias

numpy definitions are currently

    @overload
    def take(  # type: ignore[misc]
        self: ndarray[Any, dtype[_ScalarType]],
        indices: _IntLike_co,
        axis: Optional[SupportsIndex] = ...,
        out: None = ...,
        mode: _ModeKind = ...,
    ) -> _ScalarType: ...
    @overload
    def take(  # type: ignore[misc]
        self,
        indices: _ArrayLikeInt_co,
        axis: Optional[SupportsIndex] = ...,
        out: None = ...,
        mode: _ModeKind = ...,
    ) -> ndarray[Any, _DType_co]: ...
    @overload
    def take(
        self,
        indices: _ArrayLikeInt_co,
        axis: Optional[SupportsIndex] = ...,
        out: _NdArraySubClass = ...,
        mode: _ModeKind = ...,
    ) -> _NdArraySubClass: ...

_ArrayLikeInt_co = _ArrayLike[
    "dtype[Union[bool_, integer[Any]]]",
    Union[bool, int],
]

_ArrayLike = Union[
    _NestedSequence[_SupportsArray[_DType]],
    _NestedSequence[_T],
]

# TODO: Wait for support for recursive types
_NestedSequence = Union[
    _T,
    Sequence[_T],
    Sequence[Sequence[_T]],
    Sequence[Sequence[Sequence[_T]]],
    Sequence[Sequence[Sequence[Sequence[_T]]]],
]

# The `_SupportsArray` protocol only cares about the default dtype
# (i.e. `dtype=None` or no `dtype` parameter at all) of the to-be returned
# array.
# Concrete implementations of the protocol are responsible for adding
# any and all remaining overloads
class _SupportsArray(Protocol[_DType_co]):
    def __array__(self) -> ndarray[Any, _DType_co]: ...

_IntLike_co = Union[_BoolLike_co, int, np.integer]

_BoolLike_co = Union[bool, np.bool_]

except for maybe finding a better name for TakeIndexer = Union[Sequence[int], np.ndarray]

so the numpy take indexer is Union[_IntLike_co, _ArrayLikeInt_co] and i'm not sure if numpy have a specific alias for this.

in time (with an update to the EA api), we may want to make EA.take more aligned with numpy.take, but for now we don't accept _Intlike_co?

so that just leaves the _ArrayLikeInt_co bit and we don't allow 2d (or nested) and don't allow bool in the take indexer either.

we already have a ArrayLike alias that is just np.array and ExtensionArray and an AnyArrayLike alias that also includes Index and Series but not Sequence.

So we probably can't use something like _ArrayLikeInt_co yet without extra confusion. xref #41807

IMO TakeIndexer is okay naming for now. (with the expectation that we will align our interface and naming conventions with NumPy over time)

pandas/_typing.py

Co-authored-by: Simon Hawkins <[email protected]>

pandas/core/arrays/base.py

pandas/_typing.py

simonjayhawkins

Thanks @twoertwein lgtm.

(I think I saw some imports from pandas._typing inside TYPE_CHECKING blocks slip through in some recent EA typing PRs. So maybe some other places where the type definitions are not properly guarded)

jreback added the Typing type annotations, mypy/pyright type checking label Sep 6, 2021

twoertwein mentioned this pull request Sep 6, 2021

TYP: Typing for ExtensionArray.__getitem__ #41258

Merged

jreback added this to the 1.4 milestone Sep 10, 2021

jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 10, 2021

Dr-Irv reviewed Sep 11, 2021

View reviewed changes

pandas/_typing.py Outdated Show resolved Hide resolved

simonjayhawkins approved these changes Sep 15, 2021

View reviewed changes

pandas/_typing.py Outdated Show resolved Hide resolved

twoertwein and others added 11 commits September 16, 2021 08:55

TYP: ExtensionArray.take accept np.ndarray

416a1a1

remove unused import

cf463df

NonScalarPositionalIndexer

a974905

use SequenceIndexer

65adf47

include Sequence

e296f0e

unused import

eba3f07

Sequence and ndarray

a05592b

Update pandas/_typing.py

d2aadd4

Co-authored-by: Simon Hawkins <[email protected]>

no mixed sequences

3c83003

forgot a comma

e651017

guard npt

1076a7e

fix

d31e1a9

Dr-Irv reviewed Sep 17, 2021

View reviewed changes

pandas/core/arrays/base.py Show resolved Hide resolved

twoertwein added 2 commits September 20, 2021 12:40

change docstring

f823378

use TYPE_CHECKING

58ac439

simonjayhawkins reviewed Sep 22, 2021

View reviewed changes

pandas/_typing.py Outdated Show resolved Hide resolved

use Any for TakeIndexer when not type checking

c9a02c0

simonjayhawkins approved these changes Sep 22, 2021

View reviewed changes

simonjayhawkins merged commit dd7158b into pandas-dev:master Sep 26, 2021

twoertwein deleted the take branch June 8, 2022 19:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TYP: ExtensionArray.take accept np.ndarray #43418

TYP: ExtensionArray.take accept np.ndarray #43418

twoertwein commented Sep 5, 2021 •

edited

Loading

twoertwein commented Sep 10, 2021 •

edited

Loading

simonjayhawkins commented Sep 11, 2021

jreback commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

twoertwein commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

Dr-Irv commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

Dr-Irv commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

twoertwein commented Sep 15, 2021

simonjayhawkins left a comment

simonjayhawkins left a comment

TYP: ExtensionArray.take accept np.ndarray #43418

TYP: ExtensionArray.take accept np.ndarray #43418

Conversation

twoertwein commented Sep 5, 2021 • edited Loading

twoertwein commented Sep 10, 2021 • edited Loading

simonjayhawkins commented Sep 11, 2021

jreback commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

twoertwein commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

Dr-Irv commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

Dr-Irv commented Sep 12, 2021

simonjayhawkins commented Sep 12, 2021

twoertwein commented Sep 15, 2021

simonjayhawkins left a comment

Choose a reason for hiding this comment

simonjayhawkins left a comment

Choose a reason for hiding this comment

twoertwein commented Sep 5, 2021 •

edited

Loading

twoertwein commented Sep 10, 2021 •

edited

Loading