TYP: Numpy compatibility of definition of "array like" #41807

Open
Dr-Irv opened this issue Jun 3, 2021 · 16 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Typing type annotations, mypy/pyright type checking

Comments

@Dr-Irv
Contributor

Dr-Irv commented Jun 3, 2021

In pandas/_typing.py, we define ArrayLike as:

ArrayLike = Union["ExtensionArray", np.ndarray]

In the numpy glossary https://numpy.org/doc/stable/glossary.html?highlight=array_like, numpy defines array_like as:

Any scalar or sequence that can be interpreted as an ndarray. In addition to ndarrays and scalars this category includes lists (possibly nested and with different element types) and tuples. Any argument accepted by numpy.array is array_like.

Are we creating confusion by using the term ArrayLike to only mean arrays, whereas numpy defines it to include scalars?

This came up in terms of reconciling the arguments of np.searchsorted() and ExtensionArray.searchsorted().
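To make the quoted glossary definition concrete, here is a minimal sketch of what NumPy treats as array_like. The `candidates` dictionary and its labels are illustrative names only, not anything from pandas or NumPy.

```python
import numpy as np

# Every value below is accepted by np.array, so by the glossary
# definition NumPy considers all of them "array_like".
candidates = {
    "scalar": 3.5,
    "list": [1, 2, 3],
    "nested list": [[1, 2], [3, 4]],
    "tuple": (1, 2),
    "ndarray": np.arange(3),
}

# Number of dimensions NumPy infers for each input.
ndims = {name: np.asarray(obj).ndim for name, obj in candidates.items()}

for name, ndim in ndims.items():
    print(f"{name}: ndim={ndim}")
```

The scalar is converted to a 0-dimensional array, which is exactly the case that the pandas ArrayLike alias does not cover.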

Comments from @jbrockmendel and @simonjayhawkins welcome.

Dr-Irv added the Code Style, ExtensionArray, and Typing labels on Jun 3, 2021
@jbrockmendel
Member

Are we creating confusion by using the term ArrayLike to only mean arrays, whereas numpy defines it to include scalars?

I haven't seen this come up elsewhere. If anything, numpy's definition you quoted seems like it doesn't rule out much of anything.

@simonjayhawkins
Member

Any argument accepted by numpy.array is array_like.

we would probably want to see if that is consistent with our Series and DataFrame constructors and use it there.

in other parts of the code, we probably pass around EAs or numpy arrays so our alias is fine in those cases.

Some public methods may allow wider types than our ArrayLike alias. For those cases we probably use AnyArrayLike, which includes Series and Index, or add other allowable types. Some of the public methods where we do this may benefit from a wider alias than AnyArrayLike, and we could consider enhancing the API in places for compatibility with the numpy array-like definition.

For return types, annotations should be as precise as possible, so a wide array-like alias has no relevance there. The return type of searchsorted should be restricted to the types actually returned, rather than array-like to match the numpy docs.

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 4, 2021

Are we creating confusion by using the term ArrayLike to only mean arrays, whereas numpy defines it to include scalars?

I haven't seen this come up elsewhere. If anything, numpy's definition you quoted seems like it doesn't rule out much of anything.

That's my point. I was confused because when I saw numpy use the term array_like, I thought it really meant an array, but it doesn't.

So we have methods like searchsorted() and repeat() in ExtensionArray that we say are "similar" to numpy, which relies on the numpy definition of "array_like". Should we support what numpy calls "array_like", or do we say that in pandas, ArrayLike means "it's an array, not a scalar"?

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 4, 2021

we would probably want to see if that is consistent with our Series and DataFrame constructors and use it there.

For Series, we document it as "array-like, Iterable, dict, or scalar value". pd.Series(2) works. But we are explicit about that.
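As a small illustration of the four input kinds named in that docstring, the sketch below constructs a Series from each. The variable names are made up for this example.

```python
import pandas as pd

# One Series per input kind listed in the pd.Series docstring.
from_scalar = pd.Series(2)                            # scalar value
from_list = pd.Series([1, 2, 3])                      # array-like (a list)
from_iterable = pd.Series(x * x for x in range(3))    # Iterable (a generator)
from_dict = pd.Series({"a": 1, "b": 2})               # dict: keys become the index

for label, series in [
    ("scalar", from_scalar),
    ("list", from_list),
    ("iterable", from_iterable),
    ("dict", from_dict),
]:
    print(label, series.tolist(), list(series.index))
```

Because the docstring lists scalar and Iterable separately from array-like, the constructor accepts all of them, but the documentation treats them as distinct categories.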

My question is whether we continue in this direction, i.e.,

  • numpy "array-like" means an array, Iterable, or scalar
  • pandas "array-like" means an array (ExtensionArray or np.ndarray), but does NOT include other Iterable types

Are we confusing users because of numpy's definition being different than ours? I'm not a fan of their definition, but I don't think we can change that.

@simonjayhawkins
Member

Are we confusing users because of numpy's definition being different than ours?

I think we use array-like in the public docstrings in many places. those should be in places where the accepted types are the same as numpy array-like. (with the possible exception of Scalar)

  • pandas "array-like" means an array (ExtensionArray or np.ndarray), but does NOT include other Iterable types

I don't think that's true. array-like in docstrings probably means that other iterable types are allowed. Do not confuse the doc-string types with the annotation aliases.

from https://pandas.pydata.org/pandas-docs/dev/development/contributing_docstring.html#section-3-parameters

If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified. If Any type that can be iterated is accepted, iterable can be used:
array-like
iterable

EAs should probably support the __array__ protocol so array-like in a docstring also implicitly includes EAs (we do support the protocol for internal EAs)
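A minimal sketch of what supporting the `__array__` protocol buys in practice, assuming the array returned by `pd.array` for a list of integers is one of the internal extension arrays referred to above:

```python
import numpy as np
import pandas as pd

# pd.array returns a pandas-provided extension array for this input.
ea = pd.array([1, 2, 3])

# The object exposes __array__, so NumPy can convert it like any
# other array-like input.
has_protocol = hasattr(ea, "__array__")
converted = np.asarray(ea)

print(type(ea).__name__, has_protocol, converted.tolist())
```

The resulting dtype is deliberately not asserted here, since it may differ between pandas versions.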

The naming we use for the typing alias may not be an issue. I think when/if the types are published they are expanded and when mypy reports errors it also expands the alias.

so there shouldn't be confusion for users. but sure we could probably use an alias representing numpy array-like, that we use in public methods where we state in the docstring that the input is array-like (and check that we do actually support the types, well mypy should do that for us, and fix where needed)

where our docstring is more restrictive in the allowable types, but passes them onto a numpy function that accepts array-like or is a function/method with the same name/purpose as a numpy function, then we could also consider updating those (the second being more of a priority/nice to do than the first)

for EA, we can't widen the accepted types without changing the EA api, 3rd party authors would need notice to test/change their code.

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 4, 2021

  • pandas "array-like" means an array (ExtensionArray or np.ndarray), but does NOT include other Iterable types

I don't think that's true. array-like in docstrings probably means that other iterable types are allowed. Do not confuse the doc-string types with the annotation aliases.

I made that statement by inference. The docstring for pd.Series documents the parameter data as "array-like, Iterable, dict, or scalar value". So that implies to me that the pandas definition of "array-like" doesn't include iterable types.

from https://pandas.pydata.org/pandas-docs/dev/development/contributing_docstring.html#section-3-parameters

If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified. If Any type that can be iterated is accepted, iterable can be used:
array-like
iterable

And I'm questioning whether we are consistent through the docs, and when we say "array-like" in the docs, do we mean "numpy array-like" or "something like an array, but not a scalar" ?

EAs should probably support the __array__ protocol so array-like in a docstring also implicitly includes EAs (we do support the protocol for internal EAs)

The naming we use for the typing alias may not be an issue. I think when/if the types are published they are expanded and when mypy reports errors it also expands the alias.

I think it might be an issue for pandas contributors to understand the difference. My confusion arose because when I saw numpy say "array-like", I thought it was similar to our type ArrayLike, but it's not because of the numpy definition.

so there shouldn't be confusion for users. but sure we could probably use an alias representing numpy array-like, that we use in public methods where we state in the docstring that the input is array-like (and check that we do actually support the types, well mypy should do that for us, and fix where needed)

So are you saying that when the pandas docs say "array-like", we should take that to mean "numpy array-like" ?

where our docstring is more restrictive in the allowable types, but passes them onto a numpy function that accepts array-like or is a function/method with the same name/purpose as a numpy function, then we could also consider updating those (the second being more of a priority/nice to do than the first)

for EA, we can't widen the accepted types without changing the EA api, 3rd party authors would need notice to test/change their code.

So this is where the confusion is. I don't know how we are defining "array-like" in the EA API

  • For ExtensionArray.fillna(), the value parameter is documented as "scalar, array-like". Does "array-like" mean "numpy array-like" or something else? If it means "numpy array-like", shouldn't we then delete "scalar" from the doc here?
  • For ExtensionArray.searchsorted(), the value parameter is documented as "array-like". Does "array-like" mean "numpy array-like" or something else?
  • For ExtensionArray.repeat(), the repeats parameter is documented as "int or array of ints". For numpy, the parameter is "array-like". Do we change the docs here to correspond to numpy "array-like"?

There may be other examples as well. These are the ones I've dealt with.
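For reference, this is how the NumPy side of the comparison behaves. The sketch only exercises np.searchsorted and np.repeat and makes no claim about what the ExtensionArray methods accept; the variable names are illustrative.

```python
import numpy as np

sorted_values = np.array([10, 20, 30, 40])

# np.searchsorted accepts a scalar (0-D) or a sequence (1-D) for the
# values to insert, and the result has the same dimensionality.
scalar_pos = np.searchsorted(sorted_values, 25)
array_pos = np.searchsorted(sorted_values, [5, 25, 45])

# np.repeat accepts a single int or one count per element.
same_count = np.repeat([1, 2], 2)
per_element = np.repeat([1, 2], [1, 3])

print(scalar_pos, array_pos, same_count, per_element)
```

In both functions the scalar form and the array form fall under NumPy's single "array_like" label, which is the ambiguity the three bullets above are asking about.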

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 9, 2021

Based on dev discussion on 6/9/2021, pandas will define "array-like" to mean "something like an array", which does not include scalars. We need to check that the docs, including the examples and the documented API, use the term consistently.

@jbrockmendel
Member

something like an array

Maybe "sequence with dtype attribute"? i think that excludes 0-dim ndarrays (which i view as desirable)

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 9, 2021

something like an array

Maybe "sequence with dtype attribute"? i think that excludes 0-dim ndarrays (which i view as desirable)

No, because in many cases, we don't require the dtype. E.g. pd.Series([1,2,3]), and we are calling a list [1,2,3] "array-like"

@BvB93

BvB93 commented Jun 10, 2021

Based on dev discussion on 6/9/2021, pandas will define "array-like" to mean "something like an array"

Another option would be to explicitly include the dimensionality of the array-like object, e.g. >=1D array-like, 2D array-like, etc.
If so desired, this would allow you to remain compatible with numpy's definition,
as scalars and 0D arrays specifically fall under the category of 0D array-likes.
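The dimensionality idea can be sketched with np.ndim. The helper name `is_at_least_1d` is hypothetical and exists only for this illustration; it is not a pandas or NumPy function.

```python
import numpy as np


def is_at_least_1d(obj) -> bool:
    """Return True when NumPy would interpret obj as having one or more dimensions.

    Hypothetical helper: it expresses ">=1D array-like" in terms of np.ndim.
    """
    return np.ndim(obj) >= 1


# Scalars and 0-D arrays are 0D array-likes; lists and ndarrays are >=1D.
checks = {
    "python int": is_at_least_1d(5),
    "0-D ndarray": is_at_least_1d(np.array(5)),
    "list": is_at_least_1d([1, 2, 3]),
    "2-D list": is_at_least_1d([[1], [2]]),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

Under this reading, the pandas meaning of "array-like" would correspond to the inputs for which the helper returns True.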

@simonjayhawkins
Member

simonjayhawkins commented Jun 11, 2021

so there shouldn't be confusion for users. but sure we could probably use an alias representing numpy array-like, that we use in public methods where we state in the docstring that the input is array-like (and check that we do actually support the types, well mypy should do that for us, and fix where needed)

see scikit-learn/scikit-learn#16705 (comment)

we could maybe just have the alias for numpy.typing in our pandas._typing, import numpy.typing as npt (and re-export, done implicitly) and then when we use the numpy types (aliases) they are prefixed with npt. in the function signature to improve clarity (i.e. npt.ArrayLike). (The aliases get expanded for messages etc, so the alias is less relevant).

maybe this would also simplify #41185. we could gradually replace NpDtype -> npt.DTypeLike instead of trying to migrate in one go.
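A rough sketch of how the npt. prefix reads in a signature. The function `to_float_array` is a hypothetical example and not part of pandas; only npt.ArrayLike and npt.DTypeLike come from numpy.typing.

```python
import numpy as np
import numpy.typing as npt


def to_float_array(
    values: npt.ArrayLike, dtype: npt.DTypeLike = np.float64
) -> np.ndarray:
    """Hypothetical function: the npt. prefix signals NumPy's wide definitions."""
    return np.asarray(values, dtype=dtype)


# npt.ArrayLike admits scalars as well as sequences.
from_scalar = to_float_array(3)
from_list = to_float_array([1, 2])
as_int32 = to_float_array([1, 2], dtype="int32")

print(from_scalar.ndim, from_list.dtype, as_int32.dtype)
```

With the prefix, a reader can tell at a glance that the parameter follows NumPy's definition rather than the narrower pandas ArrayLike alias.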

@simonjayhawkins
Member

we could maybe just have the alias for numpy.typing in our pandas._typing, import numpy.typing as npt (and re-export, done implicitly) and then when we use the numpy types (aliases) they are prefixed with npt. in the function signature to improve clarity (i.e. npt.ArrayLike). (The aliases get expanded for messages etc, so the alias is less relevant).

maybe this would also simplify #41185. we could gradually replace NpDtype -> npt.DTypeLike instead of trying to migrate in one go.

opened #41945

@bluenote10

Is this issue basically addressing the problem that the following does not type check? (Or should we track that in a separate issue?)

import numpy.typing as npt
import pandas as pd

def takes_arraylike(a: npt.ArrayLike) -> None:
    ...

takes_arraylike(pd.Series([1, 2, 3]).values)
takes_arraylike(pd.Series([1, 2, 3]).array)

I would have expected these two expressions to type check properly, i.e., compatibility between the returned ExtensionArray and numpy.typing.ArrayLike. Currently this fails to type check with:

error: Argument 1 to "takes_arraylike" has incompatible type "ExtensionArray | ndarray[Any, Any]"; expected "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]"  [arg-type]
error: Argument 1 to "takes_arraylike" has incompatible type "ExtensionArray"; expected "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]"  [arg-type]

Interestingly, takes_arraylike(pd.Series([1, 2, 3])) itself does type check!

@Dr-Irv
Contributor Author

Dr-Irv commented Oct 6, 2023

Is this issue basically addressing the problem that the following does not type check?

Your example doesn't type check because values is ExtensionArray | np.ndarray while array is ExtensionArray, and ExtensionArray doesn't support all of the protocols that npt.ArrayLike does.

@bluenote10

and ExtensionArray doesn't support all of the protocols that npt.ArrayLike does.

... which comes as a surprise considering that npt.ArrayLike is a rather broad type that supports a variety of things, and one would intuitively expect pd.Series.values and pd.Series.array to be amongst them, no?

So this means it is actually possible to produce a runtime error by passing a pd.Series.array or pd.Series.values into a function that takes array: npt.ArrayLike? Prior to type annotations I always thought it is safe to assume that they are "sufficiently array like" to be used in places where numpy arrays are expected. I'm wondering what the code would have to do to fail the assumption at runtime?

@Dr-Irv
Contributor Author

Dr-Irv commented Oct 6, 2023

... which comes as a surprise considering that npt.ArrayLike is a rather broad type that supports a variety of things, and one would intuitively expect pd.Series.values and pd.Series.array to be amongst them, no?

The issue here is that npt.ArrayLike has protocols that pd.Series.values and pd.Series.array do not support, so from a typing perspective, there is not a match. So npt.ArrayLike is "wider" than ExtensionArray.

So this means it is actually possible to produce a runtime error by passing a pd.Series.array or pd.Series.values into a function that takes array: npt.ArrayLike?

Yes, a function/method that accepts npt.ArrayLike might be dependent on one of those protocols that are not supported by ExtensionArray. pandas would have to do one of the following for this to work:

  • Add __array__() to ExtensionArray.
  • Add __reversed__(), count, and index to ExtensionArray, and change the return type of ExtensionArray.__contains__() to bool instead of bool | np.bool_.
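The first option can be illustrated with a structural check. Everything in this sketch is hypothetical: `SupportsArray`, `WithArray`, and `WithoutArray` are stand-ins written for this example, not the actual numpy.typing or pandas classes, and a runtime isinstance check is only a loose analogue of what a static type checker does.

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class SupportsArray(Protocol):
    """Hypothetical stand-in for a protocol that requires __array__."""

    def __array__(self, dtype=None, copy=None): ...


class WithArray:
    """Toy container that declares __array__."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        return np.array(self._data, dtype=dtype)


class WithoutArray:
    """Toy container that does not declare __array__."""

    def __init__(self, data):
        self._data = list(data)


# Structural matching only looks at which methods the class declares.
matches = isinstance(WithArray([1, 2]), SupportsArray)
no_match = isinstance(WithoutArray([1, 2]), SupportsArray)

print(matches, no_match)
```

A class matches the protocol only once the method is declared on it, which is why adding `__array__()` to the base class is listed as one way to make the types line up.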

Prior to type annotations I always thought it is safe to assume that they are "sufficiently array like" to be used in places where numpy arrays are expected. I'm wondering what the code would have to do to fail the assumption at runtime?

The differences here are due to subtleties in how the typing is done in numpy and pandas. In addition, this goes back to the original issue I created - the words ArrayLike used in numpy and pandas as well as the phrase "array-like" in the pandas documentation are not precise, and between numpy and pandas, they are not the same when it comes to their definitions.


7 participants