Skip to content

BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bashtage opened this issue Jul 5, 2022 · 12 comments · Fixed by #124
Closed

BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

bashtage opened this issue Jul 5, 2022 · 12 comments · Fixed by #124

Comments

@bashtage
Copy link
Contributor

bashtage commented Jul 5, 2022

Describe the bug
Current overloads for accessing columns of a DataFrame are too restrictive

To Reproduce

from __future__ import annotations
import pandas as pd

df = pd.DataFrame({"a":[1,2,3],1:[3,4,5]})
key: list[int | str]
key = [1]
df[key]

returns

Invalid index type "List[Union[int, str]]" for "DataFrame"; expected type "Union[Tuple[Any, ...], Series[bool], DataFrame, List[str], List[Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta]], Index, ndarray[Any, dtype[str_]], ndarray[Any, dtype[bool_]], Sequence[Tuple[Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta], ...]]]"
``

I suspect this is a covariance issue (or invariance of `list`)

**Please complete the following information:**
- Windows 11
- Python 3.9

mypy 0.961
mypy-extensions 0.4.3
pandas 1.4.2
pandas-stubs 1.4.3.220704

@twoertwein
Copy link
Member

Ignoring non-list sequences: I think it might be sufficient to replace List[str] with list[HashableT] where HashableT = TypeVar("HashableT", bound=Hashable).

Ideally, we could simply use Sequence[Hashable] but a str is also a hashable sequence xref #82 (comment)

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 6, 2022

Ignoring non-list sequences: I think it might be sufficient to replace List[str] with list[HashableT] where HashableT = TypeVar("HashableT", bound=Hashable).

list[HashableT] would be too wide. Would accept any hashable type, and that would include things that are not acceptable to pandas (e.g., a list of numpy arrays)

I'm a bit surprised that a type of List[Union[int, str]] didn't match List[Scalar] that is defined for DataFrame.__getitem__(). This may require further investigation. Could be a mypy bug. Should check with pyright.

@bashtage
Copy link
Contributor Author

bashtage commented Jul 6, 2022

I don't think NumPy arrays are hashable.

>>> from typing import Hashable
>>> import numpy as np
>>> isinstance(np.array([1,2]),Hashable)
False

>>> hash(np.array([1,2]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 hash(np.array([1,2]))

TypeError: unhashable type: 'numpy.ndarray'

I'm also pretty sure that the individual column type is Hashable in that anything that can be hashed can be used as a column name, including strange things like None.

There are many references in _typing.py to Hashable in regards to column names.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 6, 2022

I don't think NumPy arrays are hashable.

From a dynamic typing perspective, they are not. But they might be from the perspective of static typing. I'm not sure.

There was a point in time (just last week) where I tried to change one of our arguments in an overloaded signature from something like List[str] to List[Hashable] and then mypy complained about overlapping signatures with a numpy array. Here's an example:

from typing import Hashable, List, TypeVar, Union
import numpy as np

tt: List[Union[int, str]]
tt = [1]


def func(a: List[Union[int, str, bool]]):
    print(a)


func(tt)

HashableT = TypeVar("HashableT", bound=Hashable)


def func2(a: List[HashableT]):
    print(a)


func2(tt)

nt = [np.array([1, 2, 3])]
func2(nt)

With the above code, mypy will not complain.
pyright will complain about the line func(tt), but will also allow func2(nt).

So my point here is that a list of numpy array matches List[HashableT].

@bashtage
Copy link
Contributor Author

bashtage commented Jul 6, 2022

So my point here is that a list of numpy array matches List[HashableT].

I think this can only be a bug in pyright. If you look at NumPy there is no mention of Hashable relating to typing. IMO pandas typing should be as accurate as possible, which might mean calling badly behaved type checking.

@bashtage
Copy link
Contributor Author

bashtage commented Jul 6, 2022

I'm a bit surprised that a type of List[Union[int, str]] didn't match List[Scalar] that is defined for DataFrame.__getitem__(). This may require further investigation. Could be a mypy bug. Should check with pyright.

I'm pretty sure this is due to the invariance of list. list[str | int] is different from list[int] and list[str]. A covariant container is needed, which is usually Sequence, although this doesn't work due to str behavior.

@twoertwein
Copy link
Member

The type annotations of NumPy declare ndarray incorrectly as hashable, I opened a bug at numpy: numpy/numpy#21930

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 6, 2022

I think this can only be a bug in pyright. If you look at NumPy there is no mention of Hashable relating to typing. IMO pandas typing should be as accurate as possible, which might mean calling badly behaved type checking.

It's an issue with mypy as well. In my example, mypy is also allowing that code to pass (i.e., passing a List[np.ndarray] into List[HashableT])

@twoertwein
Copy link
Member

A covariant container is needed, which is usually Sequence, although this doesn't work due to str behavior.

After numpy fixes their __hash__ annotation (it is not an issue with mypy/pyright), we could use list[HashableT]. While list is not covariant, mypy/pyright treat unbound TypeVar in (most) containers as covariant.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 6, 2022

I'm pretty sure this is due to the invariance of list. list[str | int] is different from list[int] and list[str]. A covariant container is needed, which is usually Sequence, although this doesn't work due to str behavior.

Yes, so the question is if we make any changes here.

If you remove the line key: list[int | str] in your example, then it gets past mypy, but not pyright because of the invariance of list. We could add List[int] to

but that wouldn't cover the case of df[[1, "abc"]] (which would be quite unusual anyway to have a mix of types in the column names)

Another idea to try is to change in _typing.pyi, Scalar = Union[int, str, etc...] to Scalar = TypeVar("Scalar", bound=Union[int, str, etc...]) . Based on my read on the python docs, that will mean that any subtype of the Union would match. In fact, modifying my example above, this passes mypy and pyright:

from typing import Hashable, List, TypeVar, Union
import numpy as np

BigType = TypeVar("BigType", bound=Union[int, str, bool])

tt: List[Union[int, str]]
tt = [1]


def func(a: List[BigType]):
    print(a)

I may look at that later, or hoping someone else can beat me to it!

@BvB93
Copy link

BvB93 commented Jul 6, 2022

I recall that the there were a few mypy related bugs to Hashable, hence why dictionary key-types still aren’t declared as Hashable subtypes (xref python/typeshed#4638). It would be wise to double check wether these issues could be relevant here before committing to Hashable.

@twoertwein
Copy link
Member

Pandas-stubs has almost the same issue as numpy (only difference is that the return type is somehow lost):

import pandas as pd

pd.DataFrame().__hash__ is None  # True
reveal_type(pd.DataFrame().__hash__)
mypy: Revealed type is "def () -> Any"
pyright: Type of "pd.DataFrame().__hash__" is "() -> Unknown"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants