BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

bashtage · 2022-07-05T18:39:04Z

Describe the bug
Current overloads for accessing columns of a DataFrame are too restrictive

To Reproduce

from __future__ import annotations
import pandas as pd

df = pd.DataFrame({"a":[1,2,3],1:[3,4,5]})
key: list[int | str]
key = [1]
df[key]

returns

Invalid index type "List[Union[int, str]]" for "DataFrame"; expected type "Union[Tuple[Any, ...], Series[bool], DataFrame, List[str], List[Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta]], Index, ndarray[Any, dtype[str_]], ndarray[Any, dtype[bool_]], Sequence[Tuple[Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta], ...]]]"
``

I suspect this is a covariance issue (or invariance of `list`)

**Please complete the following information:**
- Windows 11
- Python 3.9

mypy 0.961
mypy-extensions 0.4.3
pandas 1.4.2
pandas-stubs 1.4.3.220704

The text was updated successfully, but these errors were encountered:

twoertwein · 2022-07-05T18:50:17Z

Ignoring non-list sequences: I think it might be sufficient to replace List[str] with list[HashableT] where HashableT = TypeVar("HashableT", bound=Hashable).

Ideally, we could simply use Sequence[Hashable] but a str is also a hashable sequence xref #82 (comment)

Dr-Irv · 2022-07-06T02:52:55Z

Ignoring non-list sequences: I think it might be sufficient to replace List[str] with list[HashableT] where HashableT = TypeVar("HashableT", bound=Hashable).

list[HashableT] would be too wide. Would accept any hashable type, and that would include things that are not acceptable to pandas (e.g., a list of numpy arrays)

I'm a bit surprised that a type of List[Union[int, str]] didn't match List[Scalar] that is defined for DataFrame.__getitem__(). This may require further investigation. Could be a mypy bug. Should check with pyright.

bashtage · 2022-07-06T11:07:38Z

I don't think NumPy arrays are hashable.

>>> from typing import Hashable
>>> import numpy as np
>>> isinstance(np.array([1,2]),Hashable)
False

>>> hash(np.array([1,2]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 hash(np.array([1,2]))

TypeError: unhashable type: 'numpy.ndarray'

I'm also pretty sure that the individual column type is Hashable in that anything that can be hashed can be used as a column name, including strange things like None.

There are many references in _typing.py to Hashable in regards to column names.

Dr-Irv · 2022-07-06T12:12:28Z

I don't think NumPy arrays are hashable.

From a dynamic typing perspective, they are not. But they might be from the perspective of static typing. I'm not sure.

There was a point in time (just last week) where I tried to change one of our arguments in an overloaded signature from something like List[str] to List[Hashable] and then mypy complained about overlapping signatures with a numpy array. Here's an example:

from typing import Hashable, List, TypeVar, Union
import numpy as np

tt: List[Union[int, str]]
tt = [1]


def func(a: List[Union[int, str, bool]]):
    print(a)


func(tt)

HashableT = TypeVar("HashableT", bound=Hashable)


def func2(a: List[HashableT]):
    print(a)


func2(tt)

nt = [np.array([1, 2, 3])]
func2(nt)

With the above code, mypy will not complain.
pyright will complain about the line func(tt), but will also allow func2(nt).

So my point here is that a list of numpy array matches List[HashableT].

bashtage · 2022-07-06T15:34:01Z

So my point here is that a list of numpy array matches List[HashableT].

I think this can only be a bug in pyright. If you look at NumPy there is no mention of Hashable relating to typing. IMO pandas typing should be as accurate as possible, which might mean calling badly behaved type checking.

bashtage · 2022-07-06T15:36:32Z

I'm a bit surprised that a type of List[Union[int, str]] didn't match List[Scalar] that is defined for DataFrame.__getitem__(). This may require further investigation. Could be a mypy bug. Should check with pyright.

I'm pretty sure this is due to the invariance of list. list[str | int] is different from list[int] and list[str]. A covariant container is needed, which is usually Sequence, although this doesn't work due to str behavior.

twoertwein · 2022-07-06T15:44:50Z

The type annotations of NumPy declare ndarray incorrectly as hashable, I opened a bug at numpy: numpy/numpy#21930

Dr-Irv · 2022-07-06T15:46:45Z

I think this can only be a bug in pyright. If you look at NumPy there is no mention of Hashable relating to typing. IMO pandas typing should be as accurate as possible, which might mean calling badly behaved type checking.

It's an issue with mypy as well. In my example, mypy is also allowing that code to pass (i.e., passing a List[np.ndarray] into List[HashableT])

twoertwein · 2022-07-06T15:59:09Z

A covariant container is needed, which is usually Sequence, although this doesn't work due to str behavior.

After numpy fixes their __hash__ annotation (it is not an issue with mypy/pyright), we could use list[HashableT]. While list is not covariant, mypy/pyright treat unbound TypeVar in (most) containers as covariant.

Dr-Irv · 2022-07-06T16:12:33Z

I'm pretty sure this is due to the invariance of list. list[str | int] is different from list[int] and list[str]. A covariant container is needed, which is usually Sequence, although this doesn't work due to str behavior.

Yes, so the question is if we make any changes here.

If you remove the line key: list[int | str] in your example, then it gets past mypy, but not pyright because of the invariance of list. We could add List[int] to

pandas-stubs/pandas-stubs/core/frame.pyi

Line 366 in 451233b

List[_str],

but that wouldn't cover the case of df[[1, "abc"]] (which would be quite unusual anyway to have a mix of types in the column names)

Another idea to try is to change in _typing.pyi, Scalar = Union[int, str, etc...] to Scalar = TypeVar("Scalar", bound=Union[int, str, etc...]) . Based on my read on the python docs, that will mean that any subtype of the Union would match. In fact, modifying my example above, this passes mypy and pyright:

from typing import Hashable, List, TypeVar, Union
import numpy as np

BigType = TypeVar("BigType", bound=Union[int, str, bool])

tt: List[Union[int, str]]
tt = [1]


def func(a: List[BigType]):
    print(a)

I may look at that later, or hoping someone else can beat me to it!

BvB93 · 2022-07-06T16:25:01Z

I recall that the there were a few mypy related bugs to Hashable, hence why dictionary key-types still aren’t declared as Hashable subtypes (xref python/typeshed#4638). It would be wise to double check wether these issues could be relevant here before committing to Hashable.

twoertwein · 2022-07-06T16:46:15Z

Pandas-stubs has almost the same issue as numpy (only difference is that the return type is somehow lost):

import pandas as pd

pd.DataFrame().__hash__ is None  # True
reveal_type(pd.DataFrame().__hash__)

mypy: Revealed type is "def () -> Any"
pyright: Type of "pd.DataFrame().__hash__" is "() -> Unknown"

twoertwein mentioned this issue Jul 7, 2022

TYP: Series/DataFrame/Index are not Hashable #113

Merged

2 tasks

Dr-Irv mentioned this issue Jul 10, 2022

Allow subtypes of List[Scalar] by introducing ScalarArg #124

Merged

2 tasks

twoertwein closed this as completed in #124 Jul 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

bashtage commented Jul 5, 2022

twoertwein commented Jul 5, 2022

Dr-Irv commented Jul 6, 2022

bashtage commented Jul 6, 2022 •

edited

Loading

Dr-Irv commented Jul 6, 2022

bashtage commented Jul 6, 2022

bashtage commented Jul 6, 2022

twoertwein commented Jul 6, 2022

Dr-Irv commented Jul 6, 2022

twoertwein commented Jul 6, 2022

Dr-Irv commented Jul 6, 2022

BvB93 commented Jul 6, 2022

twoertwein commented Jul 6, 2022

BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

BUG: Overloads for accessing columns of a DataFrame are too restrictive #97

Comments

bashtage commented Jul 5, 2022

twoertwein commented Jul 5, 2022

Dr-Irv commented Jul 6, 2022

bashtage commented Jul 6, 2022 • edited Loading

Dr-Irv commented Jul 6, 2022

bashtage commented Jul 6, 2022

bashtage commented Jul 6, 2022

twoertwein commented Jul 6, 2022

Dr-Irv commented Jul 6, 2022

twoertwein commented Jul 6, 2022

Dr-Irv commented Jul 6, 2022

BvB93 commented Jul 6, 2022

twoertwein commented Jul 6, 2022

bashtage commented Jul 6, 2022 •

edited

Loading