TYP: try out TypeGuard #51309

jbrockmendel · 2023-02-10T18:36:27Z

This effectively makes mypy understand that is_integer(obj) is equivalent to isinstance(obj, (int, np.integer)). This is a mixed bag. It allows for getting rid of some redundant casts/ignores, but makes some new ones necessary. I'd be fine either way on this.

WillAyd

I think this is awesome. Even though it requires some fixes elsewhere I think it moves us in the right direction so +1 from me

WillAyd · 2023-02-10T22:19:29Z

pandas/core/indexes/multi.py

+                    # error: Incompatible types in assignment (expression has type
+                    # "Union[int, Any]", variable has type "Union[ndarray[Any,
+                    # dtype[signedinteger[Any]]], signedinteger[Any]]")
+                    loc -= 1  # type: ignore[assignment]


Might be a little rusty on my python typing but I think it just infers the value from the first appearing of a variable within a function. loc is up higher in the func from algos.searchsorted - would that ever return a scalar? If not we might just want to be explicit that loc can be an int as well

algos.searchsorted returns npt.NDArray[np.intp] | np.intp. A couple lines of we should have ruled out the ndarray case. so we have np.intp and its complaining that it isnt the same as int.

topper-123

I'm +1 on using TypeGuard, it can help with typing, but I'm probably -1 when there is a composite return type in the TypeGuard, if that just gives noise/ complications in other locations.

We also have to handle the situation when typing_extensions isn't installed and also use the built-in TypeGuard, if on python >= 3.10.

pandas/_libs/lib.pyi

topper-123 · 2023-02-10T22:05:26Z

pandas/_libs/lib.pyi

+def is_period(val: object) -> TypeGuard[Period]: ...
+def is_interval(val: object) -> TypeGuard[Interval]: ...
+def is_decimal(val: object) -> TypeGuard[Decimal]: ...
+def is_complex(val: object) -> TypeGuard[complex | np.complex_]: ...


This should be just TypeGuard[complex], as complext64 & complex128 are subtypes of the regular python type complex.

topper-123 · 2023-02-10T22:10:07Z

pandas/_libs/lib.pyi

+def is_decimal(val: object) -> TypeGuard[Decimal]: ...
+def is_complex(val: object) -> TypeGuard[complex | np.complex_]: ...
+def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
+def is_integer(val: object) -> TypeGuard[int | np.integer]: ...


I kind of dislike having composite types inside the type guard (because it probably gives some uglier code in other locations). Maybe only use TypeGuard when it gives a single type?

i get where youre coming from, but i also have an urge for "maximum accuracy" here. im going to keep it how it is for now but will be OK with changing if consensus develops

I would keep the Union (but it would be nicer without it). I think not having the Unions might cause some incorrect unreachable code warnings from mypy/pyright.

if is_integer(x): if isinstance(x, np.number): # I believe mypy would error here if it wouldn't return a Union x = x.item()

To some extent, this change is revealing an issue throughout the pandas code.

Sometimes, when we call is_integer(value) or is_bool(value), the intent is to check whether the value is an integer or numpy integer, or bool or numpy boolean. In other cases, we actually want to check that the value is NOT a numpy type (although I'm not 100% sure of that).

If it is the case that we accept a numpy integer/bool whenever we have typed an argument as int or bool, respectively, then the types used within the pandas functions should change to reflect that.

I think if we want to check that it is a python integer, we'd just do isinstance(x, int). My issue was more that by having a union type here, we'd have to have this union type in a lot of places, which doesn't feel like very clean code. My suggestion was to not type is_integer or other with union return types.

But ok, it was just a preference, I can also accept it, as overall this PR is a nice improvement.

My issue was more that by having a union type here, we'd have to have this usion type in a lot of places, which doesn't feel like very clean code.

Or for every place that such a change is necessary, it makes you analyze the code and determine whether you really want to change the test to isinstance(x, int) or modify other parts of the code knowing that a np.integer is valid.

pandas/_libs/lib.pyi

jbrockmendel · 2023-02-12T22:09:51Z

One place in util._validators where making pyright and mypy both pass at the same time is a hassle

/home/runner/work/pandas/pandas/pandas/util/validators.py:262:12 - error: Expression of type "bool | bool | int | BoolishNoneT@validate_bool_kwarg | None" cannot be assigned to return type "BoolishNoneT@validate_bool_kwarg"
Type "bool | bool_ | int | BoolishNoneT@validate_bool_kwarg | None" cannot be assigned to type "BoolishNoneT@validate_bool_kwarg" (reportGeneralTypeIssues)

jbrockmendel · 2023-02-12T22:10:44Z

Tried annotating Index._is_multi as TypeGuard[MultiIndex] hoping a bunch of existing ignores would become redundant, no luck

WillAyd · 2023-02-13T21:48:26Z

pandas/compat/numpy/function.py

@@ -222,7 +222,7 @@ def validate_cum_func_with_skipna(skipna, args, kwargs, name) -> bool:
        skipna = True

    validate_cum_func(args, kwargs, fname=name)
-    return skipna
+    return bool(skipna)


I'm not really a fan of explicitly creating new objects like this to appease typing; I think an ignore is preferable. Any idea how this gets out of whack in the first place?

I'm not really a fan of explicitly creating new objects like this to appease typing; I think an ignore is preferable

works for me

Any idea how this gets out of whack in the first place?

The return annotation wants bool but we could have np.bool_

Gotcha - yea I would just ignore for now then

WillAyd · 2023-02-13T21:48:58Z

pandas/core/frame.py

@@ -9936,6 +9932,7 @@ def _series_round(ser: Series, decimals: int):
            new_cols = list(_dict_round(self, decimals))
        elif is_integer(decimals):
            # Dispatch to Series.round
+            decimals = int(decimals)


Same comment here - don't think we should do the int construction

topper-123 · 2023-02-16T10:49:41Z

pandas/_typing.py

@@ -83,8 +83,11 @@
    # Name "npt._ArrayLikeInt_co" is not defined  [name-defined]
    NumpySorter = Optional[npt._ArrayLikeInt_co]  # type: ignore[name-defined]

+    from typing_extensions import TypeGuard  # pyright: reportUnusedImport = false
+


I'd import from standard lib if on python >= 3.10, else from typing_extensions.

This needs to take into account that typing_extensions may not be installed. All in all I'd probably do this something like this:

if PY310: from typing import TypeGuard else: try: from typing_extensions import TypeGuard except ImportError: TypeGuard = bool

Also add typing_extensions to requirements-dev.txt , I think it's possible to add it like

typing_extensions; python_version < '3.10'

does that work in an environment.yml file? requirements-dev.txt is auto-generated from the environment.yml.

checking for py310 requires a circular import which isnt great. im leaning towards abandoning this PR, would be happy if you'd like to get it across the finish line

Oh I didn't know it is autogenerated and I not knowledgeable with environment.yml to know if it`s possible there. I could take a look. I can also try take over, no problem.

topper-123 · 2023-02-16T10:58:35Z

pandas/core/indexes/api.py

+    objs,
+    intersect: bool = False,
+    axis: Axis = 0,
+    sort: bool | np.bool_ = True,


Maybe instead wrap arguments to this function in a bool, i.e. get_objs_combined_axis(..., sort=bool(sort), ...)?

If not that, maybe in in the _typing have union types, e.g. any_bool = Union[bool, np.bool_], any_int = union[int, np.int_] etc.?

topper-123 · 2023-02-16T12:48:58Z

pandas/_typing.py

 else:
    npt: Any = None
+    TypeGuard: Any = None


TypeGuard = bool? In my understanding a TypeGuard is like a bool that ensure that another has some type…

This should probably not be outside of the if TYPE_CHECKING section?

topper-123 · 2023-02-18T13:55:04Z

I've made a new version, could you take look?

I dislike union typeguard unless a clear reason, because they require casts or checks in other locations, but I've kept them anyway in this version. @jorisvandenbossche , what's your opinion on those, e.g. def is_integer(val: object) -> TypeGuard[int | np.integer]?

simonjayhawkins

Thanks @jbrockmendel

There is a discussion over in #51480 about typing_extensions that needs to be resolved first hence the request changes review.

Not looked in detail but from my attempts at using TypeGuard, agree that Union types can cause problems. (This was a while back and not up-to-date on best practice on this)

Seconding, agree with @WillAyd about not casting or otherwise changing code. #type:ignores and comments keep visibility of the typing issues/deficiencies.

For evolving/defining more precise types, pandas-stubs is now our playground so cc @Dr-Irv for comment

Otherwise generally +1

Dr-Irv · 2023-02-24T13:52:15Z

Not looked in detail but from my attempts at using TypeGuard, agree that Union types can cause problems. (This was a while back and not up-to-date on best practice on this)

Looking at the places in this PR where cast had to be done to handle cases where you wanted int and not np.integer, I think it would be better if we had more explicit is_* functions. So in _libs/lib.pyi, rather than

def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ...

you would have:

def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ...
def is_pybool(val: object) -> TypeGuard[bool]: ...
def is_pyinteger(val: object) -> TypeGuard[int]: ...

Then in pandas/core/generic.py, around line 4076, instead of

            loc, new_index = index._get_loc_level(key, level=0)
            if not drop_level:
                if lib.is_integer(loc):
                    loc = cast(int, loc)
                    new_index = index[loc : loc + 1]

you could remove the cast by doing

            loc, new_index = index._get_loc_level(key, level=0)
            if not drop_level:
                if lib.is_pyinteger(loc):
                    new_index = index[loc : loc + 1]

For evolving/defining more precise types, pandas-stubs is now our playground so cc @Dr-Irv for comment

I would alter that statement a bit by saying that pandas-stubs is meant to provide types for the public API, as opposed to being precise. For something like the is_* functions, they are for internal use, so if TypeGuard helps, use it.

As for a dependency on typing_extensions, I think this had to be done when we still supported python 3.7 and were using Literal in typing constructs. So if we were willing to use typing_extensions back then, we should be OK to use it now.

simonjayhawkins · 2023-02-24T14:58:50Z

As for a dependency on typing_extensions, I think this had to be done when we still supported python 3.7 and were using Literal in typing constructs. So if we were willing to use typing_extensions back then, we should be OK to use it now.

IIRC we never used typing_extensions and have only ever used workarounds.

Dr-Irv · 2023-02-24T18:09:32Z

As for a dependency on typing_extensions, I think this had to be done when we still supported python 3.7 and were using Literal in typing constructs. So if we were willing to use typing_extensions back then, we should be OK to use it now.

IIRC we never used typing_extensions and have only ever used workarounds.

Yes, I agree. I was mixing up some of our own application code with stuff in pandas. It seems that in pandas, we always surrounded an import of Literal with an if TYPE_CHECKING, and the CI would do type checking using python 3.8, even though we still supported python 3.7.

Which does suggest that maybe the solution to avoid using typing_extensions is to just copy the code for TypeGuard from typing_extensions, i.e.:

if sys.version_info >= (3, 10):
    from typing import TypeGuard
else:  # Copied from typing_extensions
    import typing

    class _TypeGuardForm(typing._SpecialForm, _root=True):
        def __repr__(self):
            return "typing_extensions." + self._name

    @_TypeGuardForm
    def TypeGuard(self, parameters):
        item = typing._type_check(parameters, f"{self} accepts only a single type.")
        return typing._GenericAlias(self, (item,))

I don't know if this will work, and you might need to add some #type: ignore and/or #pyright: ignore statements in there, but just copying what is in typing_extensions ought to work.

It might also be worth considering whether it is worth using python 3.10 or higher in CI for type checking, in which case you don't need the whole else section above, because any annotations with TypeGuard would be ignored at runtime.

jbrockmendel · 2023-02-25T21:59:19Z

@topper-123 just noticed a type:ignore where we do is_list_like(obj) and len(obj) == ... where mypy doesn't get that obj is listlike. could we address that with a TypeGuard on is_list_like?

From a short glance I don't think that is possible: The TypeGuard would be (in addition to list, ndarray) any iterable, except strings and bytes. Exceptions like that can't be represented in the type system to my knowledge.

twoertwein · 2023-02-25T23:55:54Z

pandas/util/_validators.py

@@ -258,7 +258,7 @@ def validate_bool_kwarg(
            f'For argument "{arg_name}" expected type bool, received '
            f"type {type(value).__name__}."
        )
-    return value
+    return value  # pyright: ignore[reportGeneralTypeIssues]


Could also add this file to the long list here https://github.com/pandas-dev/pandas/blob/main/pyright_reportGeneralTypeIssues.json#L17

But I wouldn't mind start using line-by-line ignores. I think we started with file-based ignores as we have in most cases just too many errors.

Things like this will show up many place where we use is_integer (after this PR), so IMO it file based exclusions will not work because every time we use is_integer in new files, these typing errors may show up. So using file based ignores will quickly make using pyright less meaningful.

So I think we will have to either:

use line based ignores (allows the widest use of TypeGuard)

only use a single type inside TypeGuard (minimizes issues when using functions that implement TypeGuard)

You can get rid of the error by changing the sig of the function:

def validate_bool_kwarg( value: bool | int | None | np.bool_, arg_name, none_allowed: bool = True, int_allowed: bool = False ) -> bool | int | None | np.bool_:

In this case, the type checker was telling you that np.bool_ is a valid type for the value argument.

WillAyd · 2023-02-26T19:33:01Z

I think this PR looks pretty good but want to go back to not using cast that @simonjayhawkins and I touched on before. Can those be removed? Replaced with ignore at least?

Didn't see a response on that - sorry if overlooking

topper-123 · 2023-02-26T19:38:18Z

I think this PR looks pretty good but want to go back to not using cast that @simonjayhawkins and I touched on before. Can those be removed? Replaced with ignore at least?

I can make a updated version that uses ignores rather than casts. I'll ping when I've done it.

topper-123 · 2023-02-26T20:01:20Z

@Dr-Irv , a problem with implementing is_pyinteger etc. IMO is that if a np.integer subtype is supplied to those locations it will not be accepted. E.g.

if is_pyinteger(obj):
   # do the right thing for ints
else:
   # do the wrong things for ints

We'll risk that np.integer types will go to the else clause, which will increase the risk of errors, AFAIKS.

jbrockmendel · 2023-02-26T21:19:41Z

Also if we wanted is_py_integer wouldn't we just use isinstance(obj, int)?

topper-123 · 2023-02-27T10:34:23Z

Ive made a version without casts. The errors seem unrelated.

Dr-Irv · 2023-03-08T12:01:06Z

pandas/util/_validators.py

@@ -437,4 +437,4 @@ def validate_insert_loc(loc: int, length: int) -> int:
        loc += length
    if not 0 <= loc <= length:
        raise IndexError(f"loc must be an integer between -{length} and {length}")
-    return loc
+    return loc  # pyright: ignore[reportGeneralTypeIssues]


Change the sig here too:

def validate_insert_loc(loc: int | np.integer, length: int) -> int | np.integer:

Note - if you really don't want to allow np.integer, then change the test inside to not use is_integer() .

Dr-Irv · 2023-03-08T12:09:51Z

pandas/io/parsers/python_parser.py

@@ -1348,4 +1348,5 @@ def _validate_skipfooter_arg(skipfooter: int) -> int:
    if skipfooter < 0:
        raise ValueError("skipfooter cannot be negative")

-    return skipfooter
+    # Incompatible return value type (got "Union[int, integer[Any]]", expected "int")
+    return skipfooter  # type: ignore[return-value]


could remove # type: ignore by changing signature:

def _validate_skipfooter_arg(skipfooter: int | np.integer) -> int | np.integer:

I think it's not good to change type signatures to keep the type checkers happy, mostly because it could easily spread to be needed everywhere, which would be a PITA.

I think it's not good to change type signatures to keep the type checkers happy, mostly because it could easily spread to be needed everywhere, which would be a PITA.

But that's part of the point of type checking! The fact that you had to do a # type: ignore is a signal that the code is inconsistent in terms of the expected arguments. In this particular case, the function is called only once, so changing the type in the function signature won't matter. Not sure of the other cases that I noted elsewhere in the PR.

twoertwein

LTGM I like that the code is now getting more explicit about bool/np.bool (even though it would be nice to use only bool)!

Rebasing should probably remove the new dependency in requirements-dev.txt and environment.yml

topper-123 · 2023-03-16T06:34:19Z

@simonjayhawkins you seem to have a blocking request. Could you see if everything is ok?

simonjayhawkins · 2023-03-16T11:34:02Z

Thanks @topper-123

The block was only to give a chance for feedback on the typing_extensions discussion.

I'm a bit skeptical that TypeGuard is a silver bullet for all is_* functions.

For instance the dtype functions can return True for a Dtype object or an array having that dtype.

def is_bool_dtype(arr_or_dtype) -> bool:
    """
    Check whether the provided array or dtype is of a boolean dtype.

    Parameters
    ----------
    arr_or_dtype : array-like or dtype
        The array or dtype to check.

I think if we introduce a union return for these, we will require too many ignores/casts/asserts.

Similarly, I think it is unlikely to help with functions such as is_array_like unless perhaps we use Protocols instead on union return types.

For the uses in this PR, we have union return types to include the the numpy types which aren't subtypes of the Python types. I recall looking at this previously (around the time that we introduced the numpy types) and I think that the general rule is that where we accept a Python type we also probably accept a numpy type, but not necessarily the other way. So, in theory, we should not really need any ignores/casts/asserts in most cases and could define aliases in _typing for the unions of the Python and Numpy types and use those throughout.

Side note: I see a use of asserts in this PR. It is subjective and we use assert with None in many places, but in general this always introduces the potential to raise an AssertionError at run-time.

otherwise lgtm

WillAyd · 2023-03-16T20:14:43Z

Great work @jbrockmendel @topper-123

WillAyd · 2023-03-16T20:15:32Z

Similarly, I think it is unlikely to help with functions such as is_array_like unless perhaps we use Protocols instead on union return types.

Interesting conversation. I've fallen behind on the latest typing functionality but maybe some combination of overloads on the argument type + TypeGuard as the return type could solve this issue too?

WillAyd reviewed Feb 10, 2023

View reviewed changes

topper-123 requested changes Feb 10, 2023

View reviewed changes

WillAyd reviewed Feb 13, 2023

View reviewed changes

topper-123 reviewed Feb 16, 2023

View reviewed changes

topper-123 mentioned this pull request Feb 18, 2023

TYP: Use Self instead of class-bound TypeVar in pyi files (remove Y019) #51480

Merged

simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Feb 22, 2023

simonjayhawkins requested changes Feb 24, 2023

View reviewed changes

simonjayhawkins mentioned this pull request Feb 24, 2023

TYP: use Self for type checking #51233

Closed

twoertwein mentioned this pull request Feb 25, 2023

Support an ExtensionDtype and ExtensionArray pandas-dev/pandas-stubs#554

Merged

twoertwein reviewed Feb 25, 2023

View reviewed changes

topper-123 approved these changes Feb 26, 2023

View reviewed changes

topper-123 mentioned this pull request Mar 1, 2023

typing_extensions #37119

Closed

Dr-Irv reviewed Mar 8, 2023

View reviewed changes

twoertwein approved these changes Mar 15, 2023

View reviewed changes

jbrockmendel added 2 commits March 15, 2023 19:36

TYP: TypeGuard

76a998c

TYP: try TypeGuard

e12b7a7

jbrockmendel and others added 12 commits March 15, 2023 19:38

suggested edits

1b3ef00

pyright fixup

a1d1517

mypy fixup

024ed70

mypy/pyright fixup

0e4d37a

ignore instead of cast

6be7228

pyright fixup

b4ee975

make use of typing_extensions odepend of python version

e683d19

fixes to typing

64e398e

fix integer

2940b01

fix pyright

35746ac

fix sql tests

f8248e5

remove typing.cast

1d5f3e4

topper-123 force-pushed the tguard branch from f35bd1d to 1d5f3e4 Compare March 15, 2023 19:54

fix mypy

3e0c4a0

simonjayhawkins approved these changes Mar 16, 2023

View reviewed changes

WillAyd merged commit 4ef638a into pandas-dev:main Mar 16, 2023

jbrockmendel deleted the tguard branch March 16, 2023 20:32

TYP: try out TypeGuard #51309

TYP: try out TypeGuard #51309

Conversation

jbrockmendel commented Feb 10, 2023

WillAyd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Feb 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Mar 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Feb 12, 2023

jbrockmendel commented Feb 12, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Feb 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Feb 18, 2023 • edited Loading

simonjayhawkins left a comment

Choose a reason for hiding this comment

Dr-Irv commented Feb 24, 2023

simonjayhawkins commented Feb 24, 2023

Dr-Irv commented Feb 24, 2023

jbrockmendel commented Feb 25, 2023 • edited by topper-123 Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd commented Feb 26, 2023

topper-123 commented Feb 26, 2023

topper-123 commented Feb 26, 2023

jbrockmendel commented Feb 26, 2023

topper-123 commented Feb 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

twoertwein left a comment

Choose a reason for hiding this comment

topper-123 commented Mar 16, 2023

simonjayhawkins commented Mar 16, 2023

WillAyd commented Mar 16, 2023

WillAyd commented Mar 16, 2023

topper-123 left a comment •

edited

Loading

topper-123 Feb 10, 2023 •

edited

Loading

topper-123 Mar 10, 2023 •

edited

Loading

topper-123 Feb 17, 2023 •

edited

Loading

topper-123 commented Feb 18, 2023 •

edited

Loading

jbrockmendel commented Feb 25, 2023 •

edited by topper-123

Loading