Skip to content

TYP: try out TypeGuard #51309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 16, 2023
Merged

TYP: try out TypeGuard #51309

merged 15 commits into from
Mar 16, 2023

Conversation

jbrockmendel
Copy link
Member

This effectively makes mypy understand that is_integer(obj) is equivalent to isinstance(obj, (int, np.integer)). This is a mixed bag. It allows for getting rid of some redundant casts/ignores, but makes some new ones necessary. I'd be fine either way on this.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is awesome. Even though it requires some fixes elsewhere I think it moves us in the right direction so +1 from me

# error: Incompatible types in assignment (expression has type
# "Union[int, Any]", variable has type "Union[ndarray[Any,
# dtype[signedinteger[Any]]], signedinteger[Any]]")
loc -= 1 # type: ignore[assignment]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a little rusty on my python typing but I think it just infers the value from the first appearing of a variable within a function. loc is up higher in the func from algos.searchsorted - would that ever return a scalar? If not we might just want to be explicit that loc can be an int as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

algos.searchsorted returns npt.NDArray[np.intp] | np.intp. A couple lines of we should have ruled out the ndarray case. so we have np.intp and its complaining that it isnt the same as int.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on using TypeGuard, it can help with typing, but I'm probably -1 when there is a composite return type in the TypeGuard, if that just gives noise/ complications in other locations.

We also have to handle the situation when typing_extensions isn't installed and also use the built-in TypeGuard, if on python >= 3.10.

def is_period(val: object) -> TypeGuard[Period]: ...
def is_interval(val: object) -> TypeGuard[Interval]: ...
def is_decimal(val: object) -> TypeGuard[Decimal]: ...
def is_complex(val: object) -> TypeGuard[complex | np.complex_]: ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be just TypeGuard[complex], as complext64 & complex128 are subtypes of the regular python type complex.

def is_decimal(val: object) -> TypeGuard[Decimal]: ...
def is_complex(val: object) -> TypeGuard[complex | np.complex_]: ...
def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ...
Copy link
Contributor

@topper-123 topper-123 Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of dislike having composite types inside the type guard (because it probably gives some uglier code in other locations). Maybe only use TypeGuard when it gives a single type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i get where youre coming from, but i also have an urge for "maximum accuracy" here. im going to keep it how it is for now but will be OK with changing if consensus develops

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the Union (but it would be nicer without it). I think not having the Unions might cause some incorrect unreachable code warnings from mypy/pyright.

if is_integer(x):
    if isinstance(x, np.number): # I believe mypy would error here if it wouldn't return a Union
        x = x.item()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To some extent, this change is revealing an issue throughout the pandas code.

Sometimes, when we call is_integer(value) or is_bool(value), the intent is to check whether the value is an integer or numpy integer, or bool or numpy boolean. In other cases, we actually want to check that the value is NOT a numpy type (although I'm not 100% sure of that).

If it is the case that we accept a numpy integer/bool whenever we have typed an argument as int or bool, respectively, then the types used within the pandas functions should change to reflect that.

Copy link
Contributor

@topper-123 topper-123 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we want to check that it is a python integer, we'd just do isinstance(x, int). My issue was more that by having a union type here, we'd have to have this union type in a lot of places, which doesn't feel like very clean code. My suggestion was to not type is_integer or other with union return types.

But ok, it was just a preference, I can also accept it, as overall this PR is a nice improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue was more that by having a union type here, we'd have to have this usion type in a lot of places, which doesn't feel like very clean code.

Or for every place that such a change is necessary, it makes you analyze the code and determine whether you really want to change the test to isinstance(x, int) or modify other parts of the code knowing that a np.integer is valid.

@jbrockmendel
Copy link
Member Author

One place in util._validators where making pyright and mypy both pass at the same time is a hassle

/home/runner/work/pandas/pandas/pandas/util/validators.py:262:12 - error: Expression of type "bool | bool | int | BoolishNoneT@validate_bool_kwarg | None" cannot be assigned to return type "BoolishNoneT@validate_bool_kwarg"
Type "bool | bool_ | int | BoolishNoneT@validate_bool_kwarg | None" cannot be assigned to type "BoolishNoneT@validate_bool_kwarg" (reportGeneralTypeIssues)

@jbrockmendel
Copy link
Member Author

Tried annotating Index._is_multi as TypeGuard[MultiIndex] hoping a bunch of existing ignores would become redundant, no luck

@@ -222,7 +222,7 @@ def validate_cum_func_with_skipna(skipna, args, kwargs, name) -> bool:
skipna = True

validate_cum_func(args, kwargs, fname=name)
return skipna
return bool(skipna)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a fan of explicitly creating new objects like this to appease typing; I think an ignore is preferable. Any idea how this gets out of whack in the first place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a fan of explicitly creating new objects like this to appease typing; I think an ignore is preferable

works for me

Any idea how this gets out of whack in the first place?

The return annotation wants bool but we could have np.bool_

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha - yea I would just ignore for now then

@@ -9936,6 +9932,7 @@ def _series_round(ser: Series, decimals: int):
new_cols = list(_dict_round(self, decimals))
elif is_integer(decimals):
# Dispatch to Series.round
decimals = int(decimals)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here - don't think we should do the int construction

@@ -83,8 +83,11 @@
# Name "npt._ArrayLikeInt_co" is not defined [name-defined]
NumpySorter = Optional[npt._ArrayLikeInt_co] # type: ignore[name-defined]

from typing_extensions import TypeGuard # pyright: reportUnusedImport = false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd import from standard lib if on python >= 3.10, else from typing_extensions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to take into account that typing_extensions may not be installed. All in all I'd probably do this something like this:

if PY310:
    from typing import TypeGuard
else:
   try:
        from typing_extensions import TypeGuard
    except ImportError:
        TypeGuard = bool

Also add typing_extensions to requirements-dev.txt , I think it's possible to add it like

typing_extensions; python_version < '3.10'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that work in an environment.yml file? requirements-dev.txt is auto-generated from the environment.yml.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking for py310 requires a circular import which isnt great. im leaning towards abandoning this PR, would be happy if you'd like to get it across the finish line

Copy link
Contributor

@topper-123 topper-123 Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't know it is autogenerated and I not knowledgeable with environment.yml to know if it`s possible there. I could take a look. I can also try take over, no problem.

objs,
intersect: bool = False,
axis: Axis = 0,
sort: bool | np.bool_ = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead wrap arguments to this function in a bool, i.e. get_objs_combined_axis(..., sort=bool(sort), ...)?

If not that, maybe in in the _typing have union types, e.g. any_bool = Union[bool, np.bool_], any_int = union[int, np.int_] etc.?

else:
npt: Any = None
TypeGuard: Any = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeGuard = bool? In my understanding a TypeGuard is like a bool that ensure that another has some type…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably not be outside of the if TYPE_CHECKING section?

@topper-123
Copy link
Contributor

topper-123 commented Feb 18, 2023

I've made a new version, could you take look?

I dislike union typeguard unless a clear reason, because they require casts or checks in other locations, but I've kept them anyway in this version. @jorisvandenbossche , what's your opinion on those, e.g. def is_integer(val: object) -> TypeGuard[int | np.integer]?

@simonjayhawkins simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Feb 22, 2023
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jbrockmendel

There is a discussion over in #51480 about typing_extensions that needs to be resolved first hence the request changes review.

Not looked in detail but from my attempts at using TypeGuard, agree that Union types can cause problems. (This was a while back and not up-to-date on best practice on this)

Seconding, agree with @WillAyd about not casting or otherwise changing code. #type:ignores and comments keep visibility of the typing issues/deficiencies.

For evolving/defining more precise types, pandas-stubs is now our playground so cc @Dr-Irv for comment

Otherwise generally +1

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 24, 2023

Not looked in detail but from my attempts at using TypeGuard, agree that Union types can cause problems. (This was a while back and not up-to-date on best practice on this)

Looking at the places in this PR where cast had to be done to handle cases where you wanted int and not np.integer, I think it would be better if we had more explicit is_* functions. So in _libs/lib.pyi, rather than

def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ...

you would have:

def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ...
def is_pybool(val: object) -> TypeGuard[bool]: ...
def is_pyinteger(val: object) -> TypeGuard[int]: ...

Then in pandas/core/generic.py, around line 4076, instead of

            loc, new_index = index._get_loc_level(key, level=0)
            if not drop_level:
                if lib.is_integer(loc):
                    loc = cast(int, loc)
                    new_index = index[loc : loc + 1]

you could remove the cast by doing

            loc, new_index = index._get_loc_level(key, level=0)
            if not drop_level:
                if lib.is_pyinteger(loc):
                    new_index = index[loc : loc + 1]

For evolving/defining more precise types, pandas-stubs is now our playground so cc @Dr-Irv for comment

I would alter that statement a bit by saying that pandas-stubs is meant to provide types for the public API, as opposed to being precise. For something like the is_* functions, they are for internal use, so if TypeGuard helps, use it.

As for a dependency on typing_extensions, I think this had to be done when we still supported python 3.7 and were using Literal in typing constructs. So if we were willing to use typing_extensions back then, we should be OK to use it now.

@simonjayhawkins
Copy link
Member

As for a dependency on typing_extensions, I think this had to be done when we still supported python 3.7 and were using Literal in typing constructs. So if we were willing to use typing_extensions back then, we should be OK to use it now.

IIRC we never used typing_extensions and have only ever used workarounds.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 24, 2023

As for a dependency on typing_extensions, I think this had to be done when we still supported python 3.7 and were using Literal in typing constructs. So if we were willing to use typing_extensions back then, we should be OK to use it now.

IIRC we never used typing_extensions and have only ever used workarounds.

Yes, I agree. I was mixing up some of our own application code with stuff in pandas. It seems that in pandas, we always surrounded an import of Literal with an if TYPE_CHECKING, and the CI would do type checking using python 3.8, even though we still supported python 3.7.

Which does suggest that maybe the solution to avoid using typing_extensions is to just copy the code for TypeGuard from typing_extensions, i.e.:

if sys.version_info >= (3, 10):
    from typing import TypeGuard
else:  # Copied from typing_extensions
    import typing

    class _TypeGuardForm(typing._SpecialForm, _root=True):
        def __repr__(self):
            return "typing_extensions." + self._name

    @_TypeGuardForm
    def TypeGuard(self, parameters):
        item = typing._type_check(parameters, f"{self} accepts only a single type.")
        return typing._GenericAlias(self, (item,))

I don't know if this will work, and you might need to add some #type: ignore and/or #pyright: ignore statements in there, but just copying what is in typing_extensions ought to work.

It might also be worth considering whether it is worth using python 3.10 or higher in CI for type checking, in which case you don't need the whole else section above, because any annotations with TypeGuard would be ignored at runtime.

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Feb 25, 2023

@topper-123 just noticed a type:ignore where we do is_list_like(obj) and len(obj) == ... where mypy doesn't get that obj is listlike. could we address that with a TypeGuard on is_list_like?

From a short glance I don't think that is possible: The TypeGuard would be (in addition to list, ndarray) any iterable, except strings and bytes. Exceptions like that can't be represented in the type system to my knowledge.

The closest we have in the type system is _typing.ListLike, which is ExtensionArray | np.ndarray | Index | Series | List | range.

@@ -258,7 +258,7 @@ def validate_bool_kwarg(
f'For argument "{arg_name}" expected type bool, received '
f"type {type(value).__name__}."
)
return value
return value # pyright: ignore[reportGeneralTypeIssues]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also add this file to the long list here https://github.com/pandas-dev/pandas/blob/main/pyright_reportGeneralTypeIssues.json#L17

But I wouldn't mind start using line-by-line ignores. I think we started with file-based ignores as we have in most cases just too many errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things like this will show up many place where we use is_integer (after this PR), so IMO it file based exclusions will not work because every time we use is_integer in new files, these typing errors may show up. So using file based ignores will quickly make using pyright less meaningful.

So I think we will have to either:

  • use line based ignores (allows the widest use of TypeGuard)
  • only use a single type inside TypeGuard (minimizes issues when using functions that implement TypeGuard)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get rid of the error by changing the sig of the function:

def validate_bool_kwarg(
    value: bool | int | None | np.bool_, arg_name, none_allowed: bool = True, int_allowed: bool = False
) -> bool | int | None | np.bool_:

In this case, the type checker was telling you that np.bool_ is a valid type for the value argument.

@WillAyd
Copy link
Member

WillAyd commented Feb 26, 2023

I think this PR looks pretty good but want to go back to not using cast that @simonjayhawkins and I touched on before. Can those be removed? Replaced with ignore at least?

Didn't see a response on that - sorry if overlooking

@topper-123
Copy link
Contributor

I think this PR looks pretty good but want to go back to not using cast that @simonjayhawkins and I touched on before. Can those be removed? Replaced with ignore at least?

I can make a updated version that uses ignores rather than casts. I'll ping when I've done it.

@topper-123
Copy link
Contributor

@Dr-Irv , a problem with implementing is_pyinteger etc. IMO is that if a np.integer subtype is supplied to those locations it will not be accepted. E.g.

if is_pyinteger(obj):
   # do the right thing for ints
else:
   # do the wrong things for ints

We'll risk that np.integer types will go to the else clause, which will increase the risk of errors, AFAIKS.

@jbrockmendel
Copy link
Member Author

Also if we wanted is_py_integer wouldn't we just use isinstance(obj, int)?

@topper-123
Copy link
Contributor

Ive made a version without casts. The errors seem unrelated.

@topper-123 topper-123 mentioned this pull request Mar 1, 2023
@@ -437,4 +437,4 @@ def validate_insert_loc(loc: int, length: int) -> int:
loc += length
if not 0 <= loc <= length:
raise IndexError(f"loc must be an integer between -{length} and {length}")
return loc
return loc # pyright: ignore[reportGeneralTypeIssues]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the sig here too:

def validate_insert_loc(loc: int | np.integer, length: int) -> int | np.integer:

Note - if you really don't want to allow np.integer, then change the test inside to not use is_integer() .

@@ -1348,4 +1348,5 @@ def _validate_skipfooter_arg(skipfooter: int) -> int:
if skipfooter < 0:
raise ValueError("skipfooter cannot be negative")

return skipfooter
# Incompatible return value type (got "Union[int, integer[Any]]", expected "int")
return skipfooter # type: ignore[return-value]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could remove # type: ignore by changing signature:

def _validate_skipfooter_arg(skipfooter: int | np.integer) -> int | np.integer:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not good to change type signatures to keep the type checkers happy, mostly because it could easily spread to be needed everywhere, which would be a PITA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not good to change type signatures to keep the type checkers happy, mostly because it could easily spread to be needed everywhere, which would be a PITA.

But that's part of the point of type checking! The fact that you had to do a # type: ignore is a signal that the code is inconsistent in terms of the expected arguments. In this particular case, the function is called only once, so changing the type in the function signature won't matter. Not sure of the other cases that I noted elsewhere in the PR.

Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM I like that the code is now getting more explicit about bool/np.bool (even though it would be nice to use only bool)!

Rebasing should probably remove the new dependency in requirements-dev.txt and environment.yml

@topper-123
Copy link
Contributor

@simonjayhawkins you seem to have a blocking request. Could you see if everything is ok?

@simonjayhawkins
Copy link
Member

Thanks @topper-123

The block was only to give a chance for feedback on the typing_extensions discussion.

I'm a bit skeptical that TypeGuard is a silver bullet for all is_* functions.

For instance the dtype functions can return True for a Dtype object or an array having that dtype.

def is_bool_dtype(arr_or_dtype) -> bool:
    """
    Check whether the provided array or dtype is of a boolean dtype.

    Parameters
    ----------
    arr_or_dtype : array-like or dtype
        The array or dtype to check.

I think if we introduce a union return for these, we will require too many ignores/casts/asserts.

Similarly, I think it is unlikely to help with functions such as is_array_like unless perhaps we use Protocols instead on union return types.

For the uses in this PR, we have union return types to include the the numpy types which aren't subtypes of the Python types. I recall looking at this previously (around the time that we introduced the numpy types) and I think that the general rule is that where we accept a Python type we also probably accept a numpy type, but not necessarily the other way. So, in theory, we should not really need any ignores/casts/asserts in most cases and could define aliases in _typing for the unions of the Python and Numpy types and use those throughout.

Side note: I see a use of asserts in this PR. It is subjective and we use assert with None in many places, but in general this always introduces the potential to raise an AssertionError at run-time.

otherwise lgtm

@WillAyd WillAyd merged commit 4ef638a into pandas-dev:main Mar 16, 2023
@WillAyd
Copy link
Member

WillAyd commented Mar 16, 2023

Great work @jbrockmendel @topper-123

@WillAyd
Copy link
Member

WillAyd commented Mar 16, 2023

Similarly, I think it is unlikely to help with functions such as is_array_like unless perhaps we use Protocols instead on union return types.

Interesting conversation. I've fallen behind on the latest typing functionality but maybe some combination of overloads on the argument type + TypeGuard as the return type could solve this issue too?

@jbrockmendel jbrockmendel deleted the tguard branch March 16, 2023 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants