-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TYP: try out TypeGuard #51309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TYP: try out TypeGuard #51309
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is awesome. Even though it requires some fixes elsewhere I think it moves us in the right direction so +1 from me
pandas/core/indexes/multi.py
Outdated
# error: Incompatible types in assignment (expression has type | ||
# "Union[int, Any]", variable has type "Union[ndarray[Any, | ||
# dtype[signedinteger[Any]]], signedinteger[Any]]") | ||
loc -= 1 # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a little rusty on my python typing but I think it just infers the value from the first appearing of a variable within a function. loc
is up higher in the func from algos.searchsorted
- would that ever return a scalar? If not we might just want to be explicit that loc
can be an int
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
algos.searchsorted returns npt.NDArray[np.intp] | np.intp
. A couple lines of we should have ruled out the ndarray case. so we have np.intp and its complaining that it isnt the same as int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 on using TypeGuard
, it can help with typing, but I'm probably -1 when there is a composite return type in the TypeGuard
, if that just gives noise/ complications in other locations.
We also have to handle the situation when typing_extensions
isn't installed and also use the built-in TypeGuard
, if on python >= 3.10.
pandas/_libs/lib.pyi
Outdated
def is_period(val: object) -> TypeGuard[Period]: ... | ||
def is_interval(val: object) -> TypeGuard[Interval]: ... | ||
def is_decimal(val: object) -> TypeGuard[Decimal]: ... | ||
def is_complex(val: object) -> TypeGuard[complex | np.complex_]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be just TypeGuard[complex]
, as complext64
& complex128
are subtypes of the regular python type complex
.
def is_decimal(val: object) -> TypeGuard[Decimal]: ... | ||
def is_complex(val: object) -> TypeGuard[complex | np.complex_]: ... | ||
def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ... | ||
def is_integer(val: object) -> TypeGuard[int | np.integer]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of dislike having composite types inside the type guard (because it probably gives some uglier code in other locations). Maybe only use TypeGuard
when it gives a single type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i get where youre coming from, but i also have an urge for "maximum accuracy" here. im going to keep it how it is for now but will be OK with changing if consensus develops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the Union (but it would be nicer without it). I think not having the Unions might cause some incorrect unreachable code warnings from mypy/pyright.
if is_integer(x):
if isinstance(x, np.number): # I believe mypy would error here if it wouldn't return a Union
x = x.item()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To some extent, this change is revealing an issue throughout the pandas code.
Sometimes, when we call is_integer(value)
or is_bool(value)
, the intent is to check whether the value is an integer or numpy integer, or bool or numpy boolean. In other cases, we actually want to check that the value is NOT a numpy type (although I'm not 100% sure of that).
If it is the case that we accept a numpy integer/bool whenever we have typed an argument as int
or bool
, respectively, then the types used within the pandas functions should change to reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we want to check that it is a python integer, we'd just do isinstance(x, int)
. My issue was more that by having a union type here, we'd have to have this union type in a lot of places, which doesn't feel like very clean code. My suggestion was to not type is_integer
or other with union return types.
But ok, it was just a preference, I can also accept it, as overall this PR is a nice improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issue was more that by having a union type here, we'd have to have this usion type in a lot of places, which doesn't feel like very clean code.
Or for every place that such a change is necessary, it makes you analyze the code and determine whether you really want to change the test to isinstance(x, int)
or modify other parts of the code knowing that a np.integer
is valid.
One place in util._validators where making pyright and mypy both pass at the same time is a hassle /home/runner/work/pandas/pandas/pandas/util/validators.py:262:12 - error: Expression of type "bool | bool | int | BoolishNoneT@validate_bool_kwarg | None" cannot be assigned to return type "BoolishNoneT@validate_bool_kwarg" |
Tried annotating |
pandas/compat/numpy/function.py
Outdated
@@ -222,7 +222,7 @@ def validate_cum_func_with_skipna(skipna, args, kwargs, name) -> bool: | |||
skipna = True | |||
|
|||
validate_cum_func(args, kwargs, fname=name) | |||
return skipna | |||
return bool(skipna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really a fan of explicitly creating new objects like this to appease typing; I think an ignore is preferable. Any idea how this gets out of whack in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really a fan of explicitly creating new objects like this to appease typing; I think an ignore is preferable
works for me
Any idea how this gets out of whack in the first place?
The return annotation wants bool but we could have np.bool_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha - yea I would just ignore for now then
pandas/core/frame.py
Outdated
@@ -9936,6 +9932,7 @@ def _series_round(ser: Series, decimals: int): | |||
new_cols = list(_dict_round(self, decimals)) | |||
elif is_integer(decimals): | |||
# Dispatch to Series.round | |||
decimals = int(decimals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here - don't think we should do the int
construction
@@ -83,8 +83,11 @@ | |||
# Name "npt._ArrayLikeInt_co" is not defined [name-defined] | |||
NumpySorter = Optional[npt._ArrayLikeInt_co] # type: ignore[name-defined] | |||
|
|||
from typing_extensions import TypeGuard # pyright: reportUnusedImport = false | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd import from standard lib if on python >= 3.10, else from typing_extensions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to take into account that typing_extensions
may not be installed. All in all I'd probably do this something like this:
if PY310:
from typing import TypeGuard
else:
try:
from typing_extensions import TypeGuard
except ImportError:
TypeGuard = bool
Also add typing_extensions
to requirements-dev.txt
, I think it's possible to add it like
typing_extensions; python_version < '3.10'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does that work in an environment.yml file? requirements-dev.txt is auto-generated from the environment.yml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking for py310 requires a circular import which isnt great. im leaning towards abandoning this PR, would be happy if you'd like to get it across the finish line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't know it is autogenerated and I not knowledgeable with environment.yml to know if it`s possible there. I could take a look. I can also try take over, no problem.
pandas/core/indexes/api.py
Outdated
objs, | ||
intersect: bool = False, | ||
axis: Axis = 0, | ||
sort: bool | np.bool_ = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead wrap arguments to this function in a bool, i.e. get_objs_combined_axis(..., sort=bool(sort), ...)
?
If not that, maybe in in the _typing
have union types, e.g. any_bool = Union[bool, np.bool_]
, any_int = union[int, np.int_]
etc.?
else: | ||
npt: Any = None | ||
TypeGuard: Any = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeGuard = bool
? In my understanding a TypeGuard
is like a bool that ensure that another has some type…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably not be outside of the if TYPE_CHECKING
section?
I've made a new version, could you take look? I dislike union typeguard unless a clear reason, because they require casts or checks in other locations, but I've kept them anyway in this version. @jorisvandenbossche , what's your opinion on those, e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jbrockmendel
There is a discussion over in #51480 about typing_extensions that needs to be resolved first hence the request changes review.
Not looked in detail but from my attempts at using TypeGuard, agree that Union types can cause problems. (This was a while back and not up-to-date on best practice on this)
Seconding, agree with @WillAyd about not casting or otherwise changing code. #type:ignores and comments keep visibility of the typing issues/deficiencies.
For evolving/defining more precise types, pandas-stubs is now our playground so cc @Dr-Irv for comment
Otherwise generally +1
Looking at the places in this PR where def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ... you would have: def is_bool(val: object) -> TypeGuard[bool | np.bool_]: ...
def is_integer(val: object) -> TypeGuard[int | np.integer]: ...
def is_pybool(val: object) -> TypeGuard[bool]: ...
def is_pyinteger(val: object) -> TypeGuard[int]: ... Then in loc, new_index = index._get_loc_level(key, level=0)
if not drop_level:
if lib.is_integer(loc):
loc = cast(int, loc)
new_index = index[loc : loc + 1] you could remove the cast by doing loc, new_index = index._get_loc_level(key, level=0)
if not drop_level:
if lib.is_pyinteger(loc):
new_index = index[loc : loc + 1]
I would alter that statement a bit by saying that As for a dependency on |
IIRC we never used |
Yes, I agree. I was mixing up some of our own application code with stuff in pandas. It seems that in pandas, we always surrounded an import of Which does suggest that maybe the solution to avoid using if sys.version_info >= (3, 10):
from typing import TypeGuard
else: # Copied from typing_extensions
import typing
class _TypeGuardForm(typing._SpecialForm, _root=True):
def __repr__(self):
return "typing_extensions." + self._name
@_TypeGuardForm
def TypeGuard(self, parameters):
item = typing._type_check(parameters, f"{self} accepts only a single type.")
return typing._GenericAlias(self, (item,)) I don't know if this will work, and you might need to add some It might also be worth considering whether it is worth using python 3.10 or higher in CI for type checking, in which case you don't need the whole |
From a short glance I don't think that is possible: The TypeGuard would be (in addition to list, ndarray) any iterable, except strings and bytes. Exceptions like that can't be represented in the type system to my knowledge. The closest we have in the type system is |
@@ -258,7 +258,7 @@ def validate_bool_kwarg( | |||
f'For argument "{arg_name}" expected type bool, received ' | |||
f"type {type(value).__name__}." | |||
) | |||
return value | |||
return value # pyright: ignore[reportGeneralTypeIssues] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also add this file to the long list here https://github.com/pandas-dev/pandas/blob/main/pyright_reportGeneralTypeIssues.json#L17
But I wouldn't mind start using line-by-line ignores. I think we started with file-based ignores as we have in most cases just too many errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things like this will show up many place where we use is_integer
(after this PR), so IMO it file based exclusions will not work because every time we use is_integer
in new files, these typing errors may show up. So using file based ignores will quickly make using pyright less meaningful.
So I think we will have to either:
- use line based ignores (allows the widest use of
TypeGuard
) - only use a single type inside
TypeGuard
(minimizes issues when using functions that implementTypeGuard)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get rid of the error by changing the sig of the function:
def validate_bool_kwarg(
value: bool | int | None | np.bool_, arg_name, none_allowed: bool = True, int_allowed: bool = False
) -> bool | int | None | np.bool_:
In this case, the type checker was telling you that np.bool_
is a valid type for the value
argument.
I think this PR looks pretty good but want to go back to not using cast that @simonjayhawkins and I touched on before. Can those be removed? Replaced with ignore at least? Didn't see a response on that - sorry if overlooking |
I can make a updated version that uses ignores rather than casts. I'll ping when I've done it. |
@Dr-Irv , a problem with implementing if is_pyinteger(obj):
# do the right thing for ints
else:
# do the wrong things for ints We'll risk that |
Also if we wanted is_py_integer wouldn't we just use |
Ive made a version without casts. The errors seem unrelated. |
@@ -437,4 +437,4 @@ def validate_insert_loc(loc: int, length: int) -> int: | |||
loc += length | |||
if not 0 <= loc <= length: | |||
raise IndexError(f"loc must be an integer between -{length} and {length}") | |||
return loc | |||
return loc # pyright: ignore[reportGeneralTypeIssues] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the sig here too:
def validate_insert_loc(loc: int | np.integer, length: int) -> int | np.integer:
Note - if you really don't want to allow np.integer
, then change the test inside to not use is_integer()
.
@@ -1348,4 +1348,5 @@ def _validate_skipfooter_arg(skipfooter: int) -> int: | |||
if skipfooter < 0: | |||
raise ValueError("skipfooter cannot be negative") | |||
|
|||
return skipfooter | |||
# Incompatible return value type (got "Union[int, integer[Any]]", expected "int") | |||
return skipfooter # type: ignore[return-value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could remove # type: ignore
by changing signature:
def _validate_skipfooter_arg(skipfooter: int | np.integer) -> int | np.integer:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not good to change type signatures to keep the type checkers happy, mostly because it could easily spread to be needed everywhere, which would be a PITA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not good to change type signatures to keep the type checkers happy, mostly because it could easily spread to be needed everywhere, which would be a PITA.
But that's part of the point of type checking! The fact that you had to do a # type: ignore
is a signal that the code is inconsistent in terms of the expected arguments. In this particular case, the function is called only once, so changing the type in the function signature won't matter. Not sure of the other cases that I noted elsewhere in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LTGM I like that the code is now getting more explicit about bool
/np.bool
(even though it would be nice to use only bool
)!
Rebasing should probably remove the new dependency in requirements-dev.txt and environment.yml
@simonjayhawkins you seem to have a blocking request. Could you see if everything is ok? |
Thanks @topper-123 The block was only to give a chance for feedback on the typing_extensions discussion. I'm a bit skeptical that TypeGuard is a silver bullet for all is_* functions. For instance the dtype functions can return True for a Dtype object or an array having that dtype.
I think if we introduce a union return for these, we will require too many ignores/casts/asserts. Similarly, I think it is unlikely to help with functions such as is_array_like unless perhaps we use Protocols instead on union return types. For the uses in this PR, we have union return types to include the the numpy types which aren't subtypes of the Python types. I recall looking at this previously (around the time that we introduced the numpy types) and I think that the general rule is that where we accept a Python type we also probably accept a numpy type, but not necessarily the other way. So, in theory, we should not really need any ignores/casts/asserts in most cases and could define aliases in _typing for the unions of the Python and Numpy types and use those throughout. Side note: I see a use of asserts in this PR. It is subjective and we use assert with None in many places, but in general this always introduces the potential to raise an AssertionError at run-time. otherwise lgtm |
Great work @jbrockmendel @topper-123 |
Interesting conversation. I've fallen behind on the latest typing functionality but maybe some combination of overloads on the argument type + TypeGuard as the return type could solve this issue too? |
This effectively makes mypy understand that
is_integer(obj)
is equivalent toisinstance(obj, (int, np.integer))
. This is a mixed bag. It allows for getting rid of some redundant casts/ignores, but makes some new ones necessary. I'd be fine either way on this.