Skip to content

TYP: Remove Sequence[str] in type annotations #47232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Dr-Irv opened this issue Jun 4, 2022 · 18 comments
Open

TYP: Remove Sequence[str] in type annotations #47232

Dr-Irv opened this issue Jun 4, 2022 · 18 comments
Labels
Typing type annotations, mypy/pyright type checking

Comments

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 4, 2022

In various places, we use Sequence[str] as a type annotation. The problem is that a regular string also matches Sequence[str] . So we need to look at the various places we use Sequence[str] and change the type accordingly. Maybe just List or ArrayLike, dependent on the context.

@twoertwein
Copy link
Member

The problem is that a regular string also matches Sequence[str].

Is this a problem because:

  1. the implementation doesn't actually work with a "sequence of characters"?
  2. users expect that "some string" should (magically) be treated as ("some string", )?
  3. it makes overloads for the public/internal type annotations difficult?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 5, 2022

Mostly it is this:

the implementation doesn't actually work with a "sequence of characters"?

So we document a method like DataFrame.to_string() with columns and header being a list of columns/names, and type it as Sequence[str], but you can't just pass a single string (it would match for typing). What we want to accept there is something array-like, but not a plain string.

@simonjayhawkins
Copy link
Member

Our normal runtime validation in cases that accept a list-like (but not in this case I think?) is here

pandas/pandas/_libs/lib.pyx

Lines 1073 to 1112 in df8acf4

def is_list_like(obj: object, allow_sets: bool = True) -> bool:
"""
Check if the object is list-like.
Objects that are considered list-like are for example Python
lists, tuples, sets, NumPy arrays, and Pandas Series.
Strings and datetime objects, however, are not considered list-like.
Parameters
----------
obj : object
Object to check.
allow_sets : bool, default True
If this parameter is False, sets will not be considered list-like.
Returns
-------
bool
Whether `obj` has list-like properties.
Examples
--------
>>> import datetime
>>> is_list_like([1, 2, 3])
True
>>> is_list_like({1, 2, 3})
True
>>> is_list_like(datetime.datetime(2017, 1, 1))
False
>>> is_list_like("foo")
False
>>> is_list_like(1)
False
>>> is_list_like(np.array([2]))
True
>>> is_list_like(np.array(2))
False
"""
return c_is_list_like(obj, allow_sets)

we need a type that matches that (maybe need an extension) or the type would be either wider or narrower.

wider doesn't alert users that they have passed the wrong type.

narrower gives false positives.

Because the normal practice is to use Any in gradual typing for complex types that cannot (yet) be expressed or to just allow dynamic typing, my preference is normally to avoid false positives and have wider types for now.

So it appears that if if the docs say list-like of str then Sequence[str] is the best representation we currently have?

However @Dr-Irv I think we had a similar discussion elsewhere and your preference is for narrower types?

@simonjayhawkins simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Jun 5, 2022
@twoertwein
Copy link
Member

In case of to_string it errors in pandas/core/indexes/base.py in __new__ because is_scalar is True for a string.

Personally, I would start with testing which functions accept strings, then tighten the types of the (internal) functions that fail, and propagate the changes based on the mypy errors/call stack (in case of un-typed functions).

Creating one type that matches as much but not too much will be difficult.

@simonjayhawkins
Copy link
Member

Creating one type that matches as much but not too much will be difficult.

This is not really a pandas issue. see python/typing#256 and https://mail.python.org/archives/list/[email protected]/thread/JSZ4Q7B7FBF5LBVG2V6AAHLCQJSMDEAK/#3275P4426ARXTSDBU72O4KJLCSMRD24E

IMO leave it as it is and hope that in the future the Python type system can exclude types.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 5, 2022

So it appears that if if the docs say list-like of str then Sequence[str] is the best representation we currently have?

My concern is that Sequence[str] is too wide because a regular string will match it.

However @Dr-Irv I think we had a similar discussion elsewhere and your preference is for narrower types?

Yes, I prefer the narrower types. I think that most users when they see "list-like" will pass a list anyway.

Personally, I would start with testing which functions accept strings, then tighten the types of the (internal) functions that fail, and propagate the changes based on the mypy errors/call stack (in case of un-typed functions).

Looking through the code, whenever we used Sequence[str] as a type, we meant "sequence of strings", and we don't accept a string, so changing them all is the right way to go, IMHO.

Creating one type that matches as much but not too much will be difficult.

I agree. See discussion in python/typing#256 for various proposals. As @simonjayhawkins pointed out, it's not a pandas issue, and we just have to decide whether to be too narrow or too wide. Right now, we are too wide.

IMO leave it as it is and hope that in the future the Python type system can exclude types.

Well, I have the PR #47233 that addresses it.

@twoertwein
Copy link
Member

At the very least, we could/should create a type alias ListLike = Sequence[str] and use it wherever we can in the future remove str itself. typeshed uses type aliases for such purposes a lot.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 5, 2022

At the very least, we could/should create a type alias ListLike = Sequence[str] and use it wherever we can in the future remove str itself. typeshed uses type aliases for such purposes a lot.

I agree on creating the ListLike alias, but to not use Sequence, and use

ListLike = Union[AnyArrayLike, List]

I'm having a hard time coming up with something that is a Sequence but not a List or array that people would use.

I should note that for to_string(), if your column names are integers, you can pass a list of integers for the columns and header argument.

@twoertwein
Copy link
Member

Not sure whether that is the case: in some cases, the internal code might explicitly raise because we didn't want to accept strings but the code would probably work with strings. If it is okay to change the code to work with strings, we don't have this typing issue :)

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 5, 2022

Not sure whether that is the case: in some cases, the internal code might explicitly raise because we didn't want to accept strings but the code would probably work with strings. If it is okay to change the code to work with strings, we don't have this typing issue :)

I think that is a much more substantial change!

@simonjayhawkins
Copy link
Member

pytype doesn't have this issue python/typing#256 (comment)

I think that is a much more substantial change!

easier to change the type checker (or contribute that change to mypy/pyright)

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 5, 2022

pytype doesn't have this issue python/typing#256 (comment)

I think that is a much more substantial change!

easier to change the type checker (or contribute that change to mypy/pyright)

Yes, I agree with you there, but this issue has been around for so long in the typing community, we may have to wait a long time for mypy and pyright to do anything about it.

@twoertwein
Copy link
Member

pytype doesn't have this issue python/typing#256 (comment)

Without a PEP, pyright/mypy are very unlikely to "fix" this behavior.

Speculating/hoping: If all/most non-string Sequences have a method that str doesn't has, we could simply create our own protocol (Inherit from sequence and add a string-incompatible method).

@simonjayhawkins
Copy link
Member

So it appears that if if the docs say list-like of str then Sequence[str] is the best representation we currently have?

My concern is that Sequence[str] is too wide because a regular string will match it.

I don't see that as a big deal. (it's not a pandas issue)

I assume the index/columns of the DataFrame constructor and DataFrame.to_string() should probably accept the same types as the Index constructor data argument? The docstring for that is data: array-like (1-dimensional). I assume you are equally concerned that AnyArrayLike is too wide because a 2d array will match it?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 6, 2022

My concern is that Sequence[str] is too wide because a regular string will match it.

I don't see that as a big deal. (it's not a pandas issue)

It's a big deal from a user of the public API. We can always use a narrower type in pandas-stubs .

I assume the index/columns of the DataFrame constructor and DataFrame.to_string() should probably accept the same types as the Index constructor data argument? The docstring for that is data: array-like (1-dimensional). I assume you are equally concerned that AnyArrayLike is too wide because a 2d array will match it?

Fair point because of np.ndarray being multidimensional, so we'd have to define a type for 1D arrays.

Having said that, I think the error of doing pd.DataFrame([1,2,3], columns="abc") is something typing should trap.

@twoertwein
Copy link
Member

Personally, I'm not a big fan of addressing one exception (str in Sequence[str]) and then creating a whitelist of accepted types. Seems like we are adding more exceptions. I would be fine having something like ListLikeButNotStr = Sequence[str] and then using it wherever we do not accept strings.

I think before going forward we should answer some of the following questions:

  • Should this be the case for only the public API or are also for the internal code? If it is only the public code: should it be pandas and pandas-stubs?
  • If we narrow input arguments, we should 1) acknowledge somewhere that that is what we aim for and 2) technically also widen return arguments.
  • What are the implications of this step (breaking Hashable apart)?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jun 6, 2022

  • Should this be the case for only the public API or are also for the internal code? If it is only the public code: should it be pandas and pandas-stubs?

For pandas-stubs, I'm taking out all references to Sequence[str] and replacing it with List[str]. Maybe that will be too narrow, but if someone has an argument x for some method that is a sequence and not a list, they could just do list(x) and the type checker will be happy.

I think the debate here is more about what we do internally in the code. For the PR's I submitted, narrowing the type to List[str] still passes the mypy type checking.

The other issue is more of a documentation one, where we are not necessarily precise in the meanings of list-like, array-like. I think that many times where we write in the docs array-like, we mean 1-D array-like, as we wouldn't accept a higher dimensional array for those arguments.

@twoertwein
Copy link
Member

The protocol idea to avoid str but include all "proper" Sequences (proper: that are compatible with its methods, unlike str) seems to gain more traction: hauntsaninja/useful_types#14

Could use that in pandas (and pandas-stubs) to avoid some lengthy unions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

No branches or pull requests

3 participants