-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Remove Sequence[str] as type used in DataFrame.to_string() #47233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/_typing.py
Outdated
@@ -115,7 +114,7 @@ | |||
Ordered = Optional[bool] | |||
JSONSerializable = Optional[Union[PythonScalar, List, Dict]] | |||
Frequency = Union[str, "DateOffset"] | |||
Axes = Collection[Any] | |||
Axes = Union[AnyArrayLike, List, Dict, range] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List and Dict are probably too restrictive in general? If we use this in places where we accept list-like or dict-like.
Also from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#conventions
avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;
This also helps us develop robust code since mypy will report if we try to write into the passed argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only getting applied to the index
and columns
argument of pd.DataFrame
. We can't use Sequence
, as that would allow a string. We document the arguments as Index
or array-like
. I don't think List
is too restrictive - the problem is that Sequence
is too broad because of the issue with strings. I could see removing Dict
since we don't document that it is acceptable.
pandas/core/frame.py
Outdated
@@ -1155,8 +1152,8 @@ def to_string( | |||
... | |||
|
|||
@Substitution( | |||
header_type="bool or sequence of str", | |||
header="Write out the column names. If a list of strings " | |||
header_type="bool or array-like of column names", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment #47215 (comment) about the (imo) vague definition of array-like. what about list-like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we do accept a Series
or Index
here, which I guess is "list-like", so I could do that instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we do accept a
Series
orIndex
here, which I guess is "list-like", so I could do that instead.
Turns out we don't accept a Series
or Index
, but do accept a numpy array, but I think we're better off documenting it as a list, and restricting it to a list[str]
. It does have to be a list of strings.
pandas/core/frame.py
Outdated
col_space: int | list[int] | dict[Hashable, int] | None = ..., | ||
header: bool | Sequence[str] = ..., | ||
header: bool | list = ..., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue Sequence
allows a regular string to be passed, which is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing it to list[str]
…rgument for DataFrame. Define ListLike
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@Dr-Irv do you have time to update this PR? |
Sure. It dropped through the cracks as I've been focusing on pandas-stubs |
@mroeschke just did an update |
LGTM cc @twoertwein if you'd like to review |
It's difficulty. On the one side, I prefer simple type annotations even if they might be a tiny bit too wide (include exceptions like If we narrow types here in pandas, it would be great to make such changes first in pandas-stubs (probably already in it), wait for maybe a week after its new release, if no issues have been raised there, create a narrowed type alias in (It would be great if pandas would simply treat |
That's why I think this is a good change to make.
At least for the functions in this PR, IMHO, we should stop using |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
Merged in main |
I think we initially wanted to keep the annotations in pandas-stubs and pandas consistent (as much as possible). I think that this was because we still wanted pandas to be py.typed in the future. Now if we forget about the user experience of typing totally for the pandas codebase that puts us on a totally different course. We now use typing purely for code checks and robustness. Also, we don't stress over the gaps in the public API and can focus more on gradual typing of the lower level code. The concept of gradual typing is now more relevant and completeness less of an issue. For every type annotation we add to the codebase we strengthen the type checking providing additional robustness against future code changes/additions while also uncovering latent bugs along the way in some cases. For this reason, I think I am now happy to U-turn on the avoiding false positives philosophy (in the pandas codebase; I think this is wrong for pandas-stubs but this would no longer be relevant to the discussion here if we are happy to let the two diverge) So if we have a policy of preferring narrower types than wider types for the pandas codebase, the question to consider is what is the policy for dealing with false positives? I think we should always prefer a #type:ignore and a comment to code changes. Then as we become able to define a type that is not to wide and less narrow these #type:ignores can be systematically removed. This should prevent any code changes just to silence the type checker, keep visibility of the problematic types and longer term, allow us to provide more accurate type annotations. There is also the issue here of using List instead. List is mutable so the function code can mutate the argument without the type checker catching this. Is this any better than not catching the passing of a plain string? I could argue that this is potentially worse. The general rule is avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence. |
Given that
Yes, I understand that. It's just unfortunate that an argument type of I should say that as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Dr-Irv
Agreed. We do want to keep an one eye on divergence and ensure that we only do things differently where appropriate.
We have much more flexibility with stubs where we can use the latest typing features, not have gaps from dynamically generated code and not require consistency between the signatures and the function body. This means that we can evolve the types in pandas-stubs much quicker e.g use of generic types
The issue I raised here about the mutability of the argument is not an issue for the stubs.
I see @mroeschke approved so if @twoertwein is on board can merge on green.
I would be fine if it is merged as is but I have one simple suggestion and one larger (which probably needs some discussion/testing).
from collections.abc import Hashable, Sequence
from typing import Protocol, TypeVar
_T_co = TypeVar("_T_co", covariant=True)
# "Actual" Sequences accept any object for __contains__ but str is annotated to only accept str (which makes it not a Sequence)
# https://github.com/python/typeshed/blob/6ba28ae54707db9be1a89119bd6eebcc9237395b/stdlib/builtins.pyi#L564
# In practice str is a Sequence, as str inherits from it. This changes if Sequence was a Protocol
class SequenceNotStr(Sequence[_T_co], Protocol[_T_co]): # type: ignore[misc]
...
def fun_list() -> list[str]:
return ["string"]
def fun_tuple() -> tuple[str]:
return ("string",)
def fun_str() -> str:
return "string"
def accept(x: SequenceNotStr[Hashable]) -> None:
...
accept(fun_list()) # works
accept(fun_tuple()) # works
accept(fun_str()) # fails |
Thanks @twoertwein for the input on this. My initial objection was on the basis that pandas-stubs and pandas should stay in sync and that I felt being too narrow for the stubs was the wrong thing to do. My U-turn is on the basis that being too narrow may be better for typing checking of the codebase as long as we have a clear policy on dealing with an increase in false positives. Even though I mentioned that we could be less concerned about keeping pandas-stubs and pandas in sync, my approval here was actually based on keeping the two in sync and not because I think this is the right thing to do. Does this make sense? Given this, I think this discussion about making the types more precise should now move to pandas-stubs and if the types are changed there we should they sync them here? |
I assumed this narrowing is only about removing Has the goal of this PR changed to avoiding wide container types like |
See discussion in #47232 . I did this PR when the goal was to have more alignment between Since I'm more focused on |
I think the fundamental issue is that @Dr-Irv and I disagree on whether narrower or wider types are better for the public API. Since we can't come to an agreement, then I have to accept that pandas-stubs are being developed by @Dr-Irv and I haven't stopped these changes to pandas-stubs. So I am reluctantly agreeing that we merge these changes to keep the pandas-stubs and pandas in sync as much as possible. The reason that @Dr-Irv opened this PR. |
and yes I agree. but let's now do that in pandas-stubs first. The types can be evolved much quicker. Once they have been in the public domain for a period we can then sync them with pandas. |
My comment may have been too flippant. We should not diverge just for the sake of it. We should harness the benefits of using stubs to evolve the types in pandas-stubs without the constraints imposed by strictly keeping the two repos in sync.
If happy to close, then no problem. I'm currently going through the typing issues open on pandas and those that are related to the public api i'm planning to close or move to pandas-stubs repo. This issue of the types being too narrow should be raised in the pandas-stubs repo instead. |
Well, I think both of us are becoming a bit indifferent on this PR. You're fine to merge, or fine to close. I have a slight preference to merge it in, mainly because I did the work, and I think we should still aim for alignment when we can. In this particular case, I think you should merge, because the changes are for the public API within pandas, and alignment there is desirable. It does make maintaining
That's great! |
I must admit I prefer slightly wider annotations for the public API as they lead to simpler annotations (which are easier to maintain and understand) but I think @Dr-Irv's goal is not to always stick to the narrowest type but to write type annotations to catch common(!) mistakes - I really like this goal even if it can create a lot of disagreement/new issues. (common: Even if pandas and pandas-stubs have different goals, I would still like to align them where possible. And I would definitely like to migrate "conflict-free" annotations from pandas-stubs to pandas (including for the public API). I'm obviously a bit biased by proposing the protocol idea (which also doesn't solve everything - probably doesn't work for cases where we like Iterable), I propose we wait if there is a fallout from the typing issue, then discuss over at pandas-stubs whether we like to use the approach there, and if it seems to work well there, use it in pandas. (Ideally, we would have a pandas-stubs-mypy-primer to see how it would affect code by other projects.) |
yes. OK to merge as a policy of maintaining parity. OK to close if @twoertwein opens a issue on pandas-stubs with his proposal instead to resolve the issue of the types being too narrow. So @Dr-Irv your preference is to merge and @twoertwein your preference is to wait? |
Let's just merge it for now. When/if it is time for a different approach, the person might need to check this PR to revert the affected lines manually to not miss cases where SequenceNotstr could be applied. Thanks @Dr-Irv and sorry for the long back and forth. (I will wait a week before opening an issue at pandas-stubs - first want to see whether people have opinions at the python/typing issue.) |
See discussion in #47232 .
This is a follow up to #47231 to disallow
Sequence[str]
as an argument into_string()
andto_html()
. Some of the changes here are the same as in #47231