Skip to content

Remove Sequence[str] as type used in DataFrame.to_string() #47233

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 24, 2023

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Jun 4, 2022

See discussion in #47232 .

This is a follow up to #47231 to disallow Sequence[str] as an argument in to_string() and to_html(). Some of the changes here are the same as in #47231

@Dr-Irv Dr-Irv requested a review from simonjayhawkins June 4, 2022 18:10
@@ -115,7 +114,7 @@
Ordered = Optional[bool]
JSONSerializable = Optional[Union[PythonScalar, List, Dict]]
Frequency = Union[str, "DateOffset"]
Axes = Collection[Any]
Axes = Union[AnyArrayLike, List, Dict, range]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List and Dict are probably too restrictive in general? If we use this in places where we accept list-like or dict-like.

Also from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#conventions

avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;

This also helps us develop robust code since mypy will report if we try to write into the passed argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only getting applied to the index and columns argument of pd.DataFrame. We can't use Sequence, as that would allow a string. We document the arguments as Index or array-like. I don't think List is too restrictive - the problem is that Sequence is too broad because of the issue with strings. I could see removing Dict since we don't document that it is acceptable.

@@ -1155,8 +1152,8 @@ def to_string(
...

@Substitution(
header_type="bool or sequence of str",
header="Write out the column names. If a list of strings "
header_type="bool or array-like of column names",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment #47215 (comment) about the (imo) vague definition of array-like. what about list-like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we do accept a Series or Index here, which I guess is "list-like", so I could do that instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we do accept a Series or Index here, which I guess is "list-like", so I could do that instead.

Turns out we don't accept a Series or Index, but do accept a numpy array, but I think we're better off documenting it as a list, and restricting it to a list[str] . It does have to be a list of strings.

col_space: int | list[int] | dict[Hashable, int] | None = ...,
header: bool | Sequence[str] = ...,
header: bool | list = ...,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue Sequence allows a regular string to be passed, which is invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing it to list[str]

@jreback jreback added the Typing type annotations, mypy/pyright type checking label Jun 5, 2022
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 11, 2022
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jul 11, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Will wait for #47231 to get merged, then will finish this one.

@mroeschke
Copy link
Member

@Dr-Irv do you have time to update this PR?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 4, 2022

@Dr-Irv do you have time to update this PR?

Sure. It dropped through the cracks as I've been focusing on pandas-stubs

@mroeschke mroeschke removed the Stale label Oct 4, 2022
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 4, 2022

@mroeschke just did an update

@mroeschke mroeschke added this to the 1.6 milestone Oct 5, 2022
@mroeschke
Copy link
Member

LGTM cc @twoertwein if you'd like to review

@twoertwein
Copy link
Member

It's difficulty.

On the one side, I prefer simple type annotations even if they might be a tiny bit too wide (include exceptions like str, bytes, and set) as they are easier to read and cover also some unusual types that are technically okay. If I remember correctly (I didn't find it in their docs when searching for it), typeshed falls along a similar line: rather too wide than too narrow. On the other side, str will cause an error for some functions, so it would be in the interest of the inline annotations to catch such potential bugs.

If we narrow types here in pandas, it would be great to make such changes first in pandas-stubs (probably already in it), wait for maybe a week after its new release, if no issues have been raised there, create a narrowed type alias in _typing and add a small comment (something like # <previous too-wide annotation> includes <exception> which causes an error in some cases).

(It would be great if pandas would simply treat str as a sequence - would make it more pythonic: pd.DataFrame(array_with_three_columns, columns="ABC") in some cases user would need to encapsulte their strings in a tuple df.loc[:, "A"] = ("assign_to_each_row_don't_iterate_over_me",) but otherwise I would really like such a change - even if it needs a long deprecation/future-warning phase.)

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 6, 2022

On the other side, str will cause an error for some functions, so it would be in the interest of the inline annotations to catch such potential bugs.

That's why I think this is a good change to make.

If we narrow types here in pandas, it would be great to make such changes first in pandas-stubs (probably already in it), wait for maybe a week after its new release, if no issues have been raised there, create a narrowed type alias in _typing and add a small comment (something like # <previous too-wide annotation> includes <exception> which causes an error in some cases).

At least for the functions in this PR, Sequence[str] has been removed in pandas-stubs.

IMHO, we should stop using Sequence[str] in both projects unless a plain str is accepted, and, if so, write it as str | Sequence[str] to make it clear that we have looked carefully at the issue. Same could be said for Sequence[Hashable]

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Dec 17, 2022
@mroeschke mroeschke removed this from the 2.0 milestone Feb 16, 2023
@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 22, 2023

Merged in main

@Dr-Irv Dr-Irv reopened this Feb 22, 2023
@simonjayhawkins
Copy link
Member

IMHO, we should stop using Sequence[str] in both projects unless a plain str is accepted

I think we initially wanted to keep the annotations in pandas-stubs and pandas consistent (as much as possible). I think that this was because we still wanted pandas to be py.typed in the future.

Now if we forget about the user experience of typing totally for the pandas codebase that puts us on a totally different course. We now use typing purely for code checks and robustness. Also, we don't stress over the gaps in the public API and can focus more on gradual typing of the lower level code.

The concept of gradual typing is now more relevant and completeness less of an issue. For every type annotation we add to the codebase we strengthen the type checking providing additional robustness against future code changes/additions while also uncovering latent bugs along the way in some cases.

For this reason, I think I am now happy to U-turn on the avoiding false positives philosophy (in the pandas codebase; I think this is wrong for pandas-stubs but this would no longer be relevant to the discussion here if we are happy to let the two diverge)

So if we have a policy of preferring narrower types than wider types for the pandas codebase, the question to consider is what is the policy for dealing with false positives?

I think we should always prefer a #type:ignore and a comment to code changes. Then as we become able to define a type that is not to wide and less narrow these #type:ignores can be systematically removed. This should prevent any code changes just to silence the type checker, keep visibility of the problematic types and longer term, allow us to provide more accurate type annotations.

There is also the issue here of using List instead. List is mutable so the function code can mutate the argument without the type checker catching this. Is this any better than not catching the passing of a plain string? I could argue that this is potentially worse.

The general rule is avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence.

@simonjayhawkins
Copy link
Member

@Dr-Irv will merging this close #47232

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 23, 2023

@Dr-Irv will merging this close #47232

Only partially. I didn't audit all the pandas methods to see where else it should be removed. Was also working on this 8 months ago, so things may have changed in the code.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 23, 2023

For this reason, I think I am now happy to U-turn on the avoiding false positives philosophy (in the pandas codebase; I think this is wrong for pandas-stubs but this would no longer be relevant to the discussion here if we are happy to let the two diverge)

Given that pandas-stubs was created about a week after I did this PR, and how pandas-stubs has evolved, I'd be fine with letting the two diverge.

There is also the issue here of using List instead. List is mutable so the function code can mutate the argument without the type checker catching this. Is this any better than not catching the passing of a plain string? I could argue that this is potentially worse.

The general rule is avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence.

Yes, I understand that. It's just unfortunate that an argument type of Sequence[str] allows a plain string to be passed. When I did this PR, it was with the understanding that we wanted to make the types in the two projects converge.

I should say that as pandas-stubs has evolved, there are times where looking at the pandas source gives us a hint as to what to put in the stubs. There is also the question of whether the docs will pick up what is annotated in the pandas source, which might be too wide or too narrow as compared to what is truly accepted by pandas. I don't think this is an issue today, but more worried about it if someone decides to use the annotations within the pandas source as a way of producing documentation.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dr-Irv

Agreed. We do want to keep an one eye on divergence and ensure that we only do things differently where appropriate.

We have much more flexibility with stubs where we can use the latest typing features, not have gaps from dynamically generated code and not require consistency between the signatures and the function body. This means that we can evolve the types in pandas-stubs much quicker e.g use of generic types

The issue I raised here about the mutability of the argument is not an issue for the stubs.

I see @mroeschke approved so if @twoertwein is on board can merge on green.

@twoertwein
Copy link
Member

I see @mroeschke approved so if @twoertwein is on board can merge on green.

I would be fine if it is merged as is but I have one simple suggestion and one larger (which probably needs some discussion/testing).

  1. (the small one) Instead of using List (which inherits from MutableSequence which in turn inherits from Sequence), an option is to use MutableSequence: prevents str but accepts more Sequence-like types. Unfortunately, it does NOT accept tuple and does NOT give us the covariant behavior of Sequence.
  2. (the larger one) There is one method of str that is incompatible with Sequence contains (str is technically not a Sequence)! If Sequence was a Protocol (it isn't), we could use the below code to allow tuple/list, exclude str, and have the covariate behavior. With one ignore it seems to work with mypy&pyright (on python 3.11). If we go that route, it could be nicer (but more verbose) to 1:1 convert Sequence into a Protocol. It might also be good to confirm with the typing gods at pyright&mypy.
from collections.abc import Hashable, Sequence
from typing import Protocol, TypeVar

_T_co = TypeVar("_T_co", covariant=True)


# "Actual" Sequences accept any object for __contains__ but str is annotated to only accept str (which makes it not a Sequence)
# https://github.com/python/typeshed/blob/6ba28ae54707db9be1a89119bd6eebcc9237395b/stdlib/builtins.pyi#L564
# In practice str is a Sequence, as str inherits from it. This changes if Sequence was a Protocol
class SequenceNotStr(Sequence[_T_co], Protocol[_T_co]):  # type: ignore[misc]
    ...

def fun_list() -> list[str]:
    return ["string"]


def fun_tuple() -> tuple[str]:
    return ("string",)

def fun_str() -> str:
    return "string"

def accept(x: SequenceNotStr[Hashable]) -> None:
    ...

accept(fun_list()) # works
accept(fun_tuple()) # works
accept(fun_str()) # fails

@simonjayhawkins
Copy link
Member

Thanks @twoertwein for the input on this.

My initial objection was on the basis that pandas-stubs and pandas should stay in sync and that I felt being too narrow for the stubs was the wrong thing to do.

My U-turn is on the basis that being too narrow may be better for typing checking of the codebase as long as we have a clear policy on dealing with an increase in false positives.

Even though I mentioned that we could be less concerned about keeping pandas-stubs and pandas in sync, my approval here was actually based on keeping the two in sync and not because I think this is the right thing to do.

Does this make sense?

Given this, I think this discussion about making the types more precise should now move to pandas-stubs and if the types are changed there we should they sync them here?

@twoertwein
Copy link
Member

My U-turn is on the basis that being too narrow may be better for typing checking of the codebase as long as we have a clear policy on dealing with an increase in false positives.

I assumed this narrowing is only about removing str. If that's the case, the above protocol version of Sequence could be a much simpler solution (is covariant, includes tuple and so on but not str; could: waiting for someone on python/typing to say its a terrible idea).

Has the goal of this PR changed to avoiding wide container types like Sequence in favor of long Unions with TypeVars? If there is a simple working solution, I typically prefer it :) Something like SequenceNotStr[Hashable] seems simpler than having to extend an already longer list[HashableT] | tuple[HashableT, ...] (the extending is probably more needed for pandas-stubs than for pandas).

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 24, 2023

Has the goal of this PR changed to avoiding wide container types like Sequence in favor of long Unions with TypeVars? If there is a simple working solution, I typically prefer it :) Something like SequenceNotStr[Hashable] seems simpler than having to extend an already longer list[HashableT] | tuple[HashableT, ...] (the extending is probably more needed for pandas-stubs than for pandas).

See discussion in #47232 .

I did this PR when the goal was to have more alignment between pandas-stubs and the types in pandas. But if we really don't care about that alignment in the future, then this PR becomes less important.

Since I'm more focused on pandas-stubs these days, I'm fine with letting this PR go, but we do have to make a decision on the philosophy here.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Feb 24, 2023

I think the fundamental issue is that @Dr-Irv and I disagree on whether narrower or wider types are better for the public API.

Since we can't come to an agreement, then I have to accept that pandas-stubs are being developed by @Dr-Irv and I haven't stopped these changes to pandas-stubs.

So I am reluctantly agreeing that we merge these changes to keep the pandas-stubs and pandas in sync as much as possible. The reason that @Dr-Irv opened this PR.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Feb 24, 2023

I assumed this narrowing is only about removing str. If that's the case, the above protocol version of Sequence could be a much simpler solution (is covariant, includes tuple and so on but not str; could: waiting for someone on python/typing to say its a terrible idea).

and yes I agree. but let's now do that in pandas-stubs first. The types can be evolved much quicker. Once they have been in the public domain for a period we can then sync them with pandas.

@simonjayhawkins
Copy link
Member

I did this PR when the goal was to have more alignment between pandas-stubs and the types in pandas. But if we really don't care about that alignment in the future, then this PR becomes less important.

My comment may have been too flippant. We should not diverge just for the sake of it. We should harness the benefits of using stubs to evolve the types in pandas-stubs without the constraints imposed by strictly keeping the two repos in sync.

Since I'm more focused on pandas-stubs these days, I'm fine with letting this PR go, but we do have to make a decision on the philosophy here.

If happy to close, then no problem. I'm currently going through the typing issues open on pandas and those that are related to the public api i'm planning to close or move to pandas-stubs repo.

This issue of the types being too narrow should be raised in the pandas-stubs repo instead.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 24, 2023

So I am reluctantly agreeing that we merge these changes to keep the pandas-stubs and pandas in sync as much as possible. The reason that @Dr-Irv opened this PR.

Since I'm more focused on pandas-stubs these days, I'm fine with letting this PR go, but we do have to make a decision on the philosophy here.

If happy to close, then no problem.

Well, I think both of us are becoming a bit indifferent on this PR. You're fine to merge, or fine to close. I have a slight preference to merge it in, mainly because I did the work, and I think we should still aim for alignment when we can.

In this particular case, I think you should merge, because the changes are for the public API within pandas, and alignment there is desirable. It does make maintaining pandas-stubs a bit easier, as I do look at what is in pandas to resolve various issues. If we diverge, then that always becomes a challenge.

I'm currently going through the typing issues open on pandas and those that are related to the public api i'm planning to close or move to pandas-stubs repo.

That's great!

@twoertwein
Copy link
Member

twoertwein commented Feb 24, 2023

I think the fundamental issue is that @Dr-Irv and I disagree on whether narrower or wider types are better for the public API.

I must admit I prefer slightly wider annotations for the public API as they lead to simpler annotations (which are easier to maintain and understand) but I think @Dr-Irv's goal is not to always stick to the narrowest type but to write type annotations to catch common(!) mistakes - I really like this goal even if it can create a lot of disagreement/new issues. (common: str is much more common than a custom sub-class of Sequence). We recently had a PR at pandas-stubs that went the other direction, choosing a wider type, as it would cover the more common mistake.

Even if pandas and pandas-stubs have different goals, I would still like to align them where possible. And I would definitely like to migrate "conflict-free" annotations from pandas-stubs to pandas (including for the public API).

I'm obviously a bit biased by proposing the protocol idea (which also doesn't solve everything - probably doesn't work for cases where we like Iterable), I propose we wait if there is a fallout from the typing issue, then discuss over at pandas-stubs whether we like to use the approach there, and if it seems to work well there, use it in pandas. (Ideally, we would have a pandas-stubs-mypy-primer to see how it would affect code by other projects.)

@simonjayhawkins
Copy link
Member

Well, I think both of us are becoming a bit indifferent on this PR. You're fine to merge, or fine to close. I have a slight preference to merge it in, mainly because I did the work, and I think we should still aim for alignment when we can.

yes. OK to merge as a policy of maintaining parity. OK to close if @twoertwein opens a issue on pandas-stubs with his proposal instead to resolve the issue of the types being too narrow.

So @Dr-Irv your preference is to merge and @twoertwein your preference is to wait?

@twoertwein
Copy link
Member

Let's just merge it for now. When/if it is time for a different approach, the person might need to check this PR to revert the affected lines manually to not miss cases where SequenceNotstr could be applied.

Thanks @Dr-Irv and sorry for the long back and forth.

(I will wait a week before opening an issue at pandas-stubs - first want to see whether people have opinions at the python/typing issue.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants