Remove Sequence[str] as type used in DataFrame.to_string() #47233

Dr-Irv · 2022-06-04T18:10:01Z

See discussion in #47232 .

This is a follow up to #47231 to disallow Sequence[str] as an argument in to_string() and to_html(). Some of the changes here are the same as in #47231

…ctor

simonjayhawkins · 2022-06-05T10:12:48Z

pandas/_typing.py

@@ -115,7 +114,7 @@
 Ordered = Optional[bool]
 JSONSerializable = Optional[Union[PythonScalar, List, Dict]]
 Frequency = Union[str, "DateOffset"]
-Axes = Collection[Any]
+Axes = Union[AnyArrayLike, List, Dict, range]


List and Dict are probably too restrictive in general? If we use this in places where we accept list-like or dict-like.

Also from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#conventions

avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;

This also helps us develop robust code since mypy will report if we try to write into the passed argument.

This is only getting applied to the index and columns argument of pd.DataFrame. We can't use Sequence, as that would allow a string. We document the arguments as Index or array-like. I don't think List is too restrictive - the problem is that Sequence is too broad because of the issue with strings. I could see removing Dict since we don't document that it is acceptable.

simonjayhawkins · 2022-06-05T10:17:21Z

pandas/core/frame.py

@@ -1155,8 +1152,8 @@ def to_string(
        ...

    @Substitution(
-        header_type="bool or sequence of str",
-        header="Write out the column names. If a list of strings "
+        header_type="bool or array-like of column names",


see my comment #47215 (comment) about the (imo) vague definition of array-like. what about list-like?

Well, we do accept a Series or Index here, which I guess is "list-like", so I could do that instead.

Well, we do accept a Series or Index here, which I guess is "list-like", so I could do that instead.

Turns out we don't accept a Series or Index, but do accept a numpy array, but I think we're better off documenting it as a list, and restricting it to a list[str] . It does have to be a list of strings.

simonjayhawkins · 2022-06-05T10:19:27Z

pandas/core/frame.py

        col_space: int | list[int] | dict[Hashable, int] | None = ...,
-        header: bool | Sequence[str] = ...,
+        header: bool | list = ...,


avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;

Same issue Sequence allows a regular string to be passed, which is invalid.

Changing it to list[str]

…rgument for DataFrame. Define ListLike

github-actions · 2022-07-11T00:08:57Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Dr-Irv · 2022-07-11T14:21:14Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Will wait for #47231 to get merged, then will finish this one.

mroeschke · 2022-10-04T18:32:17Z

@Dr-Irv do you have time to update this PR?

Dr-Irv · 2022-10-04T18:34:58Z

@Dr-Irv do you have time to update this PR?

Sure. It dropped through the cracks as I've been focusing on pandas-stubs

Dr-Irv · 2022-10-04T20:10:01Z

@mroeschke just did an update

mroeschke · 2022-10-05T15:57:12Z

LGTM cc @twoertwein if you'd like to review

twoertwein · 2022-10-05T19:14:47Z

It's difficulty.

On the one side, I prefer simple type annotations even if they might be a tiny bit too wide (include exceptions like str, bytes, and set) as they are easier to read and cover also some unusual types that are technically okay. If I remember correctly (I didn't find it in their docs when searching for it), typeshed falls along a similar line: rather too wide than too narrow. On the other side, str will cause an error for some functions, so it would be in the interest of the inline annotations to catch such potential bugs.

If we narrow types here in pandas, it would be great to make such changes first in pandas-stubs (probably already in it), wait for maybe a week after its new release, if no issues have been raised there, create a narrowed type alias in _typing and add a small comment (something like # <previous too-wide annotation> includes <exception> which causes an error in some cases).

(It would be great if pandas would simply treat str as a sequence - would make it more pythonic: pd.DataFrame(array_with_three_columns, columns="ABC") in some cases user would need to encapsulte their strings in a tuple df.loc[:, "A"] = ("assign_to_each_row_don't_iterate_over_me",) but otherwise I would really like such a change - even if it needs a long deprecation/future-warning phase.)

Dr-Irv · 2022-10-06T01:41:40Z

On the other side, str will cause an error for some functions, so it would be in the interest of the inline annotations to catch such potential bugs.

That's why I think this is a good change to make.

If we narrow types here in pandas, it would be great to make such changes first in pandas-stubs (probably already in it), wait for maybe a week after its new release, if no issues have been raised there, create a narrowed type alias in _typing and add a small comment (something like # <previous too-wide annotation> includes <exception> which causes an error in some cases).

At least for the functions in this PR, Sequence[str] has been removed in pandas-stubs.

IMHO, we should stop using Sequence[str] in both projects unless a plain str is accepted, and, if so, write it as str | Sequence[str] to make it clear that we have looked carefully at the issue. Same could be said for Sequence[Hashable]

simonjayhawkins · 2023-02-22T15:05:15Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Dr-Irv · 2023-02-22T16:38:42Z

Merged in main

simonjayhawkins · 2023-02-23T13:25:56Z

IMHO, we should stop using Sequence[str] in both projects unless a plain str is accepted

I think we initially wanted to keep the annotations in pandas-stubs and pandas consistent (as much as possible). I think that this was because we still wanted pandas to be py.typed in the future.

Now if we forget about the user experience of typing totally for the pandas codebase that puts us on a totally different course. We now use typing purely for code checks and robustness. Also, we don't stress over the gaps in the public API and can focus more on gradual typing of the lower level code.

The concept of gradual typing is now more relevant and completeness less of an issue. For every type annotation we add to the codebase we strengthen the type checking providing additional robustness against future code changes/additions while also uncovering latent bugs along the way in some cases.

For this reason, I think I am now happy to U-turn on the avoiding false positives philosophy (in the pandas codebase; I think this is wrong for pandas-stubs but this would no longer be relevant to the discussion here if we are happy to let the two diverge)

So if we have a policy of preferring narrower types than wider types for the pandas codebase, the question to consider is what is the policy for dealing with false positives?

I think we should always prefer a #type:ignore and a comment to code changes. Then as we become able to define a type that is not to wide and less narrow these #type:ignores can be systematically removed. This should prevent any code changes just to silence the type checker, keep visibility of the problematic types and longer term, allow us to provide more accurate type annotations.

There is also the issue here of using List instead. List is mutable so the function code can mutate the argument without the type checker catching this. Is this any better than not catching the passing of a plain string? I could argue that this is potentially worse.

The general rule is avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence.

simonjayhawkins · 2023-02-23T13:46:38Z

@Dr-Irv will merging this close #47232

Dr-Irv · 2023-02-23T17:20:57Z

@Dr-Irv will merging this close #47232

Only partially. I didn't audit all the pandas methods to see where else it should be removed. Was also working on this 8 months ago, so things may have changed in the code.

Dr-Irv · 2023-02-23T17:29:31Z

For this reason, I think I am now happy to U-turn on the avoiding false positives philosophy (in the pandas codebase; I think this is wrong for pandas-stubs but this would no longer be relevant to the discussion here if we are happy to let the two diverge)

Given that pandas-stubs was created about a week after I did this PR, and how pandas-stubs has evolved, I'd be fine with letting the two diverge.

There is also the issue here of using List instead. List is mutable so the function code can mutate the argument without the type checker catching this. Is this any better than not catching the passing of a plain string? I could argue that this is potentially worse.

The general rule is avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence.

Yes, I understand that. It's just unfortunate that an argument type of Sequence[str] allows a plain string to be passed. When I did this PR, it was with the understanding that we wanted to make the types in the two projects converge.

I should say that as pandas-stubs has evolved, there are times where looking at the pandas source gives us a hint as to what to put in the stubs. There is also the question of whether the docs will pick up what is annotated in the pandas source, which might be too wide or too narrow as compared to what is truly accepted by pandas. I don't think this is an issue today, but more worried about it if someone decides to use the annotations within the pandas source as a way of producing documentation.

simonjayhawkins

Thanks @Dr-Irv

Agreed. We do want to keep an one eye on divergence and ensure that we only do things differently where appropriate.

We have much more flexibility with stubs where we can use the latest typing features, not have gaps from dynamically generated code and not require consistency between the signatures and the function body. This means that we can evolve the types in pandas-stubs much quicker e.g use of generic types

The issue I raised here about the mutability of the argument is not an issue for the stubs.

I see @mroeschke approved so if @twoertwein is on board can merge on green.

twoertwein · 2023-02-23T22:05:00Z

I see @mroeschke approved so if @twoertwein is on board can merge on green.

I would be fine if it is merged as is but I have one simple suggestion and one larger (which probably needs some discussion/testing).

(the small one) Instead of using List (which inherits from MutableSequence which in turn inherits from Sequence), an option is to use MutableSequence: prevents str but accepts more Sequence-like types. Unfortunately, it does NOT accept tuple and does NOT give us the covariant behavior of Sequence.
(the larger one) There is one method of str that is incompatible with Sequence contains (str is technically not a Sequence)! If Sequence was a Protocol (it isn't), we could use the below code to allow tuple/list, exclude str, and have the covariate behavior. With one ignore it seems to work with mypy&pyright (on python 3.11). If we go that route, it could be nicer (but more verbose) to 1:1 convert Sequence into a Protocol. It might also be good to confirm with the typing gods at pyright&mypy.

from collections.abc import Hashable, Sequence
from typing import Protocol, TypeVar

_T_co = TypeVar("_T_co", covariant=True)


# "Actual" Sequences accept any object for __contains__ but str is annotated to only accept str (which makes it not a Sequence)
# https://github.com/python/typeshed/blob/6ba28ae54707db9be1a89119bd6eebcc9237395b/stdlib/builtins.pyi#L564
# In practice str is a Sequence, as str inherits from it. This changes if Sequence was a Protocol
class SequenceNotStr(Sequence[_T_co], Protocol[_T_co]):  # type: ignore[misc]
    ...

def fun_list() -> list[str]:
    return ["string"]


def fun_tuple() -> tuple[str]:
    return ("string",)

def fun_str() -> str:
    return "string"

def accept(x: SequenceNotStr[Hashable]) -> None:
    ...

accept(fun_list()) # works
accept(fun_tuple()) # works
accept(fun_str()) # fails

simonjayhawkins · 2023-02-24T10:45:01Z

Thanks @twoertwein for the input on this.

My initial objection was on the basis that pandas-stubs and pandas should stay in sync and that I felt being too narrow for the stubs was the wrong thing to do.

My U-turn is on the basis that being too narrow may be better for typing checking of the codebase as long as we have a clear policy on dealing with an increase in false positives.

Even though I mentioned that we could be less concerned about keeping pandas-stubs and pandas in sync, my approval here was actually based on keeping the two in sync and not because I think this is the right thing to do.

Does this make sense?

Given this, I think this discussion about making the types more precise should now move to pandas-stubs and if the types are changed there we should they sync them here?

twoertwein · 2023-02-24T12:56:35Z

My U-turn is on the basis that being too narrow may be better for typing checking of the codebase as long as we have a clear policy on dealing with an increase in false positives.

I assumed this narrowing is only about removing str. If that's the case, the above protocol version of Sequence could be a much simpler solution (is covariant, includes tuple and so on but not str; could: waiting for someone on python/typing to say its a terrible idea).

Has the goal of this PR changed to avoiding wide container types like Sequence in favor of long Unions with TypeVars? If there is a simple working solution, I typically prefer it :) Something like SequenceNotStr[Hashable] seems simpler than having to extend an already longer list[HashableT] | tuple[HashableT, ...] (the extending is probably more needed for pandas-stubs than for pandas).

Dr-Irv · 2023-02-24T13:20:14Z

Has the goal of this PR changed to avoiding wide container types like Sequence in favor of long Unions with TypeVars? If there is a simple working solution, I typically prefer it :) Something like SequenceNotStr[Hashable] seems simpler than having to extend an already longer list[HashableT] | tuple[HashableT, ...] (the extending is probably more needed for pandas-stubs than for pandas).

See discussion in #47232 .

I did this PR when the goal was to have more alignment between pandas-stubs and the types in pandas. But if we really don't care about that alignment in the future, then this PR becomes less important.

Since I'm more focused on pandas-stubs these days, I'm fine with letting this PR go, but we do have to make a decision on the philosophy here.

simonjayhawkins · 2023-02-24T13:23:24Z

I think the fundamental issue is that @Dr-Irv and I disagree on whether narrower or wider types are better for the public API.

Since we can't come to an agreement, then I have to accept that pandas-stubs are being developed by @Dr-Irv and I haven't stopped these changes to pandas-stubs.

So I am reluctantly agreeing that we merge these changes to keep the pandas-stubs and pandas in sync as much as possible. The reason that @Dr-Irv opened this PR.

simonjayhawkins · 2023-02-24T13:29:44Z

I assumed this narrowing is only about removing str. If that's the case, the above protocol version of Sequence could be a much simpler solution (is covariant, includes tuple and so on but not str; could: waiting for someone on python/typing to say its a terrible idea).

and yes I agree. but let's now do that in pandas-stubs first. The types can be evolved much quicker. Once they have been in the public domain for a period we can then sync them with pandas.

simonjayhawkins · 2023-02-24T13:48:45Z

I did this PR when the goal was to have more alignment between pandas-stubs and the types in pandas. But if we really don't care about that alignment in the future, then this PR becomes less important.

My comment may have been too flippant. We should not diverge just for the sake of it. We should harness the benefits of using stubs to evolve the types in pandas-stubs without the constraints imposed by strictly keeping the two repos in sync.

Since I'm more focused on pandas-stubs these days, I'm fine with letting this PR go, but we do have to make a decision on the philosophy here.

If happy to close, then no problem. I'm currently going through the typing issues open on pandas and those that are related to the public api i'm planning to close or move to pandas-stubs repo.

This issue of the types being too narrow should be raised in the pandas-stubs repo instead.

Dr-Irv · 2023-02-24T13:59:11Z

So I am reluctantly agreeing that we merge these changes to keep the pandas-stubs and pandas in sync as much as possible. The reason that @Dr-Irv opened this PR.

Since I'm more focused on pandas-stubs these days, I'm fine with letting this PR go, but we do have to make a decision on the philosophy here.

If happy to close, then no problem.

Well, I think both of us are becoming a bit indifferent on this PR. You're fine to merge, or fine to close. I have a slight preference to merge it in, mainly because I did the work, and I think we should still aim for alignment when we can.

In this particular case, I think you should merge, because the changes are for the public API within pandas, and alignment there is desirable. It does make maintaining pandas-stubs a bit easier, as I do look at what is in pandas to resolve various issues. If we diverge, then that always becomes a challenge.

I'm currently going through the typing issues open on pandas and those that are related to the public api i'm planning to close or move to pandas-stubs repo.

That's great!

twoertwein · 2023-02-24T14:21:07Z

I think the fundamental issue is that @Dr-Irv and I disagree on whether narrower or wider types are better for the public API.

I must admit I prefer slightly wider annotations for the public API as they lead to simpler annotations (which are easier to maintain and understand) but I think @Dr-Irv's goal is not to always stick to the narrowest type but to write type annotations to catch common(!) mistakes - I really like this goal even if it can create a lot of disagreement/new issues. (common: str is much more common than a custom sub-class of Sequence). We recently had a PR at pandas-stubs that went the other direction, choosing a wider type, as it would cover the more common mistake.

Even if pandas and pandas-stubs have different goals, I would still like to align them where possible. And I would definitely like to migrate "conflict-free" annotations from pandas-stubs to pandas (including for the public API).

I'm obviously a bit biased by proposing the protocol idea (which also doesn't solve everything - probably doesn't work for cases where we like Iterable), I propose we wait if there is a fallout from the typing issue, then discuss over at pandas-stubs whether we like to use the approach there, and if it seems to work well there, use it in pandas. (Ideally, we would have a pandas-stubs-mypy-primer to see how it would affect code by other projects.)

simonjayhawkins · 2023-02-24T14:42:54Z

Well, I think both of us are becoming a bit indifferent on this PR. You're fine to merge, or fine to close. I have a slight preference to merge it in, mainly because I did the work, and I think we should still aim for alignment when we can.

yes. OK to merge as a policy of maintaining parity. OK to close if @twoertwein opens a issue on pandas-stubs with his proposal instead to resolve the issue of the types being too narrow.

So @Dr-Irv your preference is to merge and @twoertwein your preference is to wait?

twoertwein · 2023-02-24T15:48:07Z

Let's just merge it for now. When/if it is time for a different approach, the person might need to check this PR to revert the affected lines manually to not miss cases where SequenceNotstr could be applied.

Thanks @Dr-Irv and sorry for the long back and forth.

(I will wait a week before opening an issue at pandas-stubs - first want to see whether people have opinions at the python/typing issue.)

Dr-Irv added 2 commits June 4, 2022 12:59

API: Disallow sets as index and columns argument in DataFrame constru…

08fb709

…ctor

TYP: Remove Sequence[str] from to_string, to_html

3e49016

Dr-Irv requested a review from simonjayhawkins June 4, 2022 18:10

simonjayhawkins reviewed Jun 5, 2022

View reviewed changes

Dr-Irv mentioned this pull request Jun 5, 2022

TYP: Remove Sequence[str] in type annotations #47232

Open

use list[str] for header argument. Remove dict as index and columns a…

a9e2849

…rgument for DataFrame. Define ListLike

jreback added the Typing type annotations, mypy/pyright type checking label Jun 5, 2022

Merge branch 'main' into typtostring

d65511c

github-actions bot added the Stale label Jul 11, 2022

mroeschke removed the Stale label Oct 4, 2022

Dr-Irv added 3 commits October 4, 2022 14:43

merge with main

d7d82f5

use Axes rather than Sequence[Hashable]

ca9b7ac

fix up whatsnew - no entry

6da7d4f

mroeschke approved these changes Oct 5, 2022

View reviewed changes

mroeschke added this to the 1.6 milestone Oct 5, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

mroeschke added the Needs Discussion Requires discussion from core team before further action label Dec 17, 2022

mroeschke removed this from the 2.0 milestone Feb 16, 2023

simonjayhawkins closed this Feb 22, 2023

merge in main

bc33a68

Dr-Irv reopened this Feb 22, 2023

simonjayhawkins approved these changes Feb 23, 2023

View reviewed changes

twoertwein mentioned this pull request Feb 24, 2023

Possible to distinguish between Sequence[str]/Iterable[str] and str? python/typing#256

Open

twoertwein merged commit 920c025 into pandas-dev:main Feb 24, 2023

twoertwein mentioned this pull request Sep 24, 2023

TYP: Misc changes for pandas-stubs; use Protocol to avoid str in Sequence #55263

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove Sequence[str] as type used in DataFrame.to_string() #47233

Remove Sequence[str] as type used in DataFrame.to_string() #47233

Dr-Irv commented Jun 4, 2022

simonjayhawkins Jun 5, 2022

Dr-Irv Jun 5, 2022

simonjayhawkins Jun 5, 2022

Dr-Irv Jun 5, 2022

Dr-Irv Jun 5, 2022

simonjayhawkins Jun 5, 2022

Dr-Irv Jun 5, 2022

Dr-Irv Jun 5, 2022

github-actions bot commented Jul 11, 2022

Dr-Irv commented Jul 11, 2022

mroeschke commented Oct 4, 2022

Dr-Irv commented Oct 4, 2022

Dr-Irv commented Oct 4, 2022

mroeschke commented Oct 5, 2022

twoertwein commented Oct 5, 2022

Dr-Irv commented Oct 6, 2022

simonjayhawkins commented Feb 22, 2023

Dr-Irv commented Feb 22, 2023

simonjayhawkins commented Feb 23, 2023

simonjayhawkins commented Feb 23, 2023

Dr-Irv commented Feb 23, 2023

Dr-Irv commented Feb 23, 2023

simonjayhawkins left a comment

twoertwein commented Feb 23, 2023

simonjayhawkins commented Feb 24, 2023

twoertwein commented Feb 24, 2023

Dr-Irv commented Feb 24, 2023

simonjayhawkins commented Feb 24, 2023 •

edited

Loading

simonjayhawkins commented Feb 24, 2023 •

edited

Loading

simonjayhawkins commented Feb 24, 2023

Dr-Irv commented Feb 24, 2023

twoertwein commented Feb 24, 2023 •

edited

Loading

simonjayhawkins commented Feb 24, 2023

twoertwein commented Feb 24, 2023

Remove Sequence[str] as type used in DataFrame.to_string() #47233

Remove Sequence[str] as type used in DataFrame.to_string() #47233

Conversation

Dr-Irv commented Jun 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jul 11, 2022

Dr-Irv commented Jul 11, 2022

mroeschke commented Oct 4, 2022

Dr-Irv commented Oct 4, 2022

Dr-Irv commented Oct 4, 2022

mroeschke commented Oct 5, 2022

twoertwein commented Oct 5, 2022

Dr-Irv commented Oct 6, 2022

simonjayhawkins commented Feb 22, 2023

Dr-Irv commented Feb 22, 2023

simonjayhawkins commented Feb 23, 2023

simonjayhawkins commented Feb 23, 2023

Dr-Irv commented Feb 23, 2023

Dr-Irv commented Feb 23, 2023

simonjayhawkins left a comment

Choose a reason for hiding this comment

twoertwein commented Feb 23, 2023

simonjayhawkins commented Feb 24, 2023

twoertwein commented Feb 24, 2023

Dr-Irv commented Feb 24, 2023

simonjayhawkins commented Feb 24, 2023 • edited Loading

simonjayhawkins commented Feb 24, 2023 • edited Loading

simonjayhawkins commented Feb 24, 2023

Dr-Irv commented Feb 24, 2023

twoertwein commented Feb 24, 2023 • edited Loading

simonjayhawkins commented Feb 24, 2023

twoertwein commented Feb 24, 2023

simonjayhawkins commented Feb 24, 2023 •

edited

Loading

simonjayhawkins commented Feb 24, 2023 •

edited

Loading

twoertwein commented Feb 24, 2023 •

edited

Loading