remove types from Index.iter()` #532

Dr-Irv · 2023-02-10T19:09:55Z

Tests added: Please use assert_type() to assert the type of any return value
- test_frame.py/test_in_columns()

Without this fix, the following code reports errors:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((3,4)), columns=["cat", "dog", "rat", "pig"])
cols = [c for c in df.columns if "at" in c]
print(df[cols])

pyright then reports:

colcheck.py:5:34 - error: Operator "in" not supported for types "Literal['at']" and "IndexIterScalar | tuple[Hashable, ...]"
    Operator "in" not supported for types "Literal['at']" and "bytes"
    Operator "in" not supported for types "Literal['at']" and "date"
    Operator "in" not supported for types "Literal['at']" and "datetime"
    Operator "in" not supported for types "Literal['at']" and "timedelta"
    Operator "in" not supported for types "Literal['at']" and "bool"
    Operator "in" not supported for types "Literal['at']" and "int"
    Operator "in" not supported for types "Literal['at']" and "float"
    Operator "in" not supported for types "Literal['at']" and "Timestamp"
    ... (reportGeneralTypeIssues)

mypy reports:

colcheck.py:5: error: Unsupported operand types for in (likely involving Union)
 [operator]
colcheck.py:5: error: Unsupported right operand type for in ("Union[Union[str, bytes, date, datetime, timedelta, bool, int, float, Timestamp, Timedelta], Tuple[Hashable, ...]]")  [operator]

This is because of a change made in #367 by @gandhis1 . Comments welcome from @gandhis1 if I missed something here.

Our application code uses this pattern a lot, and it would be really painful to keep having to do cast when selecting a subset of the columns.

twoertwein · 2023-02-10T19:36:24Z

pandas-stubs/core/indexes/base.pyi

@@ -223,7 +222,7 @@ class Index(IndexOpsMixin, PandasObject):
    def shape(self) -> tuple[int, ...]: ...
    # Extra methods from old stubs
    def __eq__(self, other: object) -> np_ndarray_bool: ...  # type: ignore[override]
-    def __iter__(self) -> Iterator[IndexIterScalar | tuple[Hashable, ...]]: ...
+    def __iter__(self) -> Iterator: ...


If we were to type the return type, shouldn't this simply have been Iterator[Hashable] (tuple and everything in IndexIterScalar) is hashable?

(Having mypy_primer could have been helpful to see the false-positives earlier.)

#367 (comment)

If we were to type the return type, shouldn't this simply have been Iterator[Hashable] (tuple and everything in IndexIterScalar) is hashable?

(Having mypy_primer could have been helpful to see the false-positives earlier.)

If I change it to Iterator[Hashable], then I get this with the test (and example above) from pyright:

Operator "in" not supported for types "Literal['at']" and "Hashable" Operator "in" not supported for types "Literal['at']" and "Hashable"

gandhis1 · 2023-02-10T20:36:28Z

In the original issue: #365

I had pointed out this was having issues:

    import pandas as pd
    df = pd.DataFrame(data={"col1": [1, 2], {2: ["a", "b"]})
    test = list(df.columns)
    reveal_type(test)   # List[Unknown]

That is a contrived example, but I believe the result of that is being unable to assign df.columns to a variable and then use said variable to index a DataFrame's columns.

If we make this change, are these common patterns still supported?

cols_to_keep = [c for c in df.columns if c.startswith("keep")]
df1 = df1.loc[:, cols_to_keep]
df2 = df2[cols_to_keep]
df3 = df3.groupby(by=cols_to_keep).sum()

I would have expected that I or someone else added tests for all of these but given that you didn't have to remove any tests, I'm not so sure that the test coverage is there. I may have added tests checking the type when I should have added tests checking the usage pattern.

Dr-Irv · 2023-02-10T20:56:57Z

In the original issue: #365

I had pointed out this was having issues:
    import pandas as pd
    df = pd.DataFrame(data={"col1": [1, 2], {2: ["a", "b"]})
    test = list(df.columns)
    reveal_type(test)   # List[Unknown]
That is a contrived example, but I believe the result of that is being unable to assign df.columns to a variable and then use said variable to index a DataFrame's columns.

Since we don't have a generic index yet, we don't know the types that are inside of df.columns or df.index . We just know it is the base class Index. But you can use the untyped List to index into a DataFrame.

If we make this change, are these common patterns still supported?

cols_to_keep = [c for c in df.columns if c.startswith("keep")]
df1 = df1.loc[:, cols_to_keep]
df2 = df2[cols_to_keep]
df3 = df3.groupby(by=cols_to_keep).sum()

Yes, and I added that to the tests in commit 50f4f9f

I would have expected that I or someone else added tests for all of these but given that you didn't have to remove any tests, I'm not so sure that the test coverage is there. I may have added tests checking the type when I should have added tests checking the usage pattern.

I guess we didn't have the tests when one tries to use in to restrict the columns that are chosen.

I did have an idea (but it would take some work to get right) to change pandas where we make Index.tolist() have an optional argument that would be used for typing purposes, so that if you knew what kind of type was in the index (or df.columns), you would write my example as follows:

cols = [c for c in df.columns.tolist(str) if "at" in c]

The return type of tolist() would be list[str] in this case, whereas if you did df.columns.tolist(int), the return type would be list[int], but the values in the list would actually be strings (or whatever were the types in the index itself).

twoertwein · 2023-02-10T23:47:52Z

Thanks @Dr-Irv

gandhis1 · 2023-02-11T00:24:02Z

Hm...I guess I'm surprised everything works, because when I originally submitted that bug, I did so in response to something that was broken in my firm's codebase. Maybe it was fixed elsewhere since then. When this is released I'll keep an eye out for any regressions.

* remove types from Index.__iter__()` * add some tests

remove types from Index.__iter__()`

1dd248b

Dr-Irv requested a review from twoertwein February 10, 2023 19:10

twoertwein reviewed Feb 10, 2023

View reviewed changes

add some tests

50f4f9f

twoertwein approved these changes Feb 10, 2023

View reviewed changes

twoertwein merged commit 90db462 into pandas-dev:main Feb 10, 2023

Dr-Irv deleted the indexiter branch February 27, 2023 22:18

twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023

remove types from Index.__iter__()` (pandas-dev#532)

eb9a1e5

* remove types from Index.__iter__()` * add some tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remove types from Index.iter()` #532

remove types from Index.iter()` #532

Dr-Irv commented Feb 10, 2023

twoertwein Feb 10, 2023

gandhis1 Feb 10, 2023

Dr-Irv Feb 10, 2023

gandhis1 commented Feb 10, 2023 •

edited

Loading

Dr-Irv commented Feb 10, 2023

twoertwein commented Feb 10, 2023

gandhis1 commented Feb 11, 2023

remove types from Index.__iter__()` #532

remove types from Index.__iter__()` #532

Conversation

Dr-Irv commented Feb 10, 2023

twoertwein Feb 10, 2023

Choose a reason for hiding this comment

gandhis1 Feb 10, 2023

Choose a reason for hiding this comment

Dr-Irv Feb 10, 2023

Choose a reason for hiding this comment

gandhis1 commented Feb 10, 2023 • edited Loading

Dr-Irv commented Feb 10, 2023

twoertwein commented Feb 10, 2023

gandhis1 commented Feb 11, 2023

remove types from Index.iter()` #532

remove types from Index.iter()` #532

gandhis1 commented Feb 10, 2023 •

edited

Loading