Skip to content

remove types from Index.__iter__()` #532

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 10, 2023
Merged

Conversation

Dr-Irv
Copy link
Collaborator

@Dr-Irv Dr-Irv commented Feb 10, 2023

  • Tests added: Please use assert_type() to assert the type of any return value
    • test_frame.py/test_in_columns()

Without this fix, the following code reports errors:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((3,4)), columns=["cat", "dog", "rat", "pig"])
cols = [c for c in df.columns if "at" in c]
print(df[cols])

pyright then reports:

colcheck.py:5:34 - error: Operator "in" not supported for types "Literal['at']" and "IndexIterScalar | tuple[Hashable, ...]"
    Operator "in" not supported for types "Literal['at']" and "bytes"
    Operator "in" not supported for types "Literal['at']" and "date"
    Operator "in" not supported for types "Literal['at']" and "datetime"
    Operator "in" not supported for types "Literal['at']" and "timedelta"
    Operator "in" not supported for types "Literal['at']" and "bool"
    Operator "in" not supported for types "Literal['at']" and "int"
    Operator "in" not supported for types "Literal['at']" and "float"
    Operator "in" not supported for types "Literal['at']" and "Timestamp"
    ... (reportGeneralTypeIssues)

mypy reports:

colcheck.py:5: error: Unsupported operand types for in (likely involving Union)
 [operator]
colcheck.py:5: error: Unsupported right operand type for in ("Union[Union[str, bytes, date, datetime, timedelta, bool, int, float, Timestamp, Timedelta], Tuple[Hashable, ...]]")  [operator]

This is because of a change made in #367 by @gandhis1 . Comments welcome from @gandhis1 if I missed something here.

Our application code uses this pattern a lot, and it would be really painful to keep having to do cast when selecting a subset of the columns.

@Dr-Irv Dr-Irv requested a review from twoertwein February 10, 2023 19:10
@@ -223,7 +222,7 @@ class Index(IndexOpsMixin, PandasObject):
def shape(self) -> tuple[int, ...]: ...
# Extra methods from old stubs
def __eq__(self, other: object) -> np_ndarray_bool: ... # type: ignore[override]
def __iter__(self) -> Iterator[IndexIterScalar | tuple[Hashable, ...]]: ...
def __iter__(self) -> Iterator: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to type the return type, shouldn't this simply have been Iterator[Hashable] (tuple and everything in IndexIterScalar) is hashable?

(Having mypy_primer could have been helpful to see the false-positives earlier.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to type the return type, shouldn't this simply have been Iterator[Hashable] (tuple and everything in IndexIterScalar) is hashable?

(Having mypy_primer could have been helpful to see the false-positives earlier.)

If I change it to Iterator[Hashable], then I get this with the test (and example above) from pyright:

Operator "in" not supported for types "Literal['at']" and "Hashable"
  Operator "in" not supported for types "Literal['at']" and "Hashable"

@gandhis1
Copy link
Contributor

gandhis1 commented Feb 10, 2023

In the original issue: #365

I had pointed out this was having issues:

    import pandas as pd
    df = pd.DataFrame(data={"col1": [1, 2], {2: ["a", "b"]})
    test = list(df.columns)
    reveal_type(test)   # List[Unknown]

That is a contrived example, but I believe the result of that is being unable to assign df.columns to a variable and then use said variable to index a DataFrame's columns.

If we make this change, are these common patterns still supported?

cols_to_keep = [c for c in df.columns if c.startswith("keep")]
df1 = df1.loc[:, cols_to_keep]
df2 = df2[cols_to_keep]
df3 = df3.groupby(by=cols_to_keep).sum()

I would have expected that I or someone else added tests for all of these but given that you didn't have to remove any tests, I'm not so sure that the test coverage is there. I may have added tests checking the type when I should have added tests checking the usage pattern.

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Feb 10, 2023

In the original issue: #365

I had pointed out this was having issues:

    import pandas as pd
    df = pd.DataFrame(data={"col1": [1, 2], {2: ["a", "b"]})
    test = list(df.columns)
    reveal_type(test)   # List[Unknown]

That is a contrived example, but I believe the result of that is being unable to assign df.columns to a variable and then use said variable to index a DataFrame's columns.

Since we don't have a generic index yet, we don't know the types that are inside of df.columns or df.index . We just know it is the base class Index. But you can use the untyped List to index into a DataFrame.

If we make this change, are these common patterns still supported?

cols_to_keep = [c for c in df.columns if c.startswith("keep")]
df1 = df1.loc[:, cols_to_keep]
df2 = df2[cols_to_keep]
df3 = df3.groupby(by=cols_to_keep).sum()

Yes, and I added that to the tests in commit 50f4f9f

I would have expected that I or someone else added tests for all of these but given that you didn't have to remove any tests, I'm not so sure that the test coverage is there. I may have added tests checking the type when I should have added tests checking the usage pattern.

I guess we didn't have the tests when one tries to use in to restrict the columns that are chosen.

I did have an idea (but it would take some work to get right) to change pandas where we make Index.tolist() have an optional argument that would be used for typing purposes, so that if you knew what kind of type was in the index (or df.columns), you would write my example as follows:

cols = [c for c in df.columns.tolist(str) if "at" in c]

The return type of tolist() would be list[str] in this case, whereas if you did df.columns.tolist(int), the return type would be list[int], but the values in the list would actually be strings (or whatever were the types in the index itself).

@twoertwein twoertwein merged commit 90db462 into pandas-dev:main Feb 10, 2023
@twoertwein
Copy link
Member

Thanks @Dr-Irv

@gandhis1
Copy link
Contributor

Hm...I guess I'm surprised everything works, because when I originally submitted that bug, I did so in response to something that was broken in my firm's codebase. Maybe it was fixed elsewhere since then. When this is released I'll keep an eye out for any regressions.

@Dr-Irv Dr-Irv deleted the indexiter branch February 27, 2023 22:18
twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023
* remove types from Index.__iter__()`

* add some tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants