Skip to content

TYP: Fix typing issues, mainly Something => IndexLabel #53469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 10, 2023

Conversation

behrenhoff
Copy link
Contributor

@behrenhoff behrenhoff commented May 31, 2023

Several typing annotations in Pandas are not correct. This patch tries to fix some of them.

The internal names in Pandas are not straight forward. An IndexLabel is used for one or multiple columns.

  • value_counts also works for a single column
  • stack/unstack can work with multiple levels at the same time

I have the feeling that there might be more such problems in the code. There are also some incorrect annotations in pandas-stubs mainly related to one/multiple index levels (refer to pandas-dev/pandas-stubs#703 - even though I haven't had time for a PR over there)

@behrenhoff behrenhoff force-pushed the fix-types branch 2 times, most recently from ebeb865 to 03dd4f7 Compare May 31, 2023 16:26
@behrenhoff
Copy link
Contributor Author

behrenhoff commented May 31, 2023

I have a general request: I would like to have better names in general.

Axis (singular) is int or a Literal['columns', ...]. So far so good.
Axes (plural) is a ListLike and is NOT used for multiple axis but instead for columns: Axes - a list of columns (or rows)
IndexLabel seems to be used for one or multiple columns (or rows) - it is a Sequence[Hashable]|Hashable. So that means it does not support a pd.Index?

There is also Level = Hashable. When functions take one or multiple levels, that makes it Level | Sequence[Level] - or IndexLabel again. Is that a good name or should there be a LevelOrLevels alias? (I hate "Or" in names but cannot come up with something better that is as clear)

It is not always clear to me which type to use where and when.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 1, 2023

Just a general comment here...

The types used in the pandas code are meant to help with keeping the pandas code consistent.

The types in pandas-stubs are meant for public consumption.

There are some inconsistencies in naming between the two projects. See pandas-dev/pandas-stubs#128 and pandas-dev/pandas-stubs#161 for some discussion.

I'm not going to look at the details of this PR, but if your goal is to have better naming for use in type checking user code that calls pandas, it might be a better contribution to do something in the pandas-stubs project instead of in pandas, or you could do both.

# Force MultiIndex for single column
if is_list_like(subset) and len(subset) == 1:
# Force MultiIndex for a list_like subset with a single column
if is_list_like(subset) and len(counts.index.names) == 1:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes because is_list_like is not a TypeGuard, so it isn't clear from the types that len(subset) can be run

@behrenhoff
Copy link
Contributor Author

I'm not going to look at the details of this PR, but if your goal is to have better naming for use in type checking user code that calls pandas

No, my goal is to fix wrong type hints in Pandas, specifically:

  • value_counts also works for a single column but the current type hints don't cover that
  • stack/unstack can work with multiple levels at the same time but the current type hints don't cover that

The naming is just an additional issue which makes it more difficult than it should be. I shouldn't have raised that topic here.

# Force MultiIndex for single column
if is_list_like(subset) and len(subset) == 1:
# Force MultiIndex for a list_like subset with a single column
if is_list_like(subset) and len(list(subset)) == 1:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line causes the error. The problem is that is_list_like does not return a sort of TypeGuard, so I don't know how to get len(subset) here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to keep the code the same but add an ignore comment instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea

@@ -7375,7 +7375,7 @@ def swaplevel(self, i: Axis = -2, j: Axis = -1, axis: Axis = 0) -> DataFrame:
result.columns = result.columns.swaplevel(i, j)
return result

def reorder_levels(self, order: Sequence[int | str], axis: Axis = 0) -> DataFrame:
def reorder_levels(self, order: Sequence[Hashable], axis: Axis = 0) -> DataFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also update the doc-string: it specifically mentions only int/str right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I undid this change because I fear I would have to dig much deeper. Further functions called by reorder_levels are typed with int | str where actually a lot of other types work as well - but I am not sure whether this is accidentally supported or not. It seems you can have a float label, for example - not that I've come across a use case.

In order not to dig deeper here, I got rid of this change and am limiting this commit to value_counts, stack and unstack where the changes should hopefully be uncontroversial. (lets hope I did the ignore comment correctly)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 5, 2023

No, my goal is to fix wrong type hints in Pandas, specifically:

Wrong type hints for pandas users? Or for pandas developers?

If it is the former, then you should also look at pandas-stubs . If it is the latter, then that's fine.

If you are using the pandas source to type check your user code, then it will be incomplete as compared to using pandas-stubs .

@twoertwein twoertwein merged commit 5cf5c73 into pandas-dev:main Jun 10, 2023
@twoertwein
Copy link
Member

Thanks @behrenhoff!

@mroeschke mroeschke added this to the 2.1 milestone Jun 10, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants