ENH: Allow type declaration of dataframes with index other than Index #54378

Closed
1 of 3 tasks
davidgilbertson opened this issue Aug 2, 2023 · 5 comments
Labels
Closing Candidate, Enhancement, Typing, Upstream issue

Comments

@davidgilbertson

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I'm a long-time Pandas user who just switched to VS Code, which has stronger type checking. I'm struggling to get error-free code when using DataFrames with a MultiIndex or DatetimeIndex. E.g. this pretty basic code results in an error in VS Code.

import pandas as pd

df = pd.DataFrame(
    dict(A=[1, 2, 3]),
    index=pd.date_range("2000-01-01", periods=3),
)

days = df.index.day  # Pylance Error: Cannot access member "day" for type "Index"

Feature Description

A few options I can think of (I'm not sure of the viability of any)

  • Built-in DatetimeDataFrame and MultiIndexDataFrame types that can be used to annotate the return value of anything that returns a DataFrame.
  • Methods return the correct type, e.g. know what sort of DF they're returning. I suspect this isn't possible in all cases, but with clever overrides some cases might be possible.
  • Put all methods on Index and raise errors/no-op when called on the wrong type. E.g. Index.day is a valid property, but returns None. Not great, as it clutters the auto-complete list.
  • Something in the docs explaining this. I can't be the first to come across this issue, but I couldn't see anything in the docs/cookbook/FAQ about it.

Alternative Solutions

I don't know. How do people who use Pandas with VS Code do this? Does everyone just turn off type checking? Is there some obvious step I'm missing?

My workaround is to create a type/class with the right index and assign that as the type, but that in itself is an error (Expression of type "DataFrame" cannot be assigned to declared type "DataFrameDatetimeIndex"), so at the point where I define the type I have to turn off type checking. But at least then I get auto-complete for a DatetimeIndex.

import pandas as pd


class DataFrameDatetimeIndex(pd.DataFrame):
    index: pd.DatetimeIndex


df: DataFrameDatetimeIndex = pd.DataFrame(
    dict(A=[1, 2, 3]),
    index=pd.date_range("2000-01-01", periods=3),
)  # type: ignore


days = df.index.day

The other workaround is to use typing.cast but that means an extra import and seems hacky.

import pandas as pd
import typing


class DataFrameDatetimeIndex(pd.DataFrame):
    index: pd.DatetimeIndex


df = pd.DataFrame(
    dict(A=[1, 2, 3]),
    index=pd.date_range("2000-01-01", periods=3),
)

df = typing.cast(DataFrameDatetimeIndex, df)

days = df.index.day

Additional Context

No response

@davidgilbertson added the Enhancement and Needs Triage labels on Aug 2, 2023
@twoertwein
Member

twoertwein commented Aug 3, 2023

You might want to try installing pandas-stubs. I think VSCode/pyright tries to analyze type annotations even of non-py.typed packages (like pandas). If you have issues after installing pandas-stubs, please report the issue here https://github.com/pandas-dev/pandas-stubs

edit: I believe your examples will still fail with pandas-stubs, but if I get pandas-dev/pandas-stubs#723 (comment) working, there might be a future where DataFrame is generic in terms of the index.

@lithomas1 added the Typing, Closing Candidate, and Upstream issue labels and removed the Needs Triage label on Aug 3, 2023
@davidgilbertson
Author

Thanks for your response. So it seems that for now at least, I need to turn off type checking if I'm working with dataframes that don't have a plain index (or manually type everything, but I'm constantly switching from multi-index to single, so this is not tenable). Is that a reasonable conclusion?

I must admit I'm a bit confused by the whole 'stubs' situation with VS Code. Like, how do I tell what stubs packages are being used? The pandas-stubs readme says it ships with the Pylance extension, so wouldn't I already have it? Where do I go to actually see where the type info is coming from?

Also, I'm a little bit confused about types and Pandas. It seems like a lot of the objects have types defined in the core library, yet pandas-stubs appears to be under active development (the stubs readme mentions the round method, but that seems to be typed just fine in Pandas). Is it just the case that it's a work in progress and soon Pandas proper will have a py.typed file and pandas-stubs will be retired? Or are the types in Pandas for internal checking only, and users are expected to also have a stubs package if they want types (or a type checker that ignores the lack of a py.typed file)?

@twoertwein
Member

I'm not sure what is shipped with VSCode (@Dr-Irv might know more). I would recommend explicitly installing pandas-stubs in your python environment if you want to use type checkers with pandas code (even if it is already part of VSCode).
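
To check what is actually installed in the active environment, a small sketch along these lines may help. It uses only the standard library's importlib.metadata; the helper name installed_version is made up for illustration, and it says nothing about any copy of the stubs that an editor extension bundles separately.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report on pandas and on the separately distributed stub package.
for name in ("pandas", "pandas-stubs"):
    print(name, "->", installed_version(name) or "not installed")
```

If pandas-stubs shows up as not installed, the type information in the editor is coming from somewhere else (a bundled copy or the pandas source itself).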

> Where do I go to actually see where the type info is coming from?

I think VSCode has a "Go to Definition" option that might help. I know pyright (the core type engine behind VSCode/Pylance) by default(?) analyzes library code, e.g., pandas, if no explicit stubs, e.g., pandas-stubs, are installed.

> Is it just the case that it's a work in progress and soon Pandas proper will have a py.typed file and pandas-stubs will be retired?

If/when pandas reaches the py.typed state (this will take years, unless it is more actively pushed), pandas-stubs will be obsolete (because type checkers will not look for external stubs if a library declares itself as py.typed). If you want to improve type annotations, you are welcome to open PRs for pandas and/or pandas-stubs :) I think the main divide between the two is whether the annotations are focused on end users (pandas-stubs) or on pandas developers. I dream of a future where we can just have one: retroactively adding type annotations that are both internally (pandas code) and externally (end user code) consistent is challenging - it is also challenging to write stubs for an API that was created before Python's type system existed (but it is easier).
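
As a rough illustration of what "the py.typed state" means in practice: PEP 561 has a package declare inline types by shipping a py.typed marker file, and whether an installed package does so can be checked from Python. This is only a sketch; the helper name ships_py_typed is hypothetical, it needs Python 3.9+ for importlib.resources.files, and the result depends on the versions you have installed.

```python
from importlib import resources


def ships_py_typed(package: str) -> bool:
    """Return True if the installed package contains a PEP 561 py.typed marker."""
    return resources.files(package).joinpath("py.typed").is_file()


# Check a couple of packages; skip any that are not installed.
for name in ("pandas", "numpy"):
    try:
        print(name, "ships py.typed:", ships_py_typed(name))
    except ModuleNotFoundError:
        print(name, "is not installed")
```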

If you require generic objects (Series[int], ...), you will have to use pandas-stubs for now (it has a generic Series and hopefully soon also a generic Index). If you mainly use popular methods, such as read_csv, and don't care too much about having a lot of Anys or Unions, you will be fine with pyright using the pandas annotations. If you use other type checkers (mypy), you have to use pandas-stubs - I believe pyright is the only type checker that is eager enough to analyze non-py.typed code.
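
A minimal sketch of what a generic annotation looks like in user code, assuming pandas-stubs is what the type checker sees. The function name total is made up for illustration. The `from __future__ import annotations` line keeps the annotation from being evaluated at runtime, so the script does not rely on pandas itself supporting Series[int] subscripting.

```python
from __future__ import annotations  # annotations stay unevaluated at runtime

import pandas as pd


def total(values: pd.Series[int]) -> int:
    """Sum an integer Series; the Series[int] annotation is for the type checker."""
    return int(values.sum())


print(total(pd.Series([1, 2, 3])))
```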

@davidgilbertson
Author

That's great, thanks for the clarifications. It sounds like work on this sort of thing is well under way so I'll close this.

@Dr-Irv
Contributor

Dr-Irv commented Aug 8, 2023

> Thanks for your response. So it seems that for now at least, I need to turn off type checking if I'm working with dataframes that don't have a plain index (or manually type everything, but I'm constantly switching from multi-index to single, so this is not tenable). Is that a reasonable conclusion?

Maybe. The issue here is that in a static typing context, we can't track what kind of index is backing the dataframe. For example, methods like set_index() can make the index of the DF any type. If you know the type of the index, it's best to do things like dtindex: DatetimeIndex = df.index so you then get the methods on DatetimeIndex. (This might require a cast.)
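
A short sketch of that suggestion, reusing the DataFrame from the original post. The first option is a plain typing.cast on the index; the second is a hypothetical helper (the name datetime_index is not part of pandas) that checks the index at runtime, which also lets the type checker narrow the type without a cast.

```python
from typing import cast

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=pd.date_range("2000-01-01", periods=3))

# Option 1: cast. There is no runtime check; the type checker trusts the claim.
dtindex = cast(pd.DatetimeIndex, df.index)
days = dtindex.day


# Option 2 (hypothetical helper): verify at runtime, which also narrows the type.
def datetime_index(frame: pd.DataFrame) -> pd.DatetimeIndex:
    index = frame.index
    if not isinstance(index, pd.DatetimeIndex):
        raise TypeError(f"expected a DatetimeIndex, got {type(index).__name__}")
    return index


months = datetime_index(df).month
```

The cast is cheaper to write, but the helper fails loudly if the index was changed along the way, e.g. by set_index().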

> I must admit I'm a bit confused by the whole 'stubs' situation with VS Code. Like, how do I tell what stubs packages are being used? The pandas-stubs readme says it ships with the Pylance extension, so wouldn't I already have it? Where do I go to actually see where the type info is coming from?

pandas-stubs is shipped with VS Code, and when we do new releases of pandas-stubs, it takes a few weeks until the new version appears in VS Code. You can do the following:

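# For the type checker only: _stub_version is declared in the pandas-stubs PYI
# files, so running this file against plain pandas is expected to fail on import.
from pandas._version import _stub_version
reveal_type(_stub_version)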

Then in VS Code, the reveal_type() will show which version of the stubs is installed. If you put your mouse over _stub_version (or over any other pandas class or method), and then right-click, and choose "go to declaration", VS Code will open the PYI file that is used to determine the type. Then mouse over the tab with the name of that PYI file, and you can see the path that was used to find the stub.

> Also, I'm a little bit confused about types and Pandas. It seems like a lot of the objects have types defined in the core library, yet pandas-stubs appears to be under active development (the stubs readme mentions the round method, but that seems to be typed just fine in Pandas). Is it just the case that it's a work in progress and soon Pandas proper will have a py.typed file and pandas-stubs will be retired? Or are the types in Pandas for internal checking only, and users are expected to also have a stubs package if they want types (or a type checker that ignores the lack of a py.typed file)?

The types in the pandas source are there for internal type checking of the pandas code for pandas development. The stubs are meant for users. There are a few advantages of using the stubs, IMHO:

  • Type checking will be faster as the type checkers don't have to check the entire pandas code base
  • The types are more expansive and support the typical use cases better
  • pandas-stubs is released more frequently so we can fix bugs faster

On a personal note, my team and I have been using the stubs on our code and it has picked up numerous bugs in our code, and saves us a lot of time.

4 participants