|
| 1 | +# pandas-stubs Type Checking Philosophy |
| 2 | + |
| 3 | +The goal of the pandas-stubs project is to provide type stubs for the public API |
| 4 | +that represent the recommended ways of using pandas. This differs from the
| 5 | +typing philosophy within the pandas source, as described [here](https://pandas.pydata.org/docs/development/contributing_codebase.html?highlight=typing#type-hints), whose goal
| 6 | +is to assist with the development of the pandas source code by ensuring type safety within
| 7 | +that source. |
| 8 | + |
| 9 | +Due to the methodology used by Microsoft to develop the original stubs, there are internal |
| 10 | +classes, methods and functions that are annotated within the pandas-stubs project |
| 11 | +that are incorrect with respect to the pandas source, but that have no effect on type |
| 12 | +checking user code that calls the public API. |
| 13 | + |
| 14 | +## Use of Generic Types |
| 15 | + |
| 16 | +There are other differences that are extensions of the pandas API to assist in type |
| 17 | +checking. Two key examples are that `Series` and `Interval` are typed as generic types. |
| 18 | + |
| 19 | +### Series are Generic |
| 20 | + |
| 21 | +`Series` is declared as `Series[S1]`, where `S1` is a `TypeVar` covering the types normally |
| 22 | +stored in a series. The type parameter is filled in when that type can be inferred. Consider the following example |
| 23 | +that compares the values in a `Series` to an integer. |
| 24 | + |
| 25 | +```python |
| 26 | +s = pd.Series([1, 2, 3]) |
| 27 | +lt = s < 3 |
| 28 | +``` |
| 29 | + |
| 30 | +In the pandas source, `lt` is a `Series` with a `dtype` of `bool`. In the pandas-stubs, |
| 31 | +the type of `lt` is `Series[bool]`. This allows further type checking to occur in other |
| 32 | +pandas methods. Note that in the above example, `s` is typed as `Series[Any]` because |
| 33 | +its type cannot be statically inferred. |
| 34 | + |
| 35 | +This also allows type checking for operations on series that contain date/time data. Consider |
| 36 | +the following example that creates two series of datetimes with corresponding arithmetic. |
| 37 | + |
| 38 | +```python |
| 39 | +s1 = pd.Series(pd.to_datetime(["2022-05-01", "2022-06-01"])) |
| 40 | +reveal_type(s1) |
| 41 | +s2 = pd.Series(pd.to_datetime(["2022-05-15", "2022-06-15"])) |
| 42 | +reveal_type(s2) |
| 43 | +td = s1 - s2 |
| 44 | +reveal_type(td) |
| 45 | +ssum = s1 + s2 |
| 46 | +reveal_type(ssum) |
| 47 | +``` |
| 48 | + |
| 49 | +At runtime, the above code (without the `reveal_type()` statements) raises an exception on the computation of `ssum`, because it is |
| 50 | +inappropriate to add two series containing `Timestamp` values. A type checker |
| 51 | +reveals the types as follows: |
| 52 | + |
| 53 | +```text |
| 54 | +ttest.py:4: note: Revealed type is "pandas.core.series.TimestampSeries" |
| 55 | +ttest.py:6: note: Revealed type is "pandas.core.series.TimestampSeries" |
| 56 | +ttest.py:8: note: Revealed type is "pandas.core.series.TimedeltaSeries" |
| 57 | +ttest.py:10: note: Revealed type is "builtins.Exception" |
| 58 | +``` |
| 59 | + |
| 60 | +The type `TimestampSeries` is the result of creating a series from `pd.to_datetime()`, while |
| 61 | +the type `TimedeltaSeries` is the result of subtracting two `TimestampSeries` as well as |
| 62 | +the result of `pd.to_timedelta()`. |
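The runtime behaviour that these stub types mirror can be checked directly. The following is a minimal sketch, assuming pandas is installed; it only inspects runtime results and is not taken from the pandas-stubs tests.

```python
import pandas as pd

s1 = pd.Series(pd.to_datetime(["2022-05-01", "2022-06-01"]))
s2 = pd.Series(pd.to_datetime(["2022-05-15", "2022-06-15"]))

# Subtracting two datetime series yields a series of timedeltas.
td = s1 - s2
td_kind = td.dtype.kind  # "m" is the NumPy kind code for timedelta64

# Adding two datetime series is not defined and raises at runtime.
try:
    s1 + s2
    addition_failed = False
except TypeError:
    addition_failed = True

print(td_kind, addition_failed)
```

The stubs aim to let a type checker report the failing addition before the code is ever run.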
| 63 | + |
| 64 | +### Interval is Generic |
| 65 | + |
| 66 | +A pandas `Interval` can be a numeric interval, such as an interval of integers, or an interval of |
| 67 | +time, whose endpoints are instances of the `Timestamp` class. pandas-stubs tracks |
| 68 | +the type of an `Interval` based on the arguments passed to the `Interval` constructor. |
| 69 | +This allows detecting inappropriate operations, such as adding an integer to an |
| 70 | +interval of `Timestamp`s. |
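The general mechanism can be sketched with the standard library alone. `MiniInterval` and `EndpointT` below are hypothetical names invented for illustration; they are not the classes or type variables used in pandas-stubs, and the sketch uses `datetime` in place of `Timestamp`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Generic, TypeVar

# Hypothetical constrained TypeVar: the endpoint types this sketch supports.
EndpointT = TypeVar("EndpointT", int, float, datetime)


@dataclass(frozen=True)
class MiniInterval(Generic[EndpointT]):
    """Hypothetical stand-in for an interval that tracks its endpoint type."""

    left: EndpointT
    right: EndpointT

    def contains(self, value: EndpointT) -> bool:
        # The argument must have the same type as the endpoints.
        return self.left <= value <= self.right


ints = MiniInterval(1, 5)  # inferred as MiniInterval[int]
times = MiniInterval(datetime(2000, 1, 1), datetime(2000, 1, 3))  # MiniInterval[datetime]

inside = ints.contains(3)

try:
    # A type checker rejects an int here, hence the ignore comment.
    times.contains(3)  # type: ignore
    mismatch_detected = False
except TypeError:
    # Comparing a datetime with an int also fails at runtime.
    mismatch_detected = True
```

Because the type parameter is inferred from the constructor arguments, the mismatch is visible to a type checker without running the code.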
| 71 | + |
| 72 | +## Testing the Type Stubs |
| 73 | + |
| 74 | +A set of (most likely incomplete) tests for the type stubs is in the `tests` directory |
| 75 | +of the pandas-stubs repository. The tests are run with `mypy` and `pyright` to |
| 76 | +validate correct typing, and also with `pytest` to validate that the provided code |
| 77 | +actually executes. The addition of `assert_type()` in Python 3.11, |
| 78 | +which is also available from `typing_extensions` version 4.2 onward, makes it easier |
| 79 | +to validate the return types of functions and methods. Future work |
| 80 | +is intended to expand the use of `assert_type()` in the test code. |
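As a rough illustration of how `assert_type()` serves both kinds of test run, consider the sketch below. The function `halve` is a hypothetical example, not part of pandas or pandas-stubs, and the fallback import assumes `typing_extensions` 4.2 or later is installed on Python versions before 3.11.

```python
import sys

if sys.version_info >= (3, 11):
    from typing import assert_type
else:
    # Assumption: typing_extensions >= 4.2 is available on older Pythons.
    from typing_extensions import assert_type


def halve(value: int) -> float:
    """Hypothetical function under test."""
    return value / 2


# A type checker verifies that the expression has exactly the stated type.
# At runtime, assert_type() simply returns its first argument unchanged,
# so the same line also confirms under pytest that the call executes.
result = assert_type(halve(3), float)
```

If the declared return type of `halve` were changed, the type checker would report the `assert_type()` line, while the runtime behaviour would stay the same.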
| 81 | + |
| 82 | +## Narrow vs. Wide Arguments |
| 83 | + |
| 84 | +A consideration in creating stubs is to make the set of type annotations for |
| 85 | +function arguments "just right", i.e., |
| 86 | +not too narrow and not too wide. A type annotation for an argument of a function or |
| 87 | +method is too narrow if it disallows valid arguments. A type annotation for |
| 88 | +an argument of a function or method is too wide if |
| 89 | +it allows invalid arguments. Testing for type annotations that are too narrow is rather |
| 90 | +straightforward. It is easy to create an example for which the type checker indicates |
| 91 | +the argument is incorrect, and add it to the set of tests in the pandas-stubs |
| 92 | +repository after fixing the appropriate stub. However, testing for type annotations |
| 93 | +that are too wide is a bit more complicated. |
| 94 | +In this case, the invalid code fails at runtime under `pytest`, so it cannot simply be executed, but it is still desirable to |
| 95 | +have type checkers report errors for code that is expected to fail type checking. |
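The distinction can be made concrete with a small standard-library sketch. The three functions below are hypothetical and unrelated to the pandas API; they differ only in how the same argument is annotated.

```python
from typing import Any, Sequence


def total_too_narrow(values: list[int]) -> int:
    # Too narrow: a tuple of ints works at runtime, but a type checker rejects it.
    return sum(values)


def total_too_wide(values: Any) -> int:
    # Too wide: a type checker accepts anything, including arguments that fail at runtime.
    return sum(values)


def total_just_right(values: Sequence[int]) -> int:
    # Accepts lists and tuples of ints, and lets a type checker reject an int.
    return sum(values)


# Valid at runtime, yet flagged by a type checker because of the narrow annotation.
narrow_result = total_too_narrow((1, 2, 3))  # type: ignore

# Accepted by a type checker, yet invalid at runtime because of the wide annotation.
try:
    total_too_wide(5)
    wide_failed = False
except TypeError:
    wide_failed = True

just_right_result = total_just_right([1, 2, 3])
```

The too-narrow case shows up as a type checker error on working code, while the too-wide case only shows up when the code is run.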
| 96 | + |
| 97 | +Here is an example that illustrates this concept, from `tests/test_interval.py`: |
| 98 | + |
| 99 | +```python |
| 100 | +i1 = pd.Interval( |
| 101 | +    pd.Timestamp("2000-01-01"), pd.Timestamp("2000-01-03"), closed="both" |
| 102 | +) |
| 103 | +if TYPE_CHECKING: |
| 104 | +    i1 + pd.Timestamp("2000-03-03")  # type: ignore |
| 105 | + |
| 106 | +``` |
| 107 | + |
| 108 | +In this particular example, the stubs consider that `i1` has the type |
| 109 | +`pd.Interval[pd.Timestamp]`. It is incorrect to add a `Timestamp` to a |
| 110 | +time-based interval. Without the `if TYPE_CHECKING` construct, the code would fail at runtime. |
| 111 | +However, it is also desirable to have the type checker pick up this failure. Placing |
| 112 | +`# type: ignore` on the line tells the type checker that we expect this line to fail type checking. |
| 113 | +If the type checker is configured to report unused ignore comments, it flags the line should the stubs ever accept it. |
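The same pattern can be tried without pandas. The sketch below is an analogy using the standard library's `datetime`, where adding two `datetime` objects is likewise invalid; it is not taken from the pandas-stubs tests.

```python
from datetime import datetime, timedelta
from typing import TYPE_CHECKING

start = datetime(2000, 1, 1)

# Valid both at runtime and for the type checker.
later = start + timedelta(days=2)

if TYPE_CHECKING:
    # Never executed: TYPE_CHECKING is False at runtime, so no TypeError is raised.
    # The type checker still analyses the line and is expected to reject it.
    start + datetime(2000, 3, 3)  # type: ignore
```

Running this file under `pytest` succeeds because the invalid line is skipped, while the type checker still sees it.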