-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TYP: Add type hints to pd.read_html #34291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/html.py
Outdated
flavor=None, | ||
io: FilePathOrBuffer, | ||
match: Union[str, Pattern] = ".+", | ||
flavor: Optional[str] = None, | ||
header=None, | ||
index_col=None, | ||
skiprows=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above params are list-likes, which are not expressable in the type system, AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fine to use something like Optional[Union[int, Sequence[int]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good @simonjayhawkins
pandas/io/html.py
Outdated
flavor=None, | ||
io: FilePathOrBuffer, | ||
match: Union[str, Pattern] = ".+", | ||
flavor: Optional[str] = None, | ||
header=None, | ||
index_col=None, | ||
skiprows=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fine to use something like Optional[Union[int, Sequence[int]]
This would make mypy complain about arraylikes, .e.g. ndarray[int] (which doesn''t work anyway), so force people to convert those to e.g. lists to satisfy mypy. I think I'd actually be ok with that where the sequence is expected to be small (for example in read_html), bu not everywhere (e.g. we wouldn't want to convert arrays to lists in |
I think should use the Sequence annotation here. You are correct that there is work to be done from a larger API perspective, but I don't think that changes things here for now |
28f79d1
to
c968d3b
Compare
deprecate_nonkeyword_arguments is not preserving the signature, so these annotation don't add much value at the moment and mypy can't do much consistency checking.
why not Iterable? the docs state list-like. we can't make braking change to the api, although I'm not sure exactly what list-like means from an end user perspective. from the docstring of is_list_like in pandas_libs\lib.pyx Objects that are considered list-like are for example Python so IMO we need to include these at least.
so would be Optional[Union[int, Iterable[int], AnyArrayLike] for now and in-time when we make Series, Index and ExtensionArray Generic classes this would become Optional[Union[int, Iterable[int], AnyArrayLike[int]] unless we ensure that AnyArrayLike[int] is covered by Iterable[int] |
pandas/io/html.py
Outdated
@@ -964,14 +967,14 @@ def read_html( | |||
default of ``None`` tries to use ``lxml`` to parse and if that fails it | |||
falls back on ``bs4`` + ``html5lib``. | |||
|
|||
header : int or list-like or None, optional | |||
header : int or sequence of ints, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we should should change the user facing api. see comment.
pandas/io/html.py
Outdated
index_col: Union[int, Sequence[int], None] = None, | ||
skiprows: Union[int, Sequence[int], slice, None] = None, | ||
attrs: Optional[Dict[str, str]] = None, | ||
parse_dates: Optional[bool] = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, should be just bool. Changed.
can you rebase; @simonjayhawkins if you can review. |
I'm happy with the type annotations. Just need discussion on docstring changes. IIUC we use different nomenclature in the docstrings for types. |
Rebased and updated. @simonjayhawkins , I don't understand you wrt. the doc string, could you expand? |
ping. |
numpy has a glossary https://numpy.org/devdocs/glossary.html#term-array-like for pandas, i'm not exactly sure what list-like is, but we use it a lot in our documentation. This PR is changing it in only one place. We could switch to using sequence, if it is understood to be equivalent to typing.Sequence, using it everywhere where list-like is used where any sequence is accepted, and having proper testing. Although, I'm sure (but could be wrong) that in many places where we use list-like an iterable is accepted and therefore by changing the docstring, the documented api is changing. As I said before, the typing annotations are fine. I'm just not sure about the docstring changes. @jorisvandenbossche wdyt? are the docstring changes OK? |
The new doc strings mirror the annotations, so are correct, if the annotations are correct? My understanding of list-like is anything that passes |
without a definition for a pandas user, I would use this too as the definition. but there is then a discrepancy with sequence.
no guarantee that the annotations added are consistent with the rest of the codebase (or with the use of read_html in the wild). (as an aside we have been tripped up a few times with this, the most recent I'm aware of being #34741) deprecate_nonkeyword_arguments is not preserving the signature (no type annotations on deprecate_nonkeyword_arguments decorator) and read_html is not doing much apart from passing the parameters to _parse (which is not typed) so these annotation don't add much value at the moment and mypy can't do much consistency checking. I'm OK with the annotations as they stand though, since mypy is green and as we add more types, mypy will reveal any inconsistencies. we need to be carefull to keep the addition of type annotations and potential api changes separate. |
Regarding the discussion on "sequence" vs "list-like" in docstrings: personally I think it is more important to see that we consistently use a term in the docstrings, than to have the annotation and docstring match exactly (in the end, the docstring is meant to be user-readable, the annotation is often not). So meaning: if we almost anywhere use the term "list-like" and not "sequence" in the docstrings (I don't know if this is true, just assuming here), then I would use list-like here as well. But eg in the docstring guidelines, both are not mentioned (https://pandas.pydata.org/docs/dev/development/contributing_docstring.html#parameter-types, it mentions "iterable" and "array-like"). It would be nice to standardize this, make a glossary etc, but that's for a different issue (so for this PR I would just leave the docstring as it was and move on) |
I'm ok with leaving the doc string as is, but I've made some clarifications on them though. Updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @topper-123
pandas/io/html.py
Outdated
thousands: Optional[str] = ",", | ||
encoding: Optional[str] = None, | ||
decimal: str = ".", | ||
converters: Optional[dict] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict -> Dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of PEP 585 this is IMO ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is agreement on #30539 (comment), this would fail CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importing those from typing is deprecated. Due to PEP 563 and the intention to minimize the runtime impact of typing, this deprecation will not generate DeprecationWarnings. Instead, type checkers may warn about such deprecated usage when the target version of the checked program is signalled to be Python 3.9 or newer. It's recommended to allow for those warnings to be silenced on a project-wide basis.
The deprecated functionality will be removed from the typing module in the first Python version released 5 years after the release of Python 3.9.0.
i.e. importing type hints from typing (including typing.Dict) will be deprecated in 3.9. So does not look like a win to enforce that kind of type hints in pandas IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good reference but at a minimum can only be a future import in Python 3.7, so I think should stick with the current convention at least until we drop 3.6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a difference between the two return signatures in 3.6? I’ve thought they’re equivalent, also in 3.6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict is equivalent to Dict[Any]
or also Dict
; obviously with dict you can't subscript until the timelines noted by the PEP you linked
We import from typing in about 99% of cases so should at least stick to that convention; if subtypes are possible here they make things even better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've updated the PR.
pandas/io/html.py
Outdated
io: FilePathOrBuffer, | ||
match: Union[str, Pattern] = ".+", | ||
flavor: Optional[str] = None, | ||
header: Union[int, Sequence[int], None] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union[..., ..., None] -> Optional[Union[..., ...]]
ping. |
thanks @topper-123 |
No description provided.