Typ parts of c parser #44677
phofl
commented
Nov 29, 2021
- Ensure all linting tests pass, see here for how to run them
# Conflicts: # pandas/io/parsers/base_parser.py # pandas/io/parsers/c_parser_wrapper.py
@@ -34,7 +45,7 @@ class CParserWrapper(ParserBase):

     def __init__(
         self, src: FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str], **kwds
-    ):
+    ) -> None:
None is optional for __init__.
Personally, I still like to add it. If mypy sees __init__ without None as partially typed (I'm not sure whether that is the case), it would be good to keep None.
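For reference, a minimal sketch of the two spellings under discussion (the class names are hypothetical and not taken from the PR). To my knowledge, mypy treats __init__ as typed as soon as at least one argument is annotated, so omitting -> None should not make the method partially typed; the only difference visible at runtime is the extra entry in __annotations__.

```python
class WithoutNone:
    # Return annotation omitted; the annotated argument is what makes
    # a type checker treat this method as typed.
    def __init__(self, src: str):
        self.src = src


class WithNone:
    # Explicit, but redundant, return annotation.
    def __init__(self, src: str) -> None:
        self.src = src


# Only the second class records a "return" entry.
print(WithoutNone.__init__.__annotations__)
print(WithNone.__init__.__annotations__)
```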
I prefer adding it too.
We have some (now outdated) style guidelines for typing in https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#style-guidelines.
Not including the return type of __init__ was an unwritten style choice (IIRC preferred by @WillAyd).
If this is to be changed, we should be consistent and add it elsewhere (and update the style guidelines). I had no strong preference originally, but now that we have gone so long without adding the redundant None, I would prefer the status quo.
Ok, removed it
    ) -> tuple[
        Index | MultiIndex | None,
        Sequence[Hashable] | MultiIndex,
        Mapping[Hashable, ArrayLike],
Return types should be as concrete as possible. If you know it is a list/dict, it probably shouldn't be Sequence/Mapping.
The basic issue here is that if we type some function parameters as Sequence, we also have to type the return types as Sequence, because most of the time there is one code branch that just passes the inputs along (for example _do_date_conversion). If we want to use lists, we get into a bunch of other issues. Not 100% sure what to do there. Went with Sequence for now.
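A minimal sketch of the pass-through problem described above. The helper maybe_rename is hypothetical and is not the pandas implementation of _do_date_conversion; it only illustrates why one branch that returns its input unchanged forces the return annotation to be as wide as the parameter annotation.

```python
from __future__ import annotations

from collections.abc import Hashable, Sequence


def maybe_rename(
    names: Sequence[Hashable], rename: bool
) -> Sequence[Hashable]:
    """Hypothetical helper, not the pandas implementation."""
    if not rename:
        # Pass-through branch: the caller's object is returned as-is, so
        # the return type cannot be narrower than the parameter type.
        return names
    # Only this branch is known to build a list.
    return [f"col_{i}" for i, _ in enumerate(names)]


names = ("a", "b")
print(maybe_rename(names, rename=False))  # the same tuple object comes back
print(maybe_rename(names, rename=True))   # a newly built list
```

Annotating the return type as list[Hashable] would require the pass-through branch to copy with list(names), which is presumably one of the "other issues" mentioned.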
FWIW i usually use list/tuple/whatever on the theory that "well it's accurate and more specific than Sequence!" and then @simonjayhawkins tells me to use Sequence anyway
Looks good to me.
@simonjayhawkins should probably have a look at the Sequence vs list cases.
gentle ping @simonjayhawkins
cc @twoertwein @simonjayhawkins if you have any comments
pandas/io/parsers/base_parser.py
Outdated
-    def _evaluate_usecols(self, usecols, names):
+    @overload
+    def _evaluate_usecols(
+        self, usecols: set[int] | Callable[[Hashable], int], names: Sequence[Hashable]
can the callable return a bool?
No, the return values are indices
Sorry, I should have been clearer.
I'm curious as to why usecols is a callable with an int return type. ParserBase.__init__ is not typed, so it is not clear to me why this is int. bool is type compatible here, and surely any truthy value is valid. So the return type of the usecols callable is whatever the public API accepts. Is that any object that can be truthy?
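To illustrate the "type compatible" remark: bool is a subclass of int, so a type checker accepts a callable returning bool wherever an int return is declared. A small sketch with hypothetical names (count_selected and is_short are not part of pandas):

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import Callable


def count_selected(
    names: list[Hashable], pred: Callable[[Hashable], int]
) -> int:
    # Hypothetical helper: only the truthiness of pred(name) is used.
    return sum(1 for name in names if pred(name))


def is_short(name: Hashable) -> bool:
    return len(str(name)) <= 2


print(issubclass(bool, int))
# A bool-returning predicate satisfies the int-returning signature.
print(count_selected(["a", "bb", "ccc"], is_short))
```

So the int annotation does not produce an error for bool predicates, but it misdescribes the intent, which is a truth test rather than an index.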
ah got you, yeah you are correct, this has to return a bool
In the docs for, say, read_csv (https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html):
"If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True."
So it doesn't explicitly say it should be bool, but why has int been chosen?
Because I misinterpreted the return value of the callable as the return value of _evaluate_usecols. I confused the two with each other.
Sure. We do use bool in many places in the public API where the API probably accepts any object that can be truthy. This may become an issue for users when the types are public and users start getting false positives when type checking.
So changing to bool is fine for now.
Technically this could return anything that evaluates to True/False. But since the docs say that this has to evaluate to True, I think we can type it like this?
Based on the doc-string (I didn't look at the code), it should probably be Callable[[Hashable], object].
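A sketch of what the object return type permits. The function evaluate_usecols_sketch is a simplified stand-in written for this illustration, assuming the callable branch keeps the positions of names for which the callable is truthy; it is not the actual pandas method.

```python
from __future__ import annotations

import re
from collections.abc import Hashable, Sequence
from typing import Callable


def evaluate_usecols_sketch(
    usecols: Callable[[Hashable], object], names: Sequence[Hashable]
) -> set[int]:
    # Simplified stand-in: keep the positions of the names for which the
    # callable returns something truthy.
    return {i for i, name in enumerate(names) if usecols(name)}


names = ["id", "x_1", "x_2", "label"]

# The lambda returns re.Match or None rather than a bool; it is accepted
# because the declared return type is object.
print(evaluate_usecols_sketch(lambda n: re.match(r"x_", str(n)), names))
```

With Callable[[Hashable], bool], a type checker would likely flag this lambda even though it works at runtime, which is the false-positive concern raised earlier in the thread.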
Changed to object
thanks @phofl