You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have been reviewing the docstring for read_csv to add some clarifying edits in a PR and encountered some questions while reviewing the parameter descriptions. I found there were a few cases where the parameter types summary did not appear to align with the docstring specifications, as well as a few cases where the description of the types seems to differ from the type hints in the function signature.
I have made an attempt at describing those discrepancies here and have added a few questions in as well just to get clarification on how things are done. Once I get things figured out here, I'll plan to make any appropriate corrections to the docstring in a PR.
Parameters potentially needing edits
sep:
type hint: str | None | lib.NoDefault = lib.no_default
description: str, default ','
Why is the default not just ',' rather than lib.no_default?
delimiter:
type hint: str | None | lib.NoDefault = None
description: str, default None
It appears that default None should be replaced with str, optional
Also, I noticed for most parameters if the type hint includes lib.NoDefault then the default value is lib.no_default but this one has None instead. What is the reason this one is different? (And I have more questions about the no_default flag later on.)
header:
type hint: int | Sequence[int] | None | Literal["infer"] = "infer"
description: int, list of int, None, default 'infer'
I would expect 'infer' to occur in the list of options before being highlighted as the default, like so: int, list of int, 'infer' or None, default 'infer'
names:
type hint: Sequence[Hashable] | None | lib.NoDefault = lib.no_default
description: array-like, optional
Is there a reason it is described as array-like? It seems to only need to be a sequence of unique names which would be more like an ordered set, or for simplicity maybe just list-like? Or should it really be list of Hashable to be specific? Or even better Sequence of Hashable?
index_col:
type hint: IndexLabel | Literal[False] | None = None
description: int, str, sequence of int / str, or False, optional, default None
it appears the default None is redundant since optional is already specified, so I assume this can be removed, is that correct? Something like: int, str, list of int, list of str or False, optional
Also, I notice that list is used in the descriptions rather than sequence in most cases althoughsequence might be more accurate and descriptive. I'm wondering if there's a reason for that being used less often. This seems to be the only parameter that employs sequence in its description.
usecols:
type hint: list[HashableT] | Callable[[Hashable], bool] | None = None
description: list-like or callable, optional
Since the order of this parameter does not affect the result, and presumably only unique values are allowed, I am wondering if it would be clearer to describe it as set-like? List-like still seems appropriate but just wondering if it could be more specific to highlight the idea that order will not be preserved.
dtype:
type hint: DtypeArg | None = None
description: Type name or dict of column -> type, optional
From the docstring guide, it looks like the description of a dict should refer to its data types, not necessarily their meaning. And it says to use dict of {key : value} so would something like this be more appropriate? type, Hashable or dict of {Hashable : type or str}, optional
engine:
type hint: CSVEngine | None = None
description: {'c', 'python', 'pyarrow'}, optional
This description does not appear to specify the behavior when None is passed. Does it default to the 'c' engine? It seems like older versions used to say this, but I don't see it explicitly described in the current version. And if so, would it then make sense to define the type hint as folllow? CSVEngine = 'c'
converters:
type hint: Mapping[Hashable, Callable] | None = None
description: dict, optional
The docstring guide suggest this description should be: dict of {Hashable : Callable}
skiprows:
type hint: list[int] | int | Callable[[Hashable], bool] | None = None
description: list-like, int or callable, optional
This could be made more specific by changing to:: list of int, int or Callable, optional
na_values:
type hint: Sequence[str] | Mapping[str, Sequence[str]] | None = None
description: scalar, str, list-like, or dict, optional
The mention of scalar as a valid type is not clear to me; what would that be referring to? Also, it says that str is a valid input; however, the type hint does not include that as an option. What should be changed to resolve this?
parse_dates:
type hint: bool | Sequence[Hashable] | None = None
description: bool or list of int or names or list of lists or dict, default False
This also appears to not align with the type hint which says you can only assign either a bool or a list. Also the default seems to actually be None, yet the description says it's False. Looking for clarification here.
infer_datetime_format:
type hint: bool | lib.NoDefault = lib.no_default
description: bool, default False
It appears that the default is technically lib.no_default but descibed to be False. What is the reason for doing this rather than assigning its default as False, such as how keep_date_col is defined?
date_parser:
type hint: Callable | lib.NoDefault = lib.no_default
description: function, optional
Could be updated to use a specific type, i.e.: Callable, optional
Also, what is the reason for using lib.no_default instead of just None?
date_format:
type hint: str | None = None
description: str or dict of column -> format, default None
The type hint does not appear to match the description here as there is no mention of dict being a valid argument. Is the description out of date or vice versa?
If the description is accurate, then Similar to dtype, this could be updated as: str or dict of {Hashable : str}, optional
compression:
type hint: CompressionOptions = "infer"
description: str or dict, default 'infer'
Could the dict in the description be updated as dict of {k : v}? If so, what would the valid types be for k and v, as I was not totally clear about it from the description.
thousands:
type hint: str | None = None
description: str, optional
I assume this should be a str (length 1) like some other parameters. Should that be updated?
decimal:
type hint: str = "."
description: str, default '.'
I assume this should be a str (length 1) like some other parameters. Should that be updated?
quoting:
type hint: int = csv.QUOTE_MINIMAL
description: int or csv.QUOTE_* instance, default 0
It would make sense to add some context to the default here, maybe something like: int or csv.QUOTE_* instance, default 0 meaning csv.QUOTE_MINIMAL
comment:
type hint: str | None = None
description: str, optional
The description says this should be a single character so it should be changed to: str (length 1), optional
encoding:
type hint: str | None = None
description: str, optional, default "utf-8"
The default is technically None but the description says optional, default 'utf-8'. Is there a reason the default is not just 'utf-8' instead of None?
encoding_errors:
type hint: str | None = "strict"
description: str, optional, default "strict"
I am wondering why this is both optional and has a default. What is the behavior when None is passed?
on_bad_lines:
type hint: str = "error"
description: {'error', 'warn', 'skip'} or callable, default 'error'
The type hint does not mention Callable as an option as described in the docstring. Does this need to be updated?
float_precision:
type hint: Literal["high", "legacy"] | None = None
description: str, optional
It looks like the description should not be the generic str, but rather: {'high', 'legacy'}, optional
dtype_backend:
type hint: DtypeBackend | lib.NoDefault = lib.no_default
description: {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
Given that this defaults to NumPy backed DataFrames, why is lib.no_default used rather than just `'numpy_nullable'*?
General Questions:
Here is a summary of the most common questions that came up while reviewing the parameters:
Why are some 'optional' parameters not assigned None and instead use lib.no_default (e.g., delimiter, index_col, date_format)?
Why are some values with defined default behavior not assigned a literal that would trigger that behaviour but instead use None or lib.no_default as a flag (e.g., encoding, dtype_backend)?
What is the appropriate way to describe a 'list-like' object? Is there a best practice for using 'list of ' vs. 'list-like' vs. 'array-like' vs. 'sequence`?
Suggested fix for documentation
My suggested fix would be to create a PR where I update each parameter description as described in the Documentation Problem section such that the descriptions align with both the docstring guide and the function signature's type hints in every case.
The text was updated successfully, but these errors were encountered:
lib.no_default is used as our default so we can tell if an user is passing None. It is treated like an user passed None, so the docstring is correct, and I don't think anything should be changed there.
(this is useful for e.g. deprecations)
Pandas version checks
main
hereLocation of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Documentation problem
I have been reviewing the docstring for
read_csv
to add some clarifying edits in a PR and encountered some questions while reviewing the parameter descriptions. I found there were a few cases where the parameter types summary did not appear to align with the docstring specifications, as well as a few cases where the description of the types seems to differ from the type hints in the function signature.I have made an attempt at describing those discrepancies here and have added a few questions in as well just to get clarification on how things are done. Once I get things figured out here, I'll plan to make any appropriate corrections to the docstring in a PR.
Parameters potentially needing edits
sep:
str | None | lib.NoDefault = lib.no_default
Why is the default not just
','
rather thanlib.no_default
?delimiter:
str | None | lib.NoDefault = None
It appears that default None should be replaced with str, optional
Also, I noticed for most parameters if the type hint includes
lib.NoDefault
then the default value islib.no_default
but this one hasNone
instead. What is the reason this one is different? (And I have more questions about theno_default
flag later on.)header:
int | Sequence[int] | None | Literal["infer"] = "infer"
I would expect 'infer' to occur in the list of options before being highlighted as the default, like so:
int, list of int, 'infer' or None, default 'infer'
names:
Sequence[Hashable] | None | lib.NoDefault = lib.no_default
Is there a reason it is described as array-like? It seems to only need to be a sequence of unique names which would be more like an ordered set, or for simplicity maybe just list-like? Or should it really be list of Hashable to be specific? Or even better Sequence of Hashable?
index_col:
IndexLabel | Literal[False] | None = None
it appears the default None is redundant since optional is already specified, so I assume this can be removed, is that correct? Something like:
int, str, list of int, list of str or False, optional
Also, I notice that list is used in the descriptions rather than sequence in most cases althoughsequence might be more accurate and descriptive. I'm wondering if there's a reason for that being used less often. This seems to be the only parameter that employs sequence in its description.
usecols:
list[HashableT] | Callable[[Hashable], bool] | None = None
Since the order of this parameter does not affect the result, and presumably only unique values are allowed, I am wondering if it would be clearer to describe it as set-like? List-like still seems appropriate but just wondering if it could be more specific to highlight the idea that order will not be preserved.
dtype:
DtypeArg | None = None
From the docstring guide, it looks like the description of a dict should refer to its data types, not necessarily their meaning. And it says to use dict of {key : value} so would something like this be more appropriate?
type, Hashable or dict of {Hashable : type or str}, optional
engine:
CSVEngine | None = None
This description does not appear to specify the behavior when
None
is passed. Does it default to the 'c' engine? It seems like older versions used to say this, but I don't see it explicitly described in the current version. And if so, would it then make sense to define the type hint as folllow?CSVEngine = 'c'
converters:
Mapping[Hashable, Callable] | None = None
The docstring guide suggest this description should be:
dict of {Hashable : Callable}
skiprows:
list[int] | int | Callable[[Hashable], bool] | None = None
This could be made more specific by changing to::
list of int, int or Callable, optional
na_values:
Sequence[str] | Mapping[str, Sequence[str]] | None = None
The mention of scalar as a valid type is not clear to me; what would that be referring to? Also, it says that str is a valid input; however, the type hint does not include that as an option. What should be changed to resolve this?
parse_dates:
bool | Sequence[Hashable] | None = None
This also appears to not align with the type hint which says you can only assign either a bool or a list. Also the default seems to actually be
None
, yet the description says it'sFalse
. Looking for clarification here.infer_datetime_format:
bool | lib.NoDefault = lib.no_default
It appears that the default is technically
lib.no_default
but descibed to beFalse
. What is the reason for doing this rather than assigning its default asFalse
, such as howkeep_date_col
is defined?date_parser:
Callable | lib.NoDefault = lib.no_default
Could be updated to use a specific type, i.e.:
Callable, optional
Also, what is the reason for using
lib.no_default
instead of justNone
?date_format:
str | None = None
The type hint does not appear to match the description here as there is no mention of
dict
being a valid argument. Is the description out of date or vice versa?If the description is accurate, then Similar to
dtype
, this could be updated as:str or dict of {Hashable : str}, optional
compression:
CompressionOptions = "infer"
Could the
dict
in the description be updated asdict of {k : v}
? If so, what would the valid types be for k and v, as I was not totally clear about it from the description.thousands:
str | None = None
I assume this should be a str (length 1) like some other parameters. Should that be updated?
decimal:
str = "."
I assume this should be a str (length 1) like some other parameters. Should that be updated?
quoting:
int = csv.QUOTE_MINIMAL
It would make sense to add some context to the default here, maybe something like:
int or csv.QUOTE_* instance, default 0 meaning csv.QUOTE_MINIMAL
comment:
str | None = None
The description says this should be a single character so it should be changed to:
str (length 1), optional
encoding:
str | None = None
The default is technically
None
but the description says optional, default 'utf-8'. Is there a reason the default is not just 'utf-8' instead of None?encoding_errors:
str | None = "strict"
I am wondering why this is both optional and has a default. What is the behavior when
None
is passed?on_bad_lines:
str = "error"
The type hint does not mention Callable as an option as described in the docstring. Does this need to be updated?
float_precision:
Literal["high", "legacy"] | None = None
It looks like the description should not be the generic
str
, but rather:{'high', 'legacy'}, optional
dtype_backend:
DtypeBackend | lib.NoDefault = lib.no_default
Given that this defaults to NumPy backed DataFrames, why is
lib.no_default
used rather than just `'numpy_nullable'*?General Questions:
Here is a summary of the most common questions that came up while reviewing the parameters:
None
and instead uselib.no_default
(e.g.,delimiter
,index_col
,date_format
)?None
orlib.no_default
as a flag (e.g.,encoding
,dtype_backend
)?Suggested fix for documentation
My suggested fix would be to create a PR where I update each parameter description as described in the Documentation Problem section such that the descriptions align with both the docstring guide and the function signature's type hints in every case.
The text was updated successfully, but these errors were encountered: