DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

tpaxman · 2023-06-21T05:43:46Z

Pandas version checks

I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Documentation problem

I have been reviewing the docstring for read_csv to add some clarifying edits in a PR and encountered some questions while reviewing the parameter descriptions. I found there were a few cases where the parameter types summary did not appear to align with the docstring specifications, as well as a few cases where the description of the types seems to differ from the type hints in the function signature.

I have made an attempt at describing those discrepancies here and have added a few questions in as well just to get clarification on how things are done. Once I get things figured out here, I'll plan to make any appropriate corrections to the docstring in a PR.

Parameters potentially needing edits

sep:

type hint: str | None | lib.NoDefault = lib.no_default
description: str, default ','

Why is the default not just ',' rather than lib.no_default?

delimiter:

type hint: str | None | lib.NoDefault = None
description: str, default None

It appears that default None should be replaced with str, optional

Also, I noticed for most parameters if the type hint includes lib.NoDefault then the default value is lib.no_default but this one has None instead. What is the reason this one is different? (And I have more questions about the no_default flag later on.)

header:

type hint: int | Sequence[int] | None | Literal["infer"] = "infer"
description: int, list of int, None, default 'infer'

I would expect 'infer' to occur in the list of options before being highlighted as the default, like so:
int, list of int, 'infer' or None, default 'infer'

names:

type hint: Sequence[Hashable] | None | lib.NoDefault = lib.no_default
description: array-like, optional

Is there a reason it is described as array-like? It seems to only need to be a sequence of unique names which would be more like an ordered set, or for simplicity maybe just list-like? Or should it really be list of Hashable to be specific? Or even better Sequence of Hashable?

index_col:

type hint: IndexLabel | Literal[False] | None = None
description: int, str, sequence of int / str, or False, optional, default None

it appears the default None is redundant since optional is already specified, so I assume this can be removed, is that correct? Something like:
int, str, list of int, list of str or False, optional

Also, I notice that list is used in the descriptions rather than sequence in most cases althoughsequence might be more accurate and descriptive. I'm wondering if there's a reason for that being used less often. This seems to be the only parameter that employs sequence in its description.

usecols:

type hint: list[HashableT] | Callable[[Hashable], bool] | None = None
description: list-like or callable, optional

Since the order of this parameter does not affect the result, and presumably only unique values are allowed, I am wondering if it would be clearer to describe it as set-like? List-like still seems appropriate but just wondering if it could be more specific to highlight the idea that order will not be preserved.

dtype:

type hint: DtypeArg | None = None
description: Type name or dict of column -> type, optional

From the docstring guide, it looks like the description of a dict should refer to its data types, not necessarily their meaning. And it says to use dict of {key : value} so would something like this be more appropriate?
type, Hashable or dict of {Hashable : type or str}, optional

engine:

type hint: CSVEngine | None = None
description: {'c', 'python', 'pyarrow'}, optional

This description does not appear to specify the behavior when None is passed. Does it default to the 'c' engine? It seems like older versions used to say this, but I don't see it explicitly described in the current version. And if so, would it then make sense to define the type hint as folllow?
CSVEngine = 'c'

converters:

type hint: Mapping[Hashable, Callable] | None = None
description: dict, optional

The docstring guide suggest this description should be:
dict of {Hashable : Callable}

skiprows:

type hint: list[int] | int | Callable[[Hashable], bool] | None = None
description: list-like, int or callable, optional

This could be made more specific by changing to::
list of int, int or Callable, optional

na_values:

type hint: Sequence[str] | Mapping[str, Sequence[str]] | None = None
description: scalar, str, list-like, or dict, optional

The mention of scalar as a valid type is not clear to me; what would that be referring to? Also, it says that str is a valid input; however, the type hint does not include that as an option. What should be changed to resolve this?

parse_dates:

type hint: bool | Sequence[Hashable] | None = None
description: bool or list of int or names or list of lists or dict, default False

This also appears to not align with the type hint which says you can only assign either a bool or a list. Also the default seems to actually be None, yet the description says it's False. Looking for clarification here.

infer_datetime_format:

type hint: bool | lib.NoDefault = lib.no_default
description: bool, default False

It appears that the default is technically lib.no_default but descibed to be False. What is the reason for doing this rather than assigning its default as False, such as how keep_date_col is defined?

date_parser:

type hint: Callable | lib.NoDefault = lib.no_default
description: function, optional

Could be updated to use a specific type, i.e.:
Callable, optional

Also, what is the reason for using lib.no_default instead of just None?

date_format:

type hint: str | None = None
description: str or dict of column -> format, default None

The type hint does not appear to match the description here as there is no mention of dict being a valid argument. Is the description out of date or vice versa?

If the description is accurate, then Similar to dtype, this could be updated as:
str or dict of {Hashable : str}, optional

compression:

type hint: CompressionOptions = "infer"
description: str or dict, default 'infer'

Could the dict in the description be updated as dict of {k : v}? If so, what would the valid types be for k and v, as I was not totally clear about it from the description.

thousands:

type hint: str | None = None
description: str, optional

I assume this should be a str (length 1) like some other parameters. Should that be updated?

decimal:

type hint: str = "."
description: str, default '.'

I assume this should be a str (length 1) like some other parameters. Should that be updated?

quoting:

type hint: int = csv.QUOTE_MINIMAL
description: int or csv.QUOTE_* instance, default 0

It would make sense to add some context to the default here, maybe something like:
int or csv.QUOTE_* instance, default 0 meaning csv.QUOTE_MINIMAL

comment:

type hint: str | None = None
description: str, optional

The description says this should be a single character so it should be changed to:
str (length 1), optional

encoding:

type hint: str | None = None
description: str, optional, default "utf-8"

The default is technically None but the description says optional, default 'utf-8'. Is there a reason the default is not just 'utf-8' instead of None?

encoding_errors:

type hint: str | None = "strict"
description: str, optional, default "strict"

I am wondering why this is both optional and has a default. What is the behavior when None is passed?

on_bad_lines:

type hint: str = "error"
description: {'error', 'warn', 'skip'} or callable, default 'error'

The type hint does not mention Callable as an option as described in the docstring. Does this need to be updated?

float_precision:

type hint: Literal["high", "legacy"] | None = None
description: str, optional

It looks like the description should not be the generic str, but rather:
{'high', 'legacy'}, optional

dtype_backend:

type hint: DtypeBackend | lib.NoDefault = lib.no_default
description: {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames

Given that this defaults to NumPy backed DataFrames, why is lib.no_default used rather than just `'numpy_nullable'*?

General Questions:

Here is a summary of the most common questions that came up while reviewing the parameters:

Why are some 'optional' parameters not assigned None and instead use lib.no_default (e.g., delimiter, index_col, date_format)?
Why are some values with defined default behavior not assigned a literal that would trigger that behaviour but instead use None or lib.no_default as a flag (e.g., encoding, dtype_backend)?
What is the appropriate way to describe a 'list-like' object? Is there a best practice for using 'list of ' vs. 'list-like' vs. 'array-like' vs. 'sequence`?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

tpaxman commented Jun 21, 2023 •

edited

Loading

lithomas1 commented Jun 22, 2023

DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

Comments

tpaxman commented Jun 21, 2023 • edited Loading

Pandas version checks

Location of the documentation

Documentation problem

Parameters potentially needing edits

sep:

delimiter:

header:

names:

index_col:

usecols:

dtype:

engine:

converters:

skiprows:

na_values:

parse_dates:

infer_datetime_format:

date_parser:

date_format:

compression:

thousands:

decimal:

quoting:

comment:

encoding:

encoding_errors:

on_bad_lines:

float_precision:

dtype_backend:

General Questions:

Suggested fix for documentation

lithomas1 commented Jun 22, 2023

tpaxman commented Jun 21, 2023 •

edited

Loading