Skip to content

DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 task done
tpaxman opened this issue Jun 21, 2023 · 1 comment · Fixed by #53834
Closed
1 task done

DOC: discrepancies in read_csv docstring between docstring guide or type hints #53763

tpaxman opened this issue Jun 21, 2023 · 1 comment · Fixed by #53834
Labels

Comments

@tpaxman
Copy link
Contributor

tpaxman commented Jun 21, 2023

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Documentation problem

I have been reviewing the docstring for read_csv to add some clarifying edits in a PR and encountered some questions while reviewing the parameter descriptions. I found there were a few cases where the parameter types summary did not appear to align with the docstring specifications, as well as a few cases where the description of the types seems to differ from the type hints in the function signature.

I have made an attempt at describing those discrepancies here and have added a few questions in as well just to get clarification on how things are done. Once I get things figured out here, I'll plan to make any appropriate corrections to the docstring in a PR.

Parameters potentially needing edits

sep:

  • type hint: str | None | lib.NoDefault = lib.no_default
  • description: str, default ','

Why is the default not just ',' rather than lib.no_default?

delimiter:

  • type hint: str | None | lib.NoDefault = None
  • description: str, default None

It appears that default None should be replaced with str, optional

Also, I noticed for most parameters if the type hint includes lib.NoDefault then the default value is lib.no_default but this one has None instead. What is the reason this one is different? (And I have more questions about the no_default flag later on.)

header:

  • type hint: int | Sequence[int] | None | Literal["infer"] = "infer"
  • description: int, list of int, None, default 'infer'

I would expect 'infer' to occur in the list of options before being highlighted as the default, like so:
int, list of int, 'infer' or None, default 'infer'

names:

  • type hint: Sequence[Hashable] | None | lib.NoDefault = lib.no_default
  • description: array-like, optional

Is there a reason it is described as array-like? It seems to only need to be a sequence of unique names which would be more like an ordered set, or for simplicity maybe just list-like? Or should it really be list of Hashable to be specific? Or even better Sequence of Hashable?

index_col:

  • type hint: IndexLabel | Literal[False] | None = None
  • description: int, str, sequence of int / str, or False, optional, default None

it appears the default None is redundant since optional is already specified, so I assume this can be removed, is that correct? Something like:
int, str, list of int, list of str or False, optional

Also, I notice that list is used in the descriptions rather than sequence in most cases althoughsequence might be more accurate and descriptive. I'm wondering if there's a reason for that being used less often. This seems to be the only parameter that employs sequence in its description.

usecols:

  • type hint: list[HashableT] | Callable[[Hashable], bool] | None = None
  • description: list-like or callable, optional

Since the order of this parameter does not affect the result, and presumably only unique values are allowed, I am wondering if it would be clearer to describe it as set-like? List-like still seems appropriate but just wondering if it could be more specific to highlight the idea that order will not be preserved.

dtype:

  • type hint: DtypeArg | None = None
  • description: Type name or dict of column -> type, optional

From the docstring guide, it looks like the description of a dict should refer to its data types, not necessarily their meaning. And it says to use dict of {key : value} so would something like this be more appropriate?
type, Hashable or dict of {Hashable : type or str}, optional

engine:

  • type hint: CSVEngine | None = None
  • description: {'c', 'python', 'pyarrow'}, optional

This description does not appear to specify the behavior when None is passed. Does it default to the 'c' engine? It seems like older versions used to say this, but I don't see it explicitly described in the current version. And if so, would it then make sense to define the type hint as folllow?
CSVEngine = 'c'

converters:

  • type hint: Mapping[Hashable, Callable] | None = None
  • description: dict, optional

The docstring guide suggest this description should be:
dict of {Hashable : Callable}

skiprows:

  • type hint: list[int] | int | Callable[[Hashable], bool] | None = None
  • description: list-like, int or callable, optional

This could be made more specific by changing to::
list of int, int or Callable, optional

na_values:

  • type hint: Sequence[str] | Mapping[str, Sequence[str]] | None = None
  • description: scalar, str, list-like, or dict, optional

The mention of scalar as a valid type is not clear to me; what would that be referring to? Also, it says that str is a valid input; however, the type hint does not include that as an option. What should be changed to resolve this?

parse_dates:

  • type hint: bool | Sequence[Hashable] | None = None
  • description: bool or list of int or names or list of lists or dict, default False

This also appears to not align with the type hint which says you can only assign either a bool or a list. Also the default seems to actually be None, yet the description says it's False. Looking for clarification here.

infer_datetime_format:

  • type hint: bool | lib.NoDefault = lib.no_default
  • description: bool, default False

It appears that the default is technically lib.no_default but descibed to be False. What is the reason for doing this rather than assigning its default as False, such as how keep_date_col is defined?

date_parser:

  • type hint: Callable | lib.NoDefault = lib.no_default
  • description: function, optional

Could be updated to use a specific type, i.e.:
Callable, optional

Also, what is the reason for using lib.no_default instead of just None?

date_format:

  • type hint: str | None = None
  • description: str or dict of column -> format, default None

The type hint does not appear to match the description here as there is no mention of dict being a valid argument. Is the description out of date or vice versa?

If the description is accurate, then Similar to dtype, this could be updated as:
str or dict of {Hashable : str}, optional

compression:

  • type hint: CompressionOptions = "infer"
  • description: str or dict, default 'infer'

Could the dict in the description be updated as dict of {k : v}? If so, what would the valid types be for k and v, as I was not totally clear about it from the description.

thousands:

  • type hint: str | None = None
  • description: str, optional

I assume this should be a str (length 1) like some other parameters. Should that be updated?

decimal:

  • type hint: str = "."
  • description: str, default '.'

I assume this should be a str (length 1) like some other parameters. Should that be updated?

quoting:

  • type hint: int = csv.QUOTE_MINIMAL
  • description: int or csv.QUOTE_* instance, default 0

It would make sense to add some context to the default here, maybe something like:
int or csv.QUOTE_* instance, default 0 meaning csv.QUOTE_MINIMAL

comment:

  • type hint: str | None = None
  • description: str, optional

The description says this should be a single character so it should be changed to:
str (length 1), optional

encoding:

  • type hint: str | None = None
  • description: str, optional, default "utf-8"

The default is technically None but the description says optional, default 'utf-8'. Is there a reason the default is not just 'utf-8' instead of None?

encoding_errors:

  • type hint: str | None = "strict"
  • description: str, optional, default "strict"

I am wondering why this is both optional and has a default. What is the behavior when None is passed?

on_bad_lines:

  • type hint: str = "error"
  • description: {'error', 'warn', 'skip'} or callable, default 'error'

The type hint does not mention Callable as an option as described in the docstring. Does this need to be updated?

float_precision:

  • type hint: Literal["high", "legacy"] | None = None
  • description: str, optional

It looks like the description should not be the generic str, but rather:
{'high', 'legacy'}, optional

dtype_backend:

  • type hint: DtypeBackend | lib.NoDefault = lib.no_default
  • description: {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames

Given that this defaults to NumPy backed DataFrames, why is lib.no_default used rather than just `'numpy_nullable'*?

General Questions:

Here is a summary of the most common questions that came up while reviewing the parameters:

  • Why are some 'optional' parameters not assigned None and instead use lib.no_default (e.g., delimiter, index_col, date_format)?
  • Why are some values with defined default behavior not assigned a literal that would trigger that behaviour but instead use None or lib.no_default as a flag (e.g., encoding, dtype_backend)?
  • What is the appropriate way to describe a 'list-like' object? Is there a best practice for using 'list of ' vs. 'list-like' vs. 'array-like' vs. 'sequence`?

Suggested fix for documentation

My suggested fix would be to create a PR where I update each parameter description as described in the Documentation Problem section such that the descriptions align with both the docstring guide and the function signature's type hints in every case.

@tpaxman tpaxman added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 21, 2023
@lithomas1
Copy link
Member

lib.no_default is used as our default so we can tell if an user is passing None. It is treated like an user passed None, so the docstring is correct, and I don't think anything should be changed there.
(this is useful for e.g. deprecations)

You can open a PR for the rest though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants