DOC: Unclear description for header in read_csv docstring #53673

Open
1 task done
tpaxman opened this issue Jun 14, 2023 · 2 comments

@tpaxman
Contributor

tpaxman commented Jun 14, 2023

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Documentation problem

I have noted the following issues of clarity in the read_csv docstring description of the header parameter:

  • The behavior associated with header=None is not explicitly defined.
  • The default behavior is described in terms of header=0 and header=None, neither of which has been clearly explained at that point.
  • The relationship between file line numbers (which are conventionally numbered from 1) and row numbers/indices (which are indexed from 0) is not described explicitly (it is only implied by examples such as header=0 meaning the first line).
  • The description "if column names are passed explicitly" is vague because it does not say how they are passed (i.e. via the names parameter).
  • The detailed descriptions do not follow the order given in the initial list of accepted values.
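
To make the line-number point concrete, here is a minimal sketch, assuming pandas is installed. The CSV text and the "#" comment marker are made up for illustration. It shows that index 0 refers to the first line left after comment and blank lines are dropped, not necessarily to physical line 1 of the file.

```python
import io

import pandas as pd

# Made-up sample: a comment line and a blank line precede the header line.
raw = "# exported by a hypothetical tool\n\na,b\n1,2\n3,4\n"

# Default header='infer' with no names passed behaves like header=0.
df_default = pd.read_csv(io.StringIO(raw), comment="#")

# Index 0 is the first line remaining once the comment line and the blank
# line are ignored, which is the third physical line of the text ("a,b").
df_explicit = pd.read_csv(io.StringIO(raw), comment="#", header=0)

print(list(df_default.columns))
print(df_default.equals(df_explicit))
```

Here the header line is physical line 3 of the text, yet it is addressed as header=0.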

Suggested fix for documentation

Original docstring:

header : int, list of int, None, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
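
The interaction between the default and the names parameter described in this docstring can be sketched as follows, assuming pandas is installed. The sample data and the replacement labels "x" and "y" are made up for illustration.

```python
import io

import pandas as pd

raw = "a,b\n1,2\n3,4\n"

# names passed, header left at 'infer': acts like header=None, so the
# file's own header line "a,b" is kept as the first row of data.
df_infer = pd.read_csv(io.StringIO(raw), names=["x", "y"])

# names passed together with header=0: line 0 is consumed as the header
# and its labels are replaced by the given names.
df_replace = pd.read_csv(io.StringIO(raw), names=["x", "y"], header=0)

print(df_infer)
print(df_replace)
```

The first call yields three rows (the old header included), the second yields two.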

Proposed change to address the issues:

header : int, list of int, None, default 'infer'
Index or indices corresponding to line number(s) in the CSV file that will be read as DataFrame column labels. Index 0 corresponds to the first line in the file (or the first non-blank, non-commented line if skip_blank_lines=True). The following arguments are valid:

  • Single int: denotes the line index at which column labels will be read.
  • List of int: denotes line indices at which column labels will be read as a multi-index. Note: intervening rows not specified in the list will be skipped (e.g., for header=[0,1,3], the line at index 2 will be skipped).
  • None: indicates that none of the lines in the file will be interpreted as headers and columns will instead be labelled by column index (or by values passed to the names parameter when provided). This is typically for files with no header. If the file has a header which the user intends to override with the names parameter, header should be assigned 0 instead of None.
  • 'infer' (default): behaves as header=0 if no names were passed, otherwise as header=None.
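
As a rough illustration of the argument forms listed above (a sketch assuming pandas is installed, with made-up sample data), the same text can be read with a single int, a list of int, and None:

```python
import io

import pandas as pd

raw = "a,b\nx,y\nskipped,skipped\np,q\n1,2\n"

# Single int: the line at index 1 supplies the labels; line 0 is discarded.
df_int = pd.read_csv(io.StringIO(raw), header=1)

# List of int: lines 0, 1 and 3 form a three-level column index;
# the line at index 2 is not listed, so it is skipped.
df_list = pd.read_csv(io.StringIO(raw), header=[0, 1, 3])

# None: no line is treated as a header; columns are labelled 0, 1, ...
df_none = pd.read_csv(io.StringIO(raw), header=None)

print(list(df_int.columns))
print(list(df_list.columns))
print(list(df_none.columns))
```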

Note: this is more in line with how the read_excel function is described, so it would also make the two similar functions more consistent. I would also like to propose further edits to other parameter descriptions in the function, but wanted to gauge support first by focusing on one specific example.

@tpaxman tpaxman added the Docs and Needs Triage labels Jun 14, 2023
@lithomas1 lithomas1 removed the Needs Triage label Jun 14, 2023
@lithomas1
Member

Thanks for opening this. PRs are welcome.

@tpaxman
Contributor Author

tpaxman commented Jun 15, 2023

take
