
DOC: clarify how read_csv nrows interacts with header and skiprows argument #59078

Closed
1 task done
mdavis-xyz opened this issue Jun 24, 2024 · 3 comments · Fixed by #59467

@mdavis-xyz
Contributor

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Documentation problem

The documentation for read_csv's nrows argument says:

Number of rows of file to read. Useful for reading pieces of large files.

I want to read a file using header=1 and then limit the number of rows. The documentation says nrows counts the "number of rows of file". To me that sounds like it includes the skipped row and the column header row, since pandas still reads those rows from the file.

My testing shows otherwise: nrows counts only data rows. It excludes the rows before the column header and the column header row itself. Rows skipped with skiprows behave the same way and are not counted towards nrows. A row that is entirely a comment does not count towards nrows either.

from io import StringIO

import pandas as pd

csv = """extra,
a,b
1,1
#comment,comment
2,2
3,3
footer,blah,yeah
"""

# header=1 makes "a,b" the column header; the "#" row is dropped as a comment
with StringIO(csv) as buf:
    df = pd.read_csv(buf, header=1, nrows=2, comment='#')

With nrows=2, the result always seems to contain exactly 2 data rows.

Suggested fix for documentation

Number of rows of data to read. Useful for reading pieces of large files. Refers to the number of included data rows. The following rows are not included in the count:

  • the column header
  • rows before the column header, if header=1 or larger
  • rows which are fully comment rows
  • rows skipped with skiprows
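
A minimal sketch that exercises all four exclusions in one call, in case it helps whoever writes the PR. The sample text and the variable names are made up for illustration, and it assumes the default parser engine; it only restates the behaviour described above rather than anything the docs currently promise.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: one line to skip, one pre-header line, the header,
# four data rows, and one fully commented line in between.
CSV_TEXT = """skip me
extra,
a,b
1,1
#comment,comment
2,2
3,3
4,4
"""

df = pd.read_csv(
    StringIO(CSV_TEXT),
    skiprows=1,   # drops "skip me" before anything else is counted
    header=1,     # of the remaining lines, "a,b" becomes the header
    nrows=2,      # expected to count parsed data rows only
    comment="#",  # drops the fully commented line
)

print(df)
print(len(df))
```

If nrows counted physical lines of the file, this call would return fewer than two data rows, so the length of the result is the thing to look at.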
mdavis-xyz added the Docs and Needs Triage labels on Jun 24, 2024
Aloqeely removed the Needs Triage label on Jun 24, 2024
@Aloqeely
Member

Thanks for the suggestion. PRs are welcomed!

@mdavis-xyz
Contributor Author

Ok. I'll do that in the next week.

Note to myself: I need to test how rows with quoted newlines within string cells are counted.
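
A possible starting point for that test, as a sketch only. The sample text is hypothetical, and my working assumption (still to be confirmed across pandas versions and parser engines) is that nrows counts parsed records, so a quoted cell containing a newline would count as one row even though it spans two physical lines.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the first data record has a quoted cell with an
# embedded newline, so it occupies two physical lines of the file.
CSV_TEXT = 'a,b\n1,"x\ny"\n2,z\n3,w\n'

df = pd.read_csv(StringIO(CSV_TEXT), nrows=2)

# Inspect how many records came back and what the multi-line cell holds.
print(len(df))
print(df["b"].tolist())
```

Comparing the printed length against nrows shows whether the embedded newline was counted as an extra row.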

@johnyu013
Contributor

take
