DOC: Add TextFileReader to docs #46308


Closed
wants to merge 8 commits
13 changes: 13 additions & 0 deletions doc/source/reference/io.rst
@@ -25,6 +25,19 @@ Flat file
DataFrame.to_csv
read_fwf

.. currentmodule:: pandas.io.parsers

.. autosummary::
:toctree: api/

TextFileReader

TextFileReader.get_chunk
Member Author:

We obviously don't want to add these. But when I remove them, Sphinx complains that it cannot find them in any toctree. Does anyone have ideas on how to solve this? For ExcelWriter below this works, so I am probably missing something.

Member:

I think in the past we used a section with :hidden:, like in https://github.com/pandas-dev/pandas/blame/main/doc/source/getting_started/index.rst#L641

But I don't see it being used for the API anymore. I guess we should just make them private by using _get_chunk... if we don't want them public and in the documentation.

Or maybe just _TextFileReader and make the whole class private if we don't want it being part of our public API.

I'm personally fine with any of them.

Member Author:

I think we could deprecate it and remove it later, that sounds good to me.

We can't make TextFileReader private, since it is returned by read_csv if you are reading the file in chunks.
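The point above can be checked with a short sketch: with `chunksize` set, `read_csv` hands back a `TextFileReader` rather than a `DataFrame`, so the class is part of the function's public return surface. The CSV data here is made up for illustration.

```python
from io import StringIO

import pandas as pd
from pandas.io.parsers import TextFileReader

# Illustrative in-memory CSV instead of a file on disk.
csv_data = StringIO("a,b\n1,2\n3,4\n5,6\n")

# With chunksize set, read_csv returns a TextFileReader, not a DataFrame.
reader = pd.read_csv(csv_data, chunksize=2)
print(isinstance(reader, TextFileReader))  # True
reader.close()
```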

Member:

Ah, true. Sounds good to me then. Also fine to simply document the methods and publish them in the docs. Maybe we can just add a note for now recommending the magic-methods way of using it.

TextFileReader.close
TextFileReader.read
Member:

close and read can probably be public, but more important are the magic methods __enter__, __exit__, and __next__. Ideally, people interact with TextFileReader in this manner:

with pd.read_csv("test.csv", iterator=True) as reader:
    for chunk in reader:
        print(chunk)
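A self-contained variant of the pattern above, using an in-memory CSV instead of the hypothetical "test.csv" and passing `chunksize` so each loop iteration yields a fixed-size piece (the data is illustrative):

```python
from io import StringIO

import pandas as pd

csv_data = "a,b\n1,2\n3,4\n5,6\n7,8\n"

# The with-statement drives __enter__/__exit__ (closing the file handle on
# exit), and the for-loop drives __next__; each item is a DataFrame of
# chunksize rows.
with pd.read_csv(StringIO(csv_data), chunksize=2) as reader:
    sizes = [len(chunk) for chunk in reader]

print(sizes)  # [2, 2]
```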

Member:

Would need to add a docstring to read (and maybe also to close).
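For reference, a sketch of how the two methods under discussion behave together: get_chunk pulls an explicit number of rows, and read with no argument consumes whatever remains. The CSV data is made up for illustration.

```python
from io import StringIO

import pandas as pd

csv_data = "a,b\n1,2\n3,4\n5,6\n"

with pd.read_csv(StringIO(csv_data), iterator=True) as reader:
    first = reader.get_chunk(2)  # DataFrame with the first 2 rows
    rest = reader.read()         # read() consumes the remaining rows

print(len(first), len(rest))  # 2 1
```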


.. currentmodule:: pandas

Clipboard
~~~~~~~~~
.. autosummary::
3 changes: 2 additions & 1 deletion pandas/io/__init__.py
@@ -6,8 +6,9 @@
from pandas.io import (
formats,
json,
parsers,
stata,
)

# mark only those modules as public
__all__ = ["formats", "json", "stata"]
__all__ = ["formats", "json", "parsers", "stata"]
27 changes: 26 additions & 1 deletion pandas/io/parsers/readers.py
@@ -1376,9 +1376,21 @@ def read_fwf(

class TextFileReader(abc.Iterator):
"""
Passed dialect overrides any of the related parser options.

Passed dialect overrides any of the related parser options
Iterator class used to process to text files read via read_csv in
chunks.

An instance of this class is returned by `read_csv` when it is processed in
chunks, instead of returning a single `DataFrame`.

When iterating over `TextFileReader`, every item returned will be a DataFrame.

Examples
---------
>>> with pd.read_csv(..., iterator=True) as text_file_reader:
... for df in text_file_reader:
... ...
"""

def __init__(
@@ -1429,6 +1441,7 @@ def __init__(
self._engine = self._make_engine(f, self.engine)

def close(self) -> None:
"""Closes the file handle."""
if self.handles is not None:
self.handles.close()
self._engine.close()
@@ -1711,6 +1724,18 @@ def _failover_to_python(self) -> None:
raise AbstractMethodError(self)

def read(self, nrows: int | None = None) -> DataFrame:
"""
Reads the text file and stores the result in a DataFrame.

Parameters
----------
nrows: int, optional, default None
The number of rows to read in one go.

Returns
-------
DataFrame
"""
if self.engine == "pyarrow":
try:
# error: "ParserBase" has no attribute "read"