Commit 2d368bc

Merge pull request #9493 from kshedden/stata_read_chunk
ENH Read Stata dta file incrementally
2 parents f8fd05d + 709c034 commit 2d368bc

5 files changed, +570 -219 lines changed

doc/source/io.rst (+26 -7)

@@ -3821,22 +3821,41 @@ outside of this range, the variable is cast to ``int16``.
 Reading from Stata format
 ~~~~~~~~~~~~~~~~~~~~~~~~~

-The top-level function ``read_stata`` will read a dta files
-and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader`
-can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
-reads the header of the dta file at initialization. The method
-:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.
+The top-level function ``read_stata`` will read a dta file and return
+either a DataFrame or a :class:`~pandas.io.stata.StataReader` that can
+be used to read the file incrementally.

 .. ipython:: python

    pd.read_stata('stata.dta')

+.. versionadded:: 0.16.0
+
+Specifying a ``chunksize`` yields a
+:class:`~pandas.io.stata.StataReader` instance that can be used to
+read ``chunksize`` lines from the file at a time. The ``StataReader``
+object can be used as an iterator.
+
+   reader = pd.read_stata('stata.dta', chunksize=1000)
+   for df in reader:
+       do_something(df)
+
+For more fine-grained control, use ``iterator=True`` and specify
+``chunksize`` with each call to
+:func:`~pandas.io.stata.StataReader.read`.
+
+.. ipython:: python
+
+   reader = pd.read_stata('stata.dta', iterator=True)
+   chunk1 = reader.read(10)
+   chunk2 = reader.read(20)
+
 Currently the ``index`` is retrieved as a column.

 The parameter ``convert_categoricals`` indicates whether value labels should be
 read and used to create a ``Categorical`` variable from them. Value labels can
-also be retrieved by the function ``variable_labels``, which requires data to be
-called before use (see ``pandas.io.stata.StataReader``).
+also be retrieved by the function ``value_labels``, which requires :func:`~pandas.io.stata.StataReader.read`
+to be called before use.

 The parameter ``convert_missing`` indicates whether missing value
 representations in Stata should be preserved. If ``False`` (the default),

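The two incremental-reading modes documented in the io.rst change can be tried end to end with a throwaway file. The following is a minimal sketch, not part of the commit: the file name ``example.dta``, the column names, and the row counts are made up for illustration, and the reader is used as a context manager, which recent pandas releases expect but the documentation text in this commit does not show.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: write a small .dta file so the example is self-contained.
df = pd.DataFrame({"x": range(25), "y": [i / 2 for i in range(25)]})
path = os.path.join(tempfile.mkdtemp(), "example.dta")
df.to_stata(path, write_index=False)

# chunksize: the returned StataReader yields fixed-size DataFrames.
with pd.read_stata(path, chunksize=10) as reader:
    chunk_lengths = [len(chunk) for chunk in reader]

# iterator=True: the caller picks the number of rows on each read() call.
with pd.read_stata(path, iterator=True) as reader:
    first = reader.read(5)
    second = reader.read(7)
```

Each call to ``read`` continues where the previous one stopped, so ``second`` holds the seven rows that follow the five in ``first``.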
doc/source/release.rst (+2)

@@ -55,6 +55,8 @@ performance improvements along with a large number of bug fixes.

 Highlights include:

+- Allow Stata files to be read incrementally, support for long strings in Stata files (issue:`9493`:) :ref:`here<io.stata_reader>`.
+
 See the :ref:`v0.16.0 Whatsnew <whatsnew_0160>` overview or the issue tracker on GitHub for an extensive list
 of all API changes, enhancements and bugs that have been fixed in 0.16.0.

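The revised io.rst text says that ``value_labels`` should be called after ``read``. A minimal sketch of that order of calls follows, assuming a made-up categorical column named ``grade`` and a temporary file; it relies on ``to_stata`` storing a categorical column as integer codes with a value label named after the column, which is how current pandas behaves but is not stated in this commit.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: a categorical column is written as labelled integers.
grades = pd.Categorical(["low", "high", "low"], categories=["low", "high"])
df = pd.DataFrame({"grade": grades})
path = os.path.join(tempfile.mkdtemp(), "labels.dta")
df.to_stata(path, write_index=False)

# Keep the raw integer codes so the labels can be inspected separately.
with pd.read_stata(path, iterator=True, convert_categoricals=False) as reader:
    raw = reader.read()             # read the observations first
    labels = reader.value_labels()  # then fetch the label mappings
```

Here ``labels`` is a dictionary keyed by label name, and each entry maps an integer code to its text label.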