Commit 709c034

Read Stata file incrementally
Remove testing code
Use partition in null_terminate
Manage warnings better in test
Further warning management in testing; add skip_data argument
Major refactoring to address code review
Fix strl reading, templatize docstrings
Fix bug in attaching docstring
Add new test file
Add release note
Call read instead of data when calling pandas.read_stata
various small issues following code review
Improve performance of %td processing
Docs edit (minor)
1 parent c88b0ba commit 709c034

5 files changed, +570 -219 lines changed

doc/source/io.rst

+26 -7
@@ -3821,22 +3821,41 @@ outside of this range, the variable is cast to ``int16``.
 Reading from Stata format
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The top-level function ``read_stata`` will read a dta files
-and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader`
-can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
-reads the header of the dta file at initialization. The method
-:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.
+The top-level function ``read_stata`` will read a dta file and return
+either a DataFrame or a :class:`~pandas.io.stata.StataReader` that can
+be used to read the file incrementally.
 
 .. ipython:: python
 
    pd.read_stata('stata.dta')
 
+.. versionadded:: 0.16.0
+
+Specifying a ``chunksize`` yields a
+:class:`~pandas.io.stata.StataReader` instance that can be used to
+read ``chunksize`` lines from the file at a time. The ``StataReader``
+object can be used as an iterator.
+
+   reader = pd.read_stata('stata.dta', chunksize=1000)
+   for df in reader:
+       do_something(df)
+
+For more fine-grained control, use ``iterator=True`` and specify
+``chunksize`` with each call to
+:func:`~pandas.io.stata.StataReader.read`.
+
+.. ipython:: python
+
+   reader = pd.read_stata('stata.dta', iterator=True)
+   chunk1 = reader.read(10)
+   chunk2 = reader.read(20)
+
 Currently the ``index`` is retrieved as a column.
 
 The parameter ``convert_categoricals`` indicates whether value labels should be
 read and used to create a ``Categorical`` variable from them. Value labels can
-also be retrieved by the function ``variable_labels``, which requires data to be
-called before use (see ``pandas.io.stata.StataReader``).
+also be retrieved by the function ``value_labels``, which requires :func:`~pandas.io.stata.StataReader.read`
+to be called before use.
 
 The parameter ``convert_missing`` indicates whether missing value
 representations in Stata should be preserved. If ``False`` (the default),
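
The two incremental reading modes documented in the hunk above can be tried end to end with a small round trip. The following is only a sketch, not part of the commit: the temporary file name, the column names ``x`` and ``y``, and the chunk sizes are made up for illustration, and it assumes a pandas version in which ``StataReader`` can be used as a context manager.

```python
import os
import tempfile

import pandas as pd

# Write a small frame to a temporary .dta file so the example does not
# depend on an existing 'stata.dta' (file and column names are hypothetical).
frame = pd.DataFrame({"x": list(range(10)), "y": [i / 2 for i in range(10)]})
path = os.path.join(tempfile.mkdtemp(), "example.dta")
frame.to_stata(path, write_index=False)

# Mode 1: a fixed chunksize makes read_stata return a StataReader that
# yields one DataFrame per chunk when iterated.
chunk_lengths = []
with pd.read_stata(path, chunksize=4) as reader:
    for chunk in reader:
        chunk_lengths.append(len(chunk))

# Mode 2: iterator=True returns the reader without a fixed chunk size;
# each read(n) call returns the next n observations.
with pd.read_stata(path, iterator=True) as reader:
    first = reader.read(3)
    second = reader.read(5)

print(chunk_lengths)
print(first)
print(second)
```

Successive ``read`` calls continue where the previous one stopped, so ``second`` starts at the fourth observation rather than rereading the beginning of the file.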

doc/source/release.rst

+2
@@ -55,6 +55,8 @@ performance improvements along with a large number of bug fixes.
 
 Highlights include:
 
+- Allow Stata files to be read incrementally, support for long strings in Stata files (issue:`9493`:) :ref:`here<io.stata_reader>`.
+
 See the :ref:`v0.16.0 Whatsnew <whatsnew_0160>` overview or the issue tracker on GitHub for an extensive list
 of all API changes, enhancements and bugs that have been fixed in 0.16.0.
 
