CLN/FIX/PERF: Don't buffer entire Stata file into memory #49228
Conversation
(force-pushed 1eead96 → c785c2c)
(force-pushed 1c76fd3 → bf2b281)
(force-pushed d923fa3 → c82d013)
Could I ask @bashtage to review this? I'll take care of the handful of test failures once some reviews are in – they seem to be related to the fact that we sometimes get the actual leak ResourceWarning rather than the one we raise by hand...
Overall looks very good. I think there are some places where it could be simplified, and most (all) of the properties are unnecessary.
(force-pushed f5c43ae → 1a9f932)
(force-pushed 1a9f932 → 233a5e3)
```diff
 if (self._nobs == 0) and (nrows is None):
     self._can_read_value_labels = True
     self._data_read = True
     self.close()
-    return DataFrame(columns=self.varlist)
+    return DataFrame(columns=self._varlist)
```
Hi! I'm very excited about these improvements, so thanks so much! Not sure if this is something that would make sense to tackle in this pull request or as a separate follow-up, but I think there is a bug here in the way the StataReader class handles reading DTA files with zero rows.

If the user has specified columns, I think this should actually return `DataFrame(columns=columns)` rather than `DataFrame(columns=self._varlist)`.

I think there's some validation of the specified columns in `_do_select_columns` that might need to be moved, to confirm that the columns specified are actually present in the empty DTA file.

No problem at all if this isn't something that makes sense to address here; just thought I'd mention it :)
There's another potential change here though I'm not sure I'd characterize it as a bug or not.
Separate from whether you respect the user specified columns, if you read an empty DTA file you get back an empty dataframe where all of the columns (currently self._varlist) are object dtype columns.
Even empty DTA files are typed though, and an alternative would be to map the DTA types to pandas types even if the DataFrame is empty.
I'm not as sure what the proper behavior is in this case, though. There's a similar choice when reading data from databases, where I think if you run a query that returns zero rows, all of the columns will be object dtype as well.
Anyway once again feel free to ignore and thanks for the improvements!
https://pandas.pydata.org/docs/user_guide/io.html#reading-tables
Thanks for the comments! I think these should be addressed in a separate issue and PR later on - might be worth it to create the issue at this point (you could use the "Reference in new issue" tool in the comments' action menu). 😄
There's another potential change here though I'm not sure I'd characterize it as a bug or not.
Separate from whether you respect the user specified columns, if you read an empty DTA file you get back an empty dataframe where all of the columns (currently self._varlist) are object dtype columns.
Even empty DTA files are typed though, and an alternative would be to map the DTA types to pandas types even if the DataFrame is empty.
This is clearly an enhancement that is separate from this PR. Seems pretty harmless to do this, although I'm not sure what the upside is other than to roughly express some metadata from the DTA into pandas. It could perhaps be more useful when iterating, although precise column types are not usually guaranteed there.
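To make the enhancement concrete, here's a minimal sketch of building an empty but typed DataFrame from a DTA-type-to-pandas-dtype mapping. The `dta_dtypes` dict is hypothetical and stands in for whatever the reader's parsed header would provide:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from DTA column types to pandas dtypes; in the
# real reader this would come from the parsed DTA header, not a dict.
dta_dtypes = {"a": np.dtype("float64"), "b": np.dtype("int32")}

# Building the empty frame column-by-column preserves per-column dtypes,
# unlike DataFrame(columns=[...]), which yields all-object columns.
empty = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in dta_dtypes.items()}
)

assert len(empty) == 0
print(empty.dtypes.tolist())  # [dtype('float64'), dtype('int32')]
```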
Hi! I'm very excited about these improvements so thanks so much! Not sure if this is something that would make sense to tackle in this pull request or as a separate follow-up but I think there is a bug here in the way that the StataReader class handles reading DTA files with zero rows.
If the user has specified columns, I think that this should actually return DataFrame(columns=columns) rather than DataFrame(columns=self._varlist).
This is not an issue at the point where you are looking at the code. We read the entire file in and only then drop the columns. The DTA file format uses row-wise storage (as opposed to more modern formats like Parquet, which use column stores), so selecting individual columns efficiently is not that simple.
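As a rough sketch of why column selection is awkward in a row-wise format — the record layout below is purely illustrative, not the actual DTA layout:

```python
import struct
from io import BytesIO

# Each record packs all of its columns contiguously (an int32 "a" and a
# float64 "b" here; the layout is illustrative, not real DTA encoding).
rows = [(1, 2.5), (2, 3.5), (3, 4.5)]
buf = BytesIO(b"".join(struct.pack("<id", *row) for row in rows))

# To extract just column "b", we still have to walk every row and decode
# (or at least skip past) the fields we don't want.
record = struct.Struct("<id")
col_b = []
while chunk := buf.read(record.size):
    _a, b = record.unpack(chunk)
    col_b.append(b)

print(col_b)  # [2.5, 3.5, 4.5]
```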
Agreed that it makes sense to handle this in a separate PR once this is merged.
I actually opened an issue about both of these earlier but haven't had a chance to work on it. I think it will be more straightforward with these improvements, so I'm happy to try to tackle it myself.

This is not an issue at the point where you are looking at the code. We read the entire file in and only then drop the columns.

What I consider surprising is that if the DTA file has zero rows, columns is completely ignored. I'd expect the behavior to be the same as what you get with `usecols` in `read_csv`.
```python
import pandas as pd

df = pd.DataFrame(data={"a": ["a", "b", "c"], "b": [1, 2, 3], "c": [4.0, 5.0, 6.0]})
df.head(0).to_stata("empty.dta", write_index=False)
df.head(0).to_csv("empty.csv", index=False)

# read_csv respects usecols
assert list(pd.read_csv("empty.csv", usecols=["a"])) == ["a"]

# read_csv raises an error if usecols asks for a nonexistent column
try:
    pd.read_csv("empty.csv", usecols=["x"])
except ValueError as exc:
    print("read_csv raises ValueError as expected")
    print(exc.args[0])

# read_stata returns all the columns either way
assert (
    list(pd.read_stata("empty.dta", columns=["a"]))
    == list(pd.read_stata("empty.dta", columns=["x"]))
    == ["a", "b", "c"]
)
```
The DTA file format uses a row-wise storage (as opposed to more modern formats like parquet which use column stores) and so selecting individual columns efficiently is not that simple.
Agreed about it not being so simple. The use case where I came across these issues involves iterating over a directory of DTA files, some of which are very large and some of which might be empty, where you want a small subset of the columns. In that scenario it's convenient to have:
- consistent types across chunks
- consistent types even if the DTA file is empty
- respect for the `columns` parameter even if the file is empty
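That workflow can be sketched roughly like this; the file name is made up, and a tiny DTA file is written first just so the snippet is self-contained:

```python
import pandas as pd

# Write a tiny DTA file standing in for one file from the directory.
pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]}).to_stata(
    "demo.dta", write_index=False
)

# Read a column subset in chunks, using the reader as a context manager
# (the usage pattern this PR's ResourceWarning nudges people toward).
total_rows = 0
with pd.read_stata("demo.dta", columns=["a"], chunksize=2) as reader:
    for chunk in reader:
        assert list(chunk.columns) == ["a"]
        total_rows += len(chunk)

print(total_rows)  # 3
```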
(force-pushed e1fa0a6 → 5167079)
@bashtage Could you give this another look? I rebased to fix the changelog conflict.
(force-pushed b440aff → f4c127c)
Rebased once again. @bashtage, could you take a look so we could get this in? 😄 Thanks!
(force-pushed f4c127c → 2388767)
2 working changes, and one more thought about the case for using the `_ensure_open` pattern vs. just opening during `__init__`. I don't see much of a case for deferred opening, but maybe I'm missing something. There would be some reduction in complexity if opening was always executed eagerly.
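For context, the lazy-open pattern under discussion might look roughly like this. The class name, warning text, and structure are illustrative, not pandas' actual StataReader code:

```python
import warnings


class LazyReader:
    """Minimal sketch of an ``_ensure_open``-style lazy-open pattern.

    Illustrative only; not pandas' actual implementation.
    """

    def __init__(self, path: str) -> None:
        # __init__ only records state; the file is NOT opened here.
        self._path = path
        self._handle = None
        self._entered = False

    def __enter__(self) -> "LazyReader":
        self._entered = True
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()

    def _ensure_open(self) -> None:
        # Open lazily on first use; warn if the caller skipped the
        # `with` block, since the handle may then leak.
        if self._handle is None:
            if not self._entered:
                warnings.warn(
                    "LazyReader used without a with block; the file may leak",
                    ResourceWarning,
                    stacklevel=2,
                )
            self._handle = open(self._path, "rb")

    def read(self, n: int = -1) -> bytes:
        self._ensure_open()
        return self._handle.read(n)

    def close(self) -> None:
        if self._handle is not None:
            self._handle.close()
            self._handle = None
```

Opening eagerly in `__init__` would remove the `_ensure_open` calls, at the cost of losing the warning for callers who never enter the `with` block.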
(force-pushed 2388767 → ea1caba)
(force-pushed ea1caba → 374f316)
@bashtage Rebased. Could you take another look, please?
(force-pushed 5e127a4 → 2e21cb5)
now that the rc is cut, this should target 2.1? backports to 2.0.x should now ideally only be regressions/bugs reported during the rc period?
I was thinking since this PR had been ready & forgotten for a bit while we were cleaning up 1.x deprecations that it would be okay to have this in the official 2.0 release. Additionally, it includes a deprecation that would be nice to show to users in 2.0 rather than 2.1.
@mroeschke Tests seem to be mostly green, and the failures... seem to be spurious and not related to Stata?
since this is isolated to stata functionality then no objection if you want to target 2.0
I agree that it is ready and would be good to have in 2.0.
@mroeschke Do you still have any remaining issues that need to be addressed? If not, could you change to approval?
Yeah, I don't think this was addressed: #49228 (comment). Could you add a commit with these changes?
(force-pushed 2e21cb5 → 0c0c26a)
(force-pushed 0c0c26a → b71a0bc)
Thanks for your patience and effort @akx. This will make it into 2.0 when it's released in a few weeks.
… Stata file into memory) (#51600) Backport PR #49228: CLN/FIX/PERF: Don't buffer entire Stata file into memory Co-authored-by: Aarni Koskela <[email protected]>
@mroeschke Thank you and @bashtage for seeing this through :) Have a great weekend! 🥂
Thanks for noticing this issue and working through the concerns.
Fixes #48700
Closes #48922 (supersedes it)
Refs #9245
Refs #37639
Regressed in 6d1541e
This PR is a reworking of my earlier #48922, following @bashtage's suggestions. Unfortunately I couldn't quite make it fit in about 15 lines.
Compared to that PR, StataReader required some refactoring to have it behave as it has done before (see the discussion in the other PR) while also making it possible for it to raise a warning when data is being read without a `with` block being in use. (This does not take into account a user having code that doesn't use `with` but that calls `.close()` by hand.)

The main refactoring is that all of the internal state of the reader object is now `_private`, and public bits are accessed via properties; this allows us to make sure the data is actually read when requested, even without reading happening in `__init__`. (Accessing those properties when the StataReader hasn't been `__enter__`ed will thus also raise a ResourceWarning.) As a bonus, the properties are now well-typed. Previously, e.g. `reader.time_stamp` had to rely on inference...

While renaming those properties, I noticed there was a bunch of repeated `struct.unpack(...read...)[0]` code that would have been `black`-reformatted into very unwieldy lines following the addition of the underscores, so I decided to refactor the repeated code into smaller reading utility functions.

Finally, there's basically the same fix I made in #48922: if the underlying stream is seekable, we can use it directly.
I also added a commit that removes the implicit automatic closing of the underlying stream, since it shouldn't be necessary anymore. If that should be removed from this PR, let me know.