Skip to content

CLN/FIX/PERF: Don't buffer entire Stata file into memory #49228

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 23, 2023

Conversation

akx
Copy link
Contributor

@akx akx commented Oct 21, 2022

Fixes #48700
Closes #48922 (supersedes it)
Refs #9245
Refs #37639
Regressed in 6d1541e

  • closes StataReader processes whole file before reading in chunks #48700
  • Tests added and passed if fixing a bug or adding a new feature
    • The existing tests for e.g. roundtripping zstandard-compressed Stata files test this code path.
    • Added a test that checks e.g. a fp or BytesIO passed in is the same object the reader reads.
    • Added a test that ensures a warning is raised if StataReader isn't being used as a context manager.
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR is a reworking of my earlier #48922, following @bashtage's suggestions. Unfortunately I couldn't quite make it fit in about 15 lines.

Compared to that PR, StataReader required some refactoring to have it behave as it has done before (see the discussion in the other PR) while also making it possible for it to raise a warning when data is being read without a with block being in use. (This does not take into account an user having code that doesn't use with, but that calls .close() by hand.)

The main refactoring is that all of the internal state of the reader object is now _private, and public bits are accessed via properties; this allows us to make sure the data is actually read when requested, even without reading happening in __init__. (Accessing those properties when the StataReader hasn't been __enter__ed will thus also raise a ResourceWarning.) As a bonus, the properties are now well-typed. Previously, e.g. reader.time_stamp had to rely on inference...

While renaming those properties, I noticed there was a bunch of repeated struct.unpack(...read...)[0] code that would have been black reformatted into very unwieldy lines following the addition of the underscores, so I decided to refactor the repeated code into smaller reading utility functions.

Finally, there's basically the same fix I made in #48922 – if the underlying stream is seekable, we can use it directly.

I also added a commit that removes the implicit automatic closing of the underlying stream, since it shouldn't be necessary anymore. If that should be removed from this PR, let me know.

@akx akx force-pushed the stata-no-memory-v2 branch from 1eead96 to c785c2c Compare October 21, 2022 12:51
@akx akx mentioned this pull request Oct 21, 2022
4 tasks
@akx akx force-pushed the stata-no-memory-v2 branch 2 times, most recently from 1c76fd3 to bf2b281 Compare October 21, 2022 14:18
@mroeschke mroeschke added the IO Stata read_stata, to_stata label Oct 21, 2022
@akx akx force-pushed the stata-no-memory-v2 branch 3 times, most recently from d923fa3 to c82d013 Compare October 25, 2022 13:25
@akx
Copy link
Contributor Author

akx commented Oct 27, 2022

Could I ask @bashtage to review this? I'll take care of the handful of test failures once there's some reviews in – they seem to be related to the fact that we sometimes get the actual leak ResourceWarning and not the one we raise by hand...

@mroeschke mroeschke requested a review from bashtage October 31, 2022 19:02
Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks very good. I think there are some places where it could be simplified, and most (all) of the properties are unnecessary.

@akx akx force-pushed the stata-no-memory-v2 branch 3 times, most recently from f5c43ae to 1a9f932 Compare November 2, 2022 13:01
@akx akx requested a review from bashtage November 2, 2022 13:04
@akx akx force-pushed the stata-no-memory-v2 branch from 1a9f932 to 233a5e3 Compare November 3, 2022 17:32
Comment on lines +1711 to +1709
if (self._nobs == 0) and (nrows is None):
self._can_read_value_labels = True
self._data_read = True
self.close()
return DataFrame(columns=self.varlist)
return DataFrame(columns=self._varlist)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I'm very excited about these improvements so thanks so much! Not sure if this is something that would make sense to tackle in this pull request or as a separate follow-up but I think there is a bug here in the way that the StataReader class handles reading DTA files with zero rows.

If the user has specified columns, I think that this should actually return DataFrame(columns=columns) rather than DataFrame(columns=self._varlist).

I think there's some validation of the specified columns that is in _do_select_columns that might need to be moved to confirm that the columns specified are actually present in the empty dta file.

No problem at all if this isn't something that makes sense to address here just thought I'd mention it :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another potential change here though I'm not sure I'd characterize it as a bug or not.

Separate from whether you respect the user specified columns, if you read an empty DTA file you get back an empty dataframe where all of the columns (currently self._varlist) are object dtype columns.

Even empty DTA files are typed though, and an alternative would be to map the DTA types to pandas types even if the DataFrame is empty.

I'm not as sure about what the proper behavior in this case though. There's a similar choice when reading data from databases where I think if you run a query that returns zero rows all of the columns will be object dtype as well.

Anyway once again feel free to ignore and thanks for the improvements!

https://pandas.pydata.org/docs/user_guide/io.html#reading-tables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments! I think these should be addressed in a separate issue and PR later on - might be worth it to create the issue at this point (you could use the "Reference in new issue" tool in the comments' action menu). 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another potential change here though I'm not sure I'd characterize it as a bug or not.

Separate from whether you respect the user specified columns, if you read an empty DTA file you get back an empty dataframe where all of the columns (currently self._varlist) are object dtype columns.

Even empty DTA files are typed though, and an alternative would be to map the DTA types to pandas types even if the DataFrame is empty.

This is clearly an enhancement that is separate from this PR. Seems pretty harmless to do this, although I'm not sure what the upside is other than to roughly express some metadata from the DTA into pandas. It could perhaps be more useful when iterating, although precise column types are not usually guaranteed there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I'm very excited about these improvements so thanks so much! Not sure if this is something that would make sense to tackle in this pull request or as a separate follow-up but I think there is a bug here in the way that the StataReader class handles reading DTA files with zero rows.

If the user has specified columns, I think that this should actually return DataFrame(columns=columns) rather than DataFrame(columns=self._varlist).

This is not an issue at the point where you are looking at the code. We read then entire file in and only then drop the columns. The DTA file format uses a row-wise storage (as opposed to more modern formats like parquet which use column stores) and so selecting individual columns efficiently is not that simple.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that it makes sense to handle this in a separate PR once this is merged.

I actually opened an issue about both of these earlier but haven't had a chance to work on it. I think it will be more straightforward we these improvements so I'm happy to try to tackle it myself.

#46240

This is not an issue at the point where you are looking at the code. We read then entire file in and only then drop the columns.

What I consider surprising is that if the DTA file has zero rows, columns is completely ignored. I'd expect the behavior to be the same as what you get with usecols in read_csv.

df = pd.DataFrame(data={"a": ["a", "b", "c"], "b": [1, 2, 3], "c": [4.0, 5.0, 6.0]})
df.head(0).to_stata("empty.dta", write_index=False)
df.head(0).to_csv("empty.csv", index=False)

# read_csv respects usecols
assert list(pd.read_csv("empty.csv", usecols=["a"])) == ["a"]

# read_csv raises error if usecols asks for a nonexistent column
try:
    pd.read_csv('empty.csv', usecols=["x"])
except ValueError as exc:
    print('read_csv raises ValueError as expected')
    print(exc.args[0])

# read_stata returns all the columns either way
assert (
    list(pd.read_stata("empty.dta", columns=["a"]))
    == list(pd.read_stata("empty.dta", columns=["x"]))
    == ['a', 'b', 'c']
)

The DTA file format uses a row-wise storage (as opposed to more modern formats like parquet which use column stores) and so selecting individual columns efficiently is not that simple.

Agreed about it not being so simple. The use case where I came across these issues involves iterating over a directory of DTA files, some of which are very large and some of which might be empty, where you want a small subset of the columns. In that scenario it's convenient to have:

  • consistent types across chunks
  • consistent types even if the DTA file is empty
  • respect the columns parameter even if the file is empty

@akx akx force-pushed the stata-no-memory-v2 branch from e1fa0a6 to 5167079 Compare November 7, 2022 18:26
@akx
Copy link
Contributor Author

akx commented Nov 7, 2022

@bashtage Could you give this another look? I rebased to fix the changelog conflict.

@akx akx force-pushed the stata-no-memory-v2 branch 2 times, most recently from b440aff to f4c127c Compare November 8, 2022 15:30
@akx
Copy link
Contributor Author

akx commented Nov 15, 2022

Rebased once again. @bashtage, could you take a look so we could get this in? 😄 Thanks!

@akx akx force-pushed the stata-no-memory-v2 branch from f4c127c to 2388767 Compare November 15, 2022 12:45
Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 working changes, and one more thought about the case for using the _ensure_open pattern vs just opening during __init__. I don't see much of a case of defer opening, but maybe missing something. There would be some reduction in complexity if it was always executed.

@akx akx force-pushed the stata-no-memory-v2 branch from 2388767 to ea1caba Compare November 15, 2022 16:01
@akx akx requested a review from bashtage November 15, 2022 16:23
@akx akx force-pushed the stata-no-memory-v2 branch from ea1caba to 374f316 Compare November 29, 2022 08:25
@akx
Copy link
Contributor Author

akx commented Nov 29, 2022

@bashtage Rebased. Could you take another look, please?

@akx akx force-pushed the stata-no-memory-v2 branch 2 times, most recently from 5e127a4 to 2e21cb5 Compare February 22, 2023 15:18
@simonjayhawkins
Copy link
Member

Thanks for rebasing. Looks like there's some failing tests, but once those are addressed this should be good to get into 2.0

now that the rc is cut, this should target 2.1? backports to 2.0.x should now ideally only be regressions/bugs reported during the rc period?

@mroeschke
Copy link
Member

I was thinking since this PR had been ready & forgotten for a bit while we were cleaning up 1.x deprecations that it would be okay to have this in the official 2.0 release. Additionally, it includes a deprecation that would be nice to show to users in 2.0 rather than 2.1

@akx
Copy link
Contributor Author

akx commented Feb 22, 2023

@mroeschke Tests seem to be mostly green and failures... seem to be spurious and not related to Stata?

@simonjayhawkins
Copy link
Member

I was thinking since this PR had been ready & forgotten for a bit while we were cleaning up 1.x deprecations that it would be okay to have this in the official 2.0 release. Additionally, it includes a deprecation that would be nice to show to users in 2.0 rather than 2.1

since this is isolated to stata functionality then no objection if you want to target 2.0

@bashtage
Copy link
Contributor

I agree that it is ready and would be good to have in 2.0.

@bashtage
Copy link
Contributor

@mroeschke Do you still have any remaining issues that need to be addressed? If not, could you change to approval?

@mroeschke
Copy link
Member

Yeah I don't think this was addressed #49228 (comment)

Could you add a commit with these changes?

diff --git a/doc/source/whatsnew/v2.0.0.rst b/doc/source/whatsnew/v2.0.0.rst
index 0445465507..987b408b77 100644
--- a/doc/source/whatsnew/v2.0.0.rst
+++ b/doc/source/whatsnew/v2.0.0.rst
@@ -855,6 +855,7 @@ Deprecations
 - Deprecated :meth:`Series.backfill` in favor of :meth:`Series.bfill` (:issue:`33396`)
 - Deprecated :meth:`DataFrame.pad` in favor of :meth:`DataFrame.ffill` (:issue:`33396`)
 - Deprecated :meth:`DataFrame.backfill` in favor of :meth:`DataFrame.bfill` (:issue:`33396`)
+- Deprecated :meth:`~pandas.io.stata.StataReader.close`. Use :class:`~pandas.io.stata.StataReader` as a context manager instead (:issue:`49228`)

 .. ---------------------------------------------------------------------------
 .. _whatsnew_200.prior_deprecations:
diff --git a/pandas/io/stata.py b/pandas/io/stata.py
index b3e7b186cb..4bf69f167f 100644
--- a/pandas/io/stata.py
+++ b/pandas/io/stata.py
@@ -1233,7 +1233,7 @@ class StataReader(StataParser, abc.Iterator):
             "will be removed in a future version without notice. "
             "Using StataReader as a context manager is the only supported method.",
             FutureWarning,
-            stacklevel=2,
+            stacklevel=find_stack_level(),
         )
         if self._close_file:
             self._close_file()

@mroeschke mroeschke added this to the 2.0 milestone Feb 22, 2023
@akx akx force-pushed the stata-no-memory-v2 branch from 2e21cb5 to 0c0c26a Compare February 23, 2023 14:17
@akx akx force-pushed the stata-no-memory-v2 branch from 0c0c26a to b71a0bc Compare February 23, 2023 14:18
@akx akx requested a review from mroeschke February 23, 2023 14:18
@mroeschke
Copy link
Member

Thanks for your patience and effort @akx. This will make it into 2.0 when it's released in a few weeks

phofl pushed a commit that referenced this pull request Feb 24, 2023
… Stata file into memory) (#51600)

Backport PR #49228: CLN/FIX/PERF: Don't buffer entire Stata file into memory

Co-authored-by: Aarni Koskela <[email protected]>
@akx
Copy link
Contributor Author

akx commented Feb 24, 2023

@mroeschke Thank you and @bashtage for seeing this through :) Have a great weekend! 🥂

@bashtage
Copy link
Contributor

bashtage commented Feb 24, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StataReader processes whole file before reading in chunks
5 participants