CLN/FIX/PERF: Don't buffer entire Stata file into memory #49228
Conversation
(force-pushed 1eead96 → c785c2c)
(force-pushed 1c76fd3 → bf2b281)
(force-pushed d923fa3 → c82d013)
Could I ask @bashtage to review this? I'll take care of the handful of test failures once some reviews are in – they seem to be related to the fact that we sometimes get the actual leak ResourceWarning rather than the one we raise by hand...
Overall looks very good. I think there are some places where it could be simplified, and most (all) of the properties are unnecessary.
(force-pushed f5c43ae → 1a9f932)
(force-pushed 1a9f932 → 233a5e3)
```diff
 if (self._nobs == 0) and (nrows is None):
     self._can_read_value_labels = True
     self._data_read = True
     self.close()
-    return DataFrame(columns=self.varlist)
+    return DataFrame(columns=self._varlist)
```
Hi! I'm very excited about these improvements, so thanks so much! Not sure if this is something that would make sense to tackle in this pull request or as a separate follow-up, but I think there is a bug here in the way the StataReader class handles reading DTA files with zero rows.

If the user has specified columns, I think this should actually return `DataFrame(columns=columns)` rather than `DataFrame(columns=self._varlist)`.

I think there's some validation of the specified columns in `_do_select_columns` that might need to be moved, to confirm that the columns specified are actually present in the empty DTA file.

No problem at all if this isn't something that makes sense to address here; just thought I'd mention it :)
There's another potential change here though I'm not sure I'd characterize it as a bug or not.
Separate from whether you respect the user specified columns, if you read an empty DTA file you get back an empty dataframe where all of the columns (currently self._varlist) are object dtype columns.
Even empty DTA files are typed though, and an alternative would be to map the DTA types to pandas types even if the DataFrame is empty.
I'm not as sure what the proper behavior is in this case, though. There's a similar choice when reading data from databases, where I think if you run a query that returns zero rows, all of the columns will be object dtype as well.
Anyway once again feel free to ignore and thanks for the improvements!
https://pandas.pydata.org/docs/user_guide/io.html#reading-tables
Thanks for the comments! I think these should be addressed in a separate issue and PR later on - might be worth it to create the issue at this point (you could use the "Reference in new issue" tool in the comments' action menu). 😄
There's another potential change here though I'm not sure I'd characterize it as a bug or not.
Separate from whether you respect the user specified columns, if you read an empty DTA file you get back an empty dataframe where all of the columns (currently self._varlist) are object dtype columns.
Even empty DTA files are typed though, and an alternative would be to map the DTA types to pandas types even if the DataFrame is empty.
This is clearly an enhancement that is separate from this PR. Seems pretty harmless to do this, although I'm not sure what the upside is other than to roughly express some metadata from the DTA into pandas. It could perhaps be more useful when iterating, although precise column types are not usually guaranteed there.
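To make the enhancement concrete, here's a minimal sketch of building an empty but typed DataFrame from a DTA-type-to-pandas-dtype mapping. The `dta_dtypes` dict is hypothetical and stands in for whatever the reader's parsed header would provide:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from DTA column types to pandas dtypes; in the
# real reader this would come from the parsed DTA header, not a dict.
dta_dtypes = {"a": np.dtype("float64"), "b": np.dtype("int32")}

# Building the empty frame column-by-column preserves per-column dtypes,
# unlike DataFrame(columns=[...]), which yields all-object columns.
empty = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in dta_dtypes.items()}
)

assert len(empty) == 0
print(empty.dtypes.tolist())  # [dtype('float64'), dtype('int32')]
```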
Hi! I'm very excited about these improvements so thanks so much! Not sure if this is something that would make sense to tackle in this pull request or as a separate follow-up but I think there is a bug here in the way that the StataReader class handles reading DTA files with zero rows.
If the user has specified columns, I think that this should actually return DataFrame(columns=columns) rather than DataFrame(columns=self._varlist).
This is not an issue at the point where you are looking at the code. We read the entire file in and only then drop the columns. The DTA file format uses row-wise storage (as opposed to more modern formats like Parquet, which use column stores), so selecting individual columns efficiently is not that simple.
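As a rough sketch of why column selection is awkward in a row-wise format — the record layout below is purely illustrative, not the actual DTA layout:

```python
import struct
from io import BytesIO

# Each record packs all of its columns contiguously (an int32 "a" and a
# float64 "b" here; the layout is illustrative, not real DTA encoding).
rows = [(1, 2.5), (2, 3.5), (3, 4.5)]
buf = BytesIO(b"".join(struct.pack("<id", *row) for row in rows))

# To extract just column "b", we still have to walk every row and decode
# (or at least skip past) the fields we don't want.
record = struct.Struct("<id")
col_b = []
while chunk := buf.read(record.size):
    _a, b = record.unpack(chunk)
    col_b.append(b)

print(col_b)  # [2.5, 3.5, 4.5]
```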
Agreed that it makes sense to handle this in a separate PR once this is merged.
I actually opened an issue about both of these earlier but haven't had a chance to work on it. I think it will be more straightforward with these improvements, so I'm happy to try to tackle it myself.

This is not an issue at the point where you are looking at the code. We read the entire file in and only then drop the columns.

What I consider surprising is that if the DTA file has zero rows, columns is completely ignored. I'd expect the behavior to be the same as what you get with `usecols` in `read_csv`.
```python
import pandas as pd

df = pd.DataFrame(data={"a": ["a", "b", "c"], "b": [1, 2, 3], "c": [4.0, 5.0, 6.0]})
df.head(0).to_stata("empty.dta", write_index=False)
df.head(0).to_csv("empty.csv", index=False)

# read_csv respects usecols
assert list(pd.read_csv("empty.csv", usecols=["a"])) == ["a"]

# read_csv raises an error if usecols asks for a nonexistent column
try:
    pd.read_csv("empty.csv", usecols=["x"])
except ValueError as exc:
    print("read_csv raises ValueError as expected")
    print(exc.args[0])

# read_stata returns all the columns either way
assert (
    list(pd.read_stata("empty.dta", columns=["a"]))
    == list(pd.read_stata("empty.dta", columns=["x"]))
    == ["a", "b", "c"]
)
```
The DTA file format uses a row-wise storage (as opposed to more modern formats like parquet which use column stores) and so selecting individual columns efficiently is not that simple.
Agreed about it not being so simple. The use case where I came across these issues involves iterating over a directory of DTA files, some of which are very large and some of which might be empty, where you want a small subset of the columns. In that scenario it's convenient to have:
- consistent types across chunks
- consistent types even if the DTA file is empty
- respect for the `columns` parameter even if the file is empty
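That workflow can be sketched roughly like this; the file name is made up, and a tiny DTA file is written first just so the snippet is self-contained:

```python
import pandas as pd

# Write a tiny DTA file standing in for one file from the directory.
pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]}).to_stata(
    "demo.dta", write_index=False
)

# Read a column subset in chunks, using the reader as a context manager
# (the usage pattern this PR's ResourceWarning nudges people toward).
total_rows = 0
with pd.read_stata("demo.dta", columns=["a"], chunksize=2) as reader:
    for chunk in reader:
        assert list(chunk.columns) == ["a"]
        total_rows += len(chunk)

print(total_rows)  # 3
```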
(force-pushed e1fa0a6 → 5167079)
@bashtage Could you give this another look? I rebased to fix the changelog conflict.
(force-pushed b440aff → f4c127c)
Rebased once again. @bashtage, could you take a look so we could get this in? 😄 Thanks!
(force-pushed f4c127c → 2388767)
2 working changes, and one more thought about the case for using the `_ensure_open` pattern vs. just opening during `__init__`. I don't see much of a case for deferred opening, but maybe I'm missing something. There would be some reduction in complexity if opening was always executed eagerly.
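For context, the lazy-open pattern under discussion might look roughly like this. The class name, warning text, and structure are illustrative, not pandas' actual StataReader code:

```python
import warnings


class LazyReader:
    """Minimal sketch of an ``_ensure_open``-style lazy-open pattern.

    Illustrative only; not pandas' actual implementation.
    """

    def __init__(self, path: str) -> None:
        # __init__ only records state; the file is NOT opened here.
        self._path = path
        self._handle = None
        self._entered = False

    def __enter__(self) -> "LazyReader":
        self._entered = True
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()

    def _ensure_open(self) -> None:
        # Open lazily on first use; warn if the caller skipped the
        # `with` block, since the handle may then leak.
        if self._handle is None:
            if not self._entered:
                warnings.warn(
                    "LazyReader used without a with block; the file may leak",
                    ResourceWarning,
                    stacklevel=2,
                )
            self._handle = open(self._path, "rb")

    def read(self, n: int = -1) -> bytes:
        self._ensure_open()
        return self._handle.read(n)

    def close(self) -> None:
        if self._handle is not None:
            self._handle.close()
            self._handle = None
```

Opening eagerly in `__init__` would remove the `_ensure_open` calls, at the cost of losing the warning for callers who never enter the `with` block.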
(force-pushed 2388767 → ea1caba)
(force-pushed ea1caba → 374f316)
@bashtage Rebased. Could you take another look, please?
(force-pushed 5e127a4 → 2e21cb5)
now that the rc is cut, this should target 2.1? backports to 2.0.x should now ideally only be regressions/bugs reported during the rc period?
I was thinking since this PR had been ready & forgotten for a bit while we were cleaning up 1.x deprecations that it would be okay to have this in the official 2.0 release. Additionally, it includes a deprecation that would be nice to show to users in 2.0 rather than 2.1.
@mroeschke Tests seem to be mostly green, and the failures... seem to be spurious and not related to Stata?
since this is isolated to stata functionality then no objection if you want to target 2.0
I agree that it is ready and would be good to have in 2.0.
@mroeschke Do you still have any remaining issues that need to be addressed? If not, could you change to approval?
Yeah, I don't think this was addressed: #49228 (comment). Could you add a commit with these changes?
(force-pushed 2e21cb5 → 0c0c26a)
(force-pushed 0c0c26a → b71a0bc)
Thanks for your patience and effort @akx. This will make it into 2.0 when it's released in a few weeks.
… Stata file into memory) (#51600) Backport PR #49228: CLN/FIX/PERF: Don't buffer entire Stata file into memory Co-authored-by: Aarni Koskela <[email protected]>
@mroeschke Thank you and @bashtage for seeing this through :) Have a great weekend! 🥂
Thanks for noticing this issue and working through the concerns.
Fixes #48700
Closes #48922 (supersedes it)
Refs #9245
Refs #37639
Regressed in 6d1541e
This PR is a reworking of my earlier #48922, following @bashtage's suggestions. Unfortunately I couldn't quite make it fit in about 15 lines.
Compared to that PR, StataReader required some refactoring to have it behave as it has done before (see the discussion in the other PR) while also making it possible for it to raise a warning when data is being read without a `with` block being in use. (This does not take into account a user having code that doesn't use `with` but that calls `.close()` by hand.)

The main refactoring is that all of the internal state of the reader object is now `_private`, and public bits are accessed via properties; this allows us to make sure the data is actually read when requested, even without reading happening in `__init__`. (Accessing those properties when the StataReader hasn't been `__enter__`ed will thus also raise a ResourceWarning.) As a bonus, the properties are now well-typed. Previously, e.g. `reader.time_stamp` had to rely on inference...

While renaming those properties, I noticed there was a bunch of repeated `struct.unpack(...read...)[0]` code that would have been `black`-reformatted into very unwieldy lines following the addition of the underscores, so I decided to refactor the repeated code into smaller reading utility functions.

Finally, there's basically the same fix I made in #48922: if the underlying stream is seekable, we can use it directly.
I also added a commit that removes the implicit automatic closing of the underlying stream, since it shouldn't be necessary anymore. If that should be removed from this PR, let me know.