StataReader processes whole file before reading in chunks #48700
Comments
Is that a behavior change you have noticed since 1.5, or did it also exist in previous versions? I think these particular lines of code have been around since 1.3, and even before that (I think) the logic was similar. I think the issue is that some IO-like objects are not seekable, but read_stata does a lot of seeking internally (some of the compression IO objects don't support seeking). We might be able to change the above line to read the file completely only if it isn't seekable. |
I don't think it changed in 1.5; I had noticed it with 1.4. I didn't look back to see when it was introduced or if it has always been there. I see the uses of seek in parsing the header, but it seems like it should be possible to avoid that. EDIT: Commented too soon. I think the suggestion to skip that when the file is seekable is simpler. |
Feel free to open a PR! I think the main change is:
self.handles = get_handle(...)
# seekable is a method, so it has to be called; the bare attribute is always truthy
if hasattr(self.handles.handle, "seekable") and self.handles.handle.seekable():
    self.path_or_buf = self.handles.handle
else:
    with self.handles:
        self.path_or_buf = BytesIO(self.handles.handle.read())
# and then appropriate code to close self.handles (and self.path_or_buf in case of BytesIO) |
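For anyone who wants to try the idea outside of pandas, here is a minimal standalone sketch of the "buffer only when not seekable" check. The names ensure_seekable and ForwardOnlyStream are made up for illustration and are not part of pandas; the real change would live inside StataReader and would also need to handle closing the handles.

```python
import io


def ensure_seekable(handle):
    """Return (stream, was_buffered).

    Hypothetical helper: if the handle can seek, use it directly;
    otherwise copy its contents into an in-memory BytesIO.
    """
    if hasattr(handle, "seekable") and handle.seekable():
        return handle, False
    return io.BytesIO(handle.read()), True


class ForwardOnlyStream(io.BufferedIOBase):
    """Toy stand-in for a stream that cannot seek (e.g. some compressed inputs)."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def read(self, size=-1):
        return self._inner.read(size)


# A seekable handle is passed through untouched.
plain = io.BytesIO(b"header+data")
stream, buffered = ensure_seekable(plain)
print(stream is plain, buffered)

# A non-seekable handle is fully read into memory once.
stream, buffered = ensure_seekable(ForwardOnlyStream(b"header+data"))
print(stream.seekable(), buffered)
```

With this shape, only the non-seekable case pays the cost of reading the whole file up front.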
I'll give it a shot over the weekend. Thanks! |
By the looks of it, 2f0ada3 was the commit that changed this behavior, way back when. (Came here via answering https://stackoverflow.com/a/73934594/51685 :-) ) |
Fixes pandas-dev#48700 Regressed in pandas-dev#9245 Regressed in 2f0ada3
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Amazing, thanks so much! |
I've noticed that when reading large Stata files using the chunksize parameter, the time it takes to create the StataReader object is affected by the size of the file. This is a bit surprising: all of the metadata it needs is contained in the file header, so it seems like it should take the same time regardless of the total file size.
I took a look at the code and it seems like the culprit is this line that reads the entire file into a BytesIO object before parsing the header. I'm not entirely sure what this accomplishes. Ideally it would be nice to be able to create the StataReader object after processing just the header portion of the file.
pandas/pandas/io/stata.py
Lines 1167 to 1175 in 71fc89c
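To illustrate why this matters, here is a toy sketch that counts how many bytes each approach pulls from the underlying stream. The file layout (a 4-byte magic plus a row count) is invented for the example and is not the real Stata format; CountingReader, read_header_eager and read_header_lazy are hypothetical names, not pandas code.

```python
import io
import struct

# Hypothetical toy header: 4-byte magic followed by a uint32 row count.
HEADER = struct.Struct("<4sI")


class CountingReader(io.BufferedIOBase):
    """Wrap a binary stream and count the bytes actually read from it."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def readable(self):
        return True

    def seekable(self):
        return self._raw.seekable()

    def seek(self, pos, whence=io.SEEK_SET):
        return self._raw.seek(pos, whence)

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data


def make_file(n_rows):
    # Header plus 8 bytes of dummy data per row.
    return HEADER.pack(b"TOY1", n_rows) + b"\x00" * (8 * n_rows)


def read_header_eager(stream):
    # Mirrors the pattern in question: copy everything, then parse.
    buf = io.BytesIO(stream.read())
    return HEADER.unpack(buf.read(HEADER.size))


def read_header_lazy(stream):
    # Read only the bytes the header needs.
    return HEADER.unpack(stream.read(HEADER.size))


def cost(parse, n_rows):
    stream = CountingReader(io.BytesIO(make_file(n_rows)))
    _magic, rows = parse(stream)
    assert rows == n_rows
    return stream.bytes_read


for n in (10, 1000):
    print(n, cost(read_header_eager, n), cost(read_header_lazy, n))
```

The eager variant's cost grows with the number of rows, while the lazy one stays at the header size.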