-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: move get_filepath_buffer into get_handle #37639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow this is a quite a lot of simplification. some questions and a couple requests for followup (don't add to this PR)
pandas/io/parsers.py
Outdated
|
||
if isinstance(src, list): | ||
self.handles = IOHandles( | ||
handle=src, compression={"method": None} # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that isn't nice (especially from a typing perspective). The parsers seems to cope with objects implementing __next__
(for example lists). I think a solution would be to have a new attribute for the parser (maybe self.iterable : Optional[Iteratable]
). If it is None, use self.handles otherwise use the iterable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this branch is only used by read_excel
and is only used for the PythonParser
. I typed src
and added comments about this special case.
I will try to fix this mess later this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the 'best' solution is to temporarily silence mypy and put a list in IOHandles
. PythonParser
(the only parser affected by that) had already a check whether the 'file handle' has readline
, if not (for the list from read_excel
) it treats it as raw data.
edit:
I found a better solution (well-typed and call get_handle
after all potential raise(...)
calls to avoid leaking file handles)
Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-11-13 06:00:27 UTC |
I made |
contains a "b": | ||
``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object | ||
opened binary mode. In most cases, it is not necessary to specify | ||
``mode`` as Pandas will auto-detect whether the file object is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet found an object that needs an explicit mode
self.ioargs.filepath_or_buffer = BytesIO(contents) # type: ignore[arg-type] | ||
self.ioargs.should_close = True | ||
self.path_or_buf = cast(BytesIO, self.ioargs.filepath_or_buffer) | ||
contents = handles.handle.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why that is done. The file handle should always be in binary mode. If it wasn't in binary mode read
would return a string which BytesIO
would complain about. Maybe there is a performance reason (seeking faster in BytesIO
than the provided buffer, just speculating)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, this is way cleaner that before even. once more merge master and ping on green.
thanks @twoertwein very nice! |
@twoertwein I have a question about the code you added in this PR. In #31817 I'm adding a new csv reader engine (pyarrow) to which I have to provide class BytesIOWrapper:
"""
Allows the pyarrow engine for read_csv() to read from string buffers
"""
def __init__(self, string_buffer: StringIO, encoding: str = "utf-8"):
self.string_buffer = string_buffer
self.encoding = encoding
def __getattr__(self, attr: str):
return getattr(self.string_buffer, attr)
def read(self, size: int = -1):
content = self.string_buffer.read(size)
return content.encode(self.encoding) but I'm wondering if this can be done using Thanks so much! |
Yes, Towards the end of I assume that the following code might get you quite far if not (is_text or _is_binary_mode(handle, mode)):
handle = BytesIOWrapper(handle, encoding)
mode = mode.replace("b", "") |
@arw2019 I think that the About adding the |
@twoertwein Thank you so much for these responses!!! I'm thinking I'll do a separate PR to add the |
…straint due to renaming of `get_filepath_or_buffer` pandas-dev/pandas#37639 pandas-dev/pandas@6d1541e#diff-934d8564d648e7521db673c6399dcac98e45adfd5230ba47d3aabfcc21979febL247 PEtab-dev/PEtab#493
…fer` private without major release (semver!?) pandas-dev/pandas#37639 pandas-dev/pandas@6d1541e#diff-934d8564d648e7521db673c6399dcac98e45adfd5230ba47d3aabfcc21979febL247
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR makes
get_filepath_buffer
a private function and is called insideget_handle
- one function to rule all of IO ;)This PR will make it easier for future PRs to make the IO-related interface of
read/to_*
more consistent as most of them should support compression/memory mapping (for reading)/(binary) file handles/storage options.Notes to keep track of future follow-up PRs:
get_handle
storage_options
forto_excel