Skip to content

REF: move get_filepath_buffer into get_handle #37639

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 13, 2020
Merged

REF: move get_filepath_buffer into get_handle #37639

merged 2 commits into from
Nov 13, 2020

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Nov 5, 2020

This PR makes get_filepath_buffer a private function and is called inside get_handle - one function to rule all of IO ;)

This PR will make it easier for future PRs to make the IO-related interface of read/to_* more consistent as most of them should support compression/memory mapping (for reading)/(binary) file handles/storage options.

Notes to keep track of future follow-up PRs:

  • context manager for get_handle
  • storage_options for to_excel

@twoertwein twoertwein marked this pull request as ready for review November 5, 2020 08:31
@twoertwein twoertwein marked this pull request as draft November 5, 2020 19:41
@jreback jreback added IO Data IO issues that don't fit into a more specific label Refactor Internal refactoring of code labels Nov 6, 2020
@twoertwein twoertwein marked this pull request as ready for review November 7, 2020 19:06
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow this is a quite a lot of simplification. some questions and a couple requests for followup (don't add to this PR)


if isinstance(src, list):
self.handles = IOHandles(
handle=src, compression={"method": None} # type: ignore[arg-type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that isn't nice (especially from a typing perspective). The parsers seems to cope with objects implementing __next__ (for example lists). I think a solution would be to have a new attribute for the parser (maybe self.iterable : Optional[Iteratable]). If it is None, use self.handles otherwise use the iterable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this branch is only used by read_excel and is only used for the PythonParser. I typed src and added comments about this special case.

I will try to fix this mess later this week.

Copy link
Member Author

@twoertwein twoertwein Nov 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 'best' solution is to temporarily silence mypy and put a list in IOHandles. PythonParser (the only parser affected by that) had already a check whether the 'file handle' has readline, if not (for the list from read_excel) it treats it as raw data.

edit:
I found a better solution (well-typed and call get_handle after all potential raise(...) calls to avoid leaking file handles)

@jreback jreback added this to the 1.2 milestone Nov 10, 2020
@pep8speaks
Copy link

pep8speaks commented Nov 12, 2020

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-13 06:00:27 UTC

@twoertwein
Copy link
Member Author

twoertwein commented Nov 12, 2020

I made IOHandles a context manager in the second commit. I'm happy to remove this commit and make it a follow up PR. Unfortunately, we cannot use the context manager in all places. The file handles are often created in a constructor. I think there are a few cases were get_handle could be moved up to to/read_ which would allow us to use a context manager.

contains a "b":
``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object
opened binary mode. In most cases, it is not necessary to specify
``mode`` as Pandas will auto-detect whether the file object is
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't yet found an object that needs an explicit mode

self.ioargs.filepath_or_buffer = BytesIO(contents) # type: ignore[arg-type]
self.ioargs.should_close = True
self.path_or_buf = cast(BytesIO, self.ioargs.filepath_or_buffer)
contents = handles.handle.read()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why that is done. The file handle should always be in binary mode. If it wasn't in binary mode read would return a string which BytesIO would complain about. Maybe there is a performance reason (seeking faster in BytesIO than the provided buffer, just speculating)?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, this is way cleaner that before even. once more merge master and ping on green.

@jreback jreback merged commit 6d1541e into pandas-dev:master Nov 13, 2020
@jreback
Copy link
Contributor

jreback commented Nov 13, 2020

thanks @twoertwein very nice!

@arw2019
Copy link
Member

arw2019 commented Dec 10, 2020

@twoertwein I have a question about the code you added in this PR. In #31817 I'm adding a new csv reader engine (pyarrow) to which I have to provide BytesIO (StringIO is not accepted). I currently have a wrapper to handle this conversion:

class BytesIOWrapper:
    """
    Allows the pyarrow engine for read_csv() to read from string buffers
    """

    def __init__(self, string_buffer: StringIO, encoding: str = "utf-8"):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr: str):
        return getattr(self.string_buffer, attr)

    def read(self, size: int = -1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)

but I'm wondering if this can be done using pandas.io.common.get_handle? I have tried but couldn't quite get it to work. Is it possible with the code as is? If it's not is it something worth adding to get_handle?

Thanks so much!

@twoertwein
Copy link
Member Author

Yes, get_handle would be a good place for this and you probably wouldn't even need any new arguments (is_text and mode might be sufficient).

Towards the end of get_handle there is code for the opposite (wrap binary handles in a TextIOWapper). Adding an elif-block with your wrapper after the TextIOWrapper would be a good place. On the other side, it could also be beneficial to add your wrapper for any text handle(?) before applying compression as we currently do not support compression for text handles.

I assume that the following code might get you quite far

if not (is_text or _is_binary_mode(handle, mode)):
    handle = BytesIOWrapper(handle, encoding) 
    mode = mode.replace("b", "")

@twoertwein
Copy link
Member Author

@arw2019 I think that the BytesIOWrapper is on its own already really useful: I think many doc-strings state that the function accepts any file handle but in fact it is in most cases either text or a binary handle.

About adding the BytesIOWrapper to get_handle: you probbly also need to make sure that the wrapper is itself not added to created_handles, otherwise close will be called on the wrapper which then will call close on the underlying buffer. If the buffer is provided by a user, we shouldn't close it. If the buffer is created by us, it is already in created_handle and will be closed. Any easier solution might be to make close a no-op, then you (or a later part in get_handle) can add it to created_handles.

@arw2019
Copy link
Member

arw2019 commented Dec 11, 2020

@twoertwein Thank you so much for these responses!!!

I'm thinking I'll do a separate PR to add the BytesIOWrapper class and integrate it into get_handle (as #31817 is already quite bloated). I'll cc you on that and would love your feedback if you have time to look!

@arw2019 arw2019 mentioned this pull request Dec 31, 2020
5 tasks
akx added a commit to akx/pandas that referenced this pull request Oct 10, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 14, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG:Cannot write as xlsx to GCS
5 participants