REF: move get_filepath_buffer into get_handle #37639

twoertwein · 2020-11-05T03:10:46Z

closes BUG:Cannot write as xlsx to GCS #33987
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

This PR makes get_filepath_or_buffer a private function that is called inside get_handle: one function to rule all of IO ;)

This PR will make it easier for future PRs to make the IO-related interface of read_*/to_* more consistent, as most of them should support compression, memory mapping (for reading), (binary) file handles, and storage options.

Notes to keep track of future follow-up PRs:

context manager for get_handle
storage_options for to_excel

pandas/io/common.py

pandas/io/parquet.py

pandas/io/excel/_base.py

jreback

wow this is quite a lot of simplification. some questions and a couple requests for followup (don't add to this PR)

doc/source/whatsnew/v1.2.0.rst

pandas/core/frame.py

pandas/io/excel/_base.py

pandas/io/excel/_openpyxl.py

pandas/io/parquet.py

pandas/io/parsers.py

jreback · 2020-11-10T13:25:42Z

pandas/io/parsers.py

+
+        if isinstance(src, list):
+            self.handles = IOHandles(
+                handle=src, compression={"method": None}  # type: ignore[arg-type]


how does this work?

yeah, that isn't nice (especially from a typing perspective). The parsers seem to cope with plain iterables (for example lists). I think a solution would be to have a new attribute for the parser (maybe self.iterable: Optional[Iterable]). If it is None, use self.handles; otherwise use the iterable?

this branch is only used by read_excel and is only used for the PythonParser. I typed src and added comments about this special case.

I will try to fix this mess later this week.

I think the 'best' solution is to temporarily silence mypy and put a list in IOHandles. PythonParser (the only parser affected by that) already had a check whether the 'file handle' has readline; if not (for the list from read_excel), it treats it as raw data.

edit:
I found a better solution: it is well-typed and calls get_handle after all potential raise(...) calls, to avoid leaking file handles.
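The ordering described in this thread (validate first, treat objects without readline as raw data, open the file last) can be illustrated with a small standard-library sketch. The function `open_source` and the `opened` list are hypothetical names for this illustration only; this is not the pandas parser code.

```python
import os
import tempfile
from typing import IO, List, Union

opened: List[IO[str]] = []  # bookkeeping for this demonstration only


def open_source(src: Union[str, List[List[str]]], sep: str):
    """Hypothetical helper: run all checks before acquiring a file handle."""
    # 1. Every check that can raise runs before any file is opened.
    if len(sep) != 1:
        raise ValueError("sep must be a single character")
    # 2. A list of rows (the read_excel case) is raw data, no handle needed.
    if isinstance(src, list):
        return src
    # 3. Only now acquire the resource.
    handle = open(src, encoding="utf-8")
    opened.append(handle)
    return handle


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("a,b\n1,2\n")

    try:
        open_source(path, sep="::")  # invalid argument
    except ValueError:
        pass
    leaked = len(opened)  # nothing was opened before the error

    rows = open_source([["a", "b"], ["1", "2"]], sep=",")

    handle = open_source(path, sep=",")
    header = handle.readline()
    handle.close()
```

Because the failing call raises before step 3, there is no handle left to clean up in the error path.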

pandas/tests/io/test_fsspec.py

pandas/io/parsers.py

pep8speaks · 2020-11-12T08:37:26Z

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-13 06:00:27 UTC

twoertwein · 2020-11-12T12:09:53Z

I made IOHandles a context manager in the second commit. I'm happy to remove this commit and make it a follow-up PR. Unfortunately, we cannot use the context manager in all places, because the file handles are often created in a constructor. ~~I think there are a few cases where get_handle could be moved up to to_*/read_*, which would allow us to use a context manager.~~
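As a rough illustration of what "a container of handles that is also a context manager" can look like, here is a minimal sketch. `HandlesSketch` is a hypothetical stand-in written for this example; the field names follow the discussion in this PR, but the real IOHandles class may differ.

```python
import io
from dataclasses import dataclass, field
from typing import IO, Any, List


@dataclass
class HandlesSketch:
    """Hypothetical stand-in for IOHandles; not the pandas class."""

    handle: IO[Any]
    # Only handles opened on the caller's behalf; user buffers are not listed.
    created_handles: List[IO[Any]] = field(default_factory=list)

    def close(self) -> None:
        # Close in reverse order of creation, then forget the handles.
        for created in reversed(self.created_handles):
            created.close()
        self.created_handles.clear()

    def __enter__(self) -> "HandlesSketch":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


# A buffer provided by the user is used but never closed.
user_buffer = io.StringIO("a,b\n1,2\n")
with HandlesSketch(handle=user_buffer) as handles:
    first_line = handles.handle.readline()

# A buffer we created ourselves is registered and closed on exit.
owned = io.StringIO("x\n")
with HandlesSketch(handle=owned, created_handles=[owned]) as handles:
    handles.handle.read()
```

When the handles are created in a constructor, as mentioned above, the `with` form is not available and `close()` has to be called explicitly.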

twoertwein · 2020-11-12T22:53:34Z

doc/source/user_guide/io.rst

-contains a "b":
+``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object
+opened binary mode. In most cases, it is not necessary to specify
+``mode`` as Pandas will auto-detect whether the file object is


I haven't yet found an object that needs an explicit mode
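To give an idea of how such auto-detection can work, here is a sketch using only the standard library. `looks_binary` is a hypothetical helper for illustration; it is an assumption about the general approach, not the detection logic that pandas uses.

```python
import io
from typing import Any


def looks_binary(handle: Any, mode: str = "w") -> bool:
    """Hypothetical helper: guess whether ``handle`` expects bytes."""
    # Text streams always expect str.
    if isinstance(handle, io.TextIOBase):
        return False
    # Raw and buffered binary streams always expect bytes.
    if isinstance(handle, (io.RawIOBase, io.BufferedIOBase)):
        return True
    # Unknown objects: fall back to their mode attribute, else the given mode.
    return "b" in str(getattr(handle, "mode", mode))


text_result = looks_binary(io.StringIO())
bytes_result = looks_binary(io.BytesIO())
unknown_result = looks_binary(object(), mode="wb")
```

Only the last case, an object that is neither a recognizable text nor binary stream and has no `mode` attribute, would need the explicit `mode` argument.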

twoertwein · 2020-11-12T23:22:19Z

pandas/io/stata.py

-            self.ioargs.filepath_or_buffer = BytesIO(contents)  # type: ignore[arg-type]
-            self.ioargs.should_close = True
-        self.path_or_buf = cast(BytesIO, self.ioargs.filepath_or_buffer)
+            contents = handles.handle.read()


I don't know why that is done. The file handle should always be in binary mode. If it wasn't in binary mode read would return a string which BytesIO would complain about. Maybe there is a performance reason (seeking faster in BytesIO than the provided buffer, just speculating)?
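One effect of copying the contents into a BytesIO, whatever the original motivation was, is that the result is always seekable, at the cost of holding the whole payload in memory. The sketch below shows this with a hypothetical `ForwardOnlyStream` class that stands in for a non-seekable source; it does not use the Stata reader.

```python
import io


class ForwardOnlyStream(io.BytesIO):
    """Hypothetical binary stream that refuses to seek (like a pipe)."""

    def seekable(self) -> bool:
        return False

    def seek(self, *args, **kwargs):
        raise io.UnsupportedOperation("seek")


source = ForwardOnlyStream(b"header" + b"\x00" * 10 + b"tail")

# Copy everything into memory; the copy supports random access.
buffered = io.BytesIO(source.read())
buffered.seek(-4, io.SEEK_END)
tail = buffered.read()
```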

jreback

wow, this is way cleaner than before even. once more merge master and ping on green.

jreback · 2020-11-13T13:39:34Z

thanks @twoertwein very nice!

arw2019 · 2020-12-10T06:44:11Z

@twoertwein I have a question about the code you added in this PR. In #31817 I'm adding a new csv reader engine (pyarrow) to which I have to provide BytesIO (StringIO is not accepted). I currently have a wrapper to handle this conversion:

from io import StringIO


class BytesIOWrapper:
    """
    Allows the pyarrow engine for read_csv() to read from string buffers
    """

    def __init__(self, string_buffer: StringIO, encoding: str = "utf-8"):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr: str):
        return getattr(self.string_buffer, attr)

    def read(self, size: int = -1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)

but I'm wondering if this can be done using pandas.io.common.get_handle? I have tried but couldn't quite get it to work. Is it possible with the code as is? If it's not is it something worth adding to get_handle?

Thanks so much!
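For reference, a short usage sketch of the wrapper above, with the class repeated so the snippet runs on its own. One detail worth noting, as an observation about this code rather than anything stated in the thread: `size` counts characters of the underlying string buffer, so `read(n)` can return more than `n` bytes for non-ASCII text.

```python
from io import StringIO


class BytesIOWrapper:
    """Same wrapper as above, repeated so this snippet is self-contained."""

    def __init__(self, string_buffer: StringIO, encoding: str = "utf-8"):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr: str):
        # Everything except read() is forwarded to the text buffer.
        return getattr(self.string_buffer, attr)

    def read(self, size: int = -1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)


wrapper = BytesIOWrapper(StringIO("a,b\n1,ü\n"))
first = wrapper.read(4)  # four characters, here also four bytes
rest = wrapper.read()    # remainder, encoded as UTF-8

# One character may become several bytes.
chunk = BytesIOWrapper(StringIO("üü")).read(1)
```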

twoertwein · 2020-12-10T07:20:47Z

Yes, get_handle would be a good place for this and you probably wouldn't even need any new arguments (is_text and mode might be sufficient).

Towards the end of get_handle there is code for the opposite (wrapping binary handles in a TextIOWrapper). Adding an elif-block with your wrapper after the TextIOWrapper would be a good place. On the other hand, it could also be beneficial to add your wrapper for any text handle(?) before applying compression, as we currently do not support compression for text handles.

I assume that the following code might get you quite far

if not (is_text or _is_binary_mode(handle, mode)):
    handle = BytesIOWrapper(handle, encoding) 
    mode = mode.replace("b", "")
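For comparison, the opposite direction mentioned above (a binary handle wrapped for text access) is available in the standard library as io.TextIOWrapper. The following is a standalone sketch of that wrapping, not the code inside get_handle:

```python
import io

binary_handle = io.BytesIO("a,b\n1,ü\n".encode("utf-8"))

# Decode on the fly; newline="" leaves line endings untranslated.
text_handle = io.TextIOWrapper(binary_handle, encoding="utf-8", newline="")
lines = text_handle.readlines()

# Detach so that discarding the wrapper does not close the caller's buffer.
text_handle.detach()
```

The `detach()` call matters because closing or garbage-collecting a TextIOWrapper otherwise closes the underlying binary buffer as well.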

twoertwein · 2020-12-10T16:55:53Z

@arw2019 I think that the BytesIOWrapper is on its own already really useful: I think many doc-strings state that the function accepts any file handle but in fact it is in most cases either text or a binary handle.

About adding the BytesIOWrapper to get_handle: you probably also need to make sure that the wrapper itself is not added to created_handles, otherwise close will be called on the wrapper, which will then call close on the underlying buffer. If the buffer is provided by a user, we shouldn't close it. If the buffer is created by us, it is already in created_handles and will be closed. An easier solution might be to make close a no-op; then you (or a later part in get_handle) can add it to created_handles.
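A minimal sketch of the "make close a no-op" idea, assuming a simple proxy object. `NoCloseWrapper` and the local `created_handles` list are hypothetical names for this illustration and are not taken from pandas.

```python
import io


class NoCloseWrapper:
    """Hypothetical proxy that forwards everything except close()."""

    def __init__(self, buffer):
        self.buffer = buffer

    def __getattr__(self, attr):
        # Only called for attributes not found on the wrapper itself.
        return getattr(self.buffer, attr)

    def close(self) -> None:
        # Deliberately do nothing: the buffer belongs to whoever created it.
        pass


user_buffer = io.StringIO("a,b\n")

# The wrapper can be tracked like any other created handle ...
created_handles = [NoCloseWrapper(user_buffer)]
for handle in created_handles:
    handle.close()

# ... and closing it leaves the user's buffer usable.
still_open = not user_buffer.closed
content = user_buffer.read()
```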

arw2019 · 2020-12-11T04:34:30Z

@twoertwein Thank you so much for these responses!!!

I'm thinking I'll do a separate PR to add the BytesIOWrapper class and integrate it into get_handle (as #31817 is already quite bloated). I'll cc you on that and would love your feedback if you have time to look!

…straint due to renaming of `get_filepath_or_buffer` pandas-dev/pandas#37639 pandas-dev/pandas@6d1541e#diff-934d8564d648e7521db673c6399dcac98e45adfd5230ba47d3aabfcc21979febL247 PEtab-dev/PEtab#493

…fer` private without major release (semver!?) pandas-dev/pandas#37639 pandas-dev/pandas@6d1541e#diff-934d8564d648e7521db673c6399dcac98e45adfd5230ba47d3aabfcc21979febL247

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

twoertwein marked this pull request as ready for review November 5, 2020 08:31

twoertwein marked this pull request as draft November 5, 2020 19:41

jreback added the IO Data (IO issues that don't fit into a more specific label) and Refactor (Internal refactoring of code) labels Nov 6, 2020

twoertwein commented Nov 6, 2020

pandas/io/common.py

twoertwein commented Nov 6, 2020

pandas/io/common.py

twoertwein commented Nov 7, 2020

pandas/io/parquet.py

twoertwein marked this pull request as ready for review November 7, 2020 19:06

twoertwein commented Nov 7, 2020

pandas/io/excel/_base.py

jreback requested changes Nov 10, 2020

jreback added this to the 1.2 milestone Nov 10, 2020

twoertwein commented Nov 11, 2020

pandas/io/parsers.py (outdated)

twoertwein commented Nov 12, 2020

jreback approved these changes Nov 13, 2020

twoertwein added 2 commits November 13, 2020 00:59

REF: move get_filepath_buffer into get_handle

7190290

make IOHandles a context manager

4f1fad8

jreback merged commit 6d1541e into pandas-dev:master Nov 13, 2020

twoertwein deleted the io branch November 13, 2020 14:45

This was referenced Nov 13, 2020

ENH: storage_options for to_excel #37818

Merged

CLN: remove unnecessary close calls and add a few necessary ones #37837

Merged

CLN: remove something.xlsx #37857

Merged

BUG: use compression=None (again) to avoid inferring compression #37909

Merged

dilpath mentioned this pull request Dec 29, 2020

fix for pandas 1.2.0 PEtab-dev/PEtab#493

Merged

fphammerle mentioned this pull request Dec 31, 2020

adapt to breaking change in pandas v1.2.0 making get_filepath_or_buffer private without major release (semver!?) fphammerle/freesurfer-stats#15

Merged

jotasi mentioned this pull request Dec 31, 2020

TYP: investigate/fix ignored mypy errors #37715

Closed

arw2019 mentioned this pull request Dec 31, 2020

[WIP] ENH: add Pyarrow csv engine #38370

Closed

5 tasks

arw2019 mentioned this pull request Jan 24, 2021

REF: in pandas.io.common integerate BytesIOWrapper into IOHandle #39383

Closed

twoertwein mentioned this pull request Feb 4, 2021

BUG: ExcelWriter with mode='a' corrupts file #39576

Closed

3 tasks

twoertwein mentioned this pull request Sep 24, 2021

TYP: use __all__ to signal public API to type checkers #43695

Merged

4 tasks

akx mentioned this pull request Oct 10, 2022

REGR: be able to read Stata files without reading them fully into memory #48922

Closed

7 tasks

akx added a commit to akx/pandas that referenced this pull request Oct 10, 2022

REGR: be able to read Stata files without reading them fully into memory

0c51920

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022

REGR: be able to read Stata files without reading them fully into memory

0c7bb6a

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022

REGR: be able to read Stata files without reading them fully into memory

300084d

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022

REGR: be able to read Stata files without reading them fully into memory

6d6f8d2

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022

REGR: be able to read Stata files without reading them fully into memory

45cee6f

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022

REGR: be able to read Stata files without reading them fully into memory

a7a9799

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 14, 2022

REGR: be able to read Stata files without reading them fully into memory

b93d6fb

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx added a commit to akx/pandas that referenced this pull request Oct 15, 2022

REGR: be able to read Stata files without reading them fully into memory

169a3aa

Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e

akx mentioned this pull request Oct 21, 2022

CLN/FIX/PERF: Don't buffer entire Stata file into memory #49228

Merged

8 tasks
