Skip to content

Storage options #35381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Aug 10, 2020
Merged

Storage options #35381

merged 33 commits into from
Aug 10, 2020

Conversation

martindurant
Copy link
Contributor

@martindurant martindurant commented Jul 22, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@martindurant martindurant marked this pull request as draft July 22, 2020 15:47
@martindurant
Copy link
Contributor Author

martindurant commented Jul 23, 2020

Refs: #28377 , #30359 , #33639 , #16692 (and probably more)

@martindurant
Copy link
Contributor Author

This PR is near complete.
There may be more places where storage_options can be added, but the most important cases are covered and tested.

Question: the docstring for the kwarg should be added, but it would be repeated verbatim in every place; is this the right thing to do?

(cc @TomAugspurger )

@TomAugspurger
Copy link
Contributor

Question: the docstring for the kwarg should be added, but it would be repeated verbatim in every place; is this the right thing to do?

Yeah I think it needs to be duplicated. We do share some sharing between some readers, but not all. It'll be easier to just add the same lines to each, and add an example in one place (perhaps read_csv).

@TomAugspurger TomAugspurger added API Design IO Data IO issues that don't fit into a more specific label labels Jul 23, 2020
@TomAugspurger TomAugspurger added this to the 1.2 milestone Jul 23, 2020
buf: Optional[IO[str]] = None,
mode: Optional[str] = None,
buf: Optional[Union[IO[str], str]] = None,
mode: Optional[str] = "wt",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need to change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode=None doesn't actually make sense. This is for writing, and most open commands (including python builtin) would default to "rt" or "rb".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if changing this, could also remove Optional from type annotation.

@@ -234,6 +236,8 @@ def get_filepath_or_buffer(
).open()

return file_obj, encoding, compression, True
elif storage_options:
raise ValueError("storage_options passed with non-fsspec URL")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the "URL" here actually a user-provided buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local file or HTTP URL; but I guess an already open file instance would end up here too, worth checking.

@martindurant
Copy link
Contributor Author

(fails against numpy dev with "ValueError: invalid array_struct" while reading stata)

@martindurant
Copy link
Contributor Author

Added text to io.rst

@martindurant
Copy link
Contributor Author

I disagree with "file-like objects must be iterable" in is_file_like - most uses (in pandas) require only read, and certainly for stata.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger
Copy link
Contributor

IIRC we've gone back and forth on exactly what is required. I don't especially care what is required, but I'd prefer to be consistent in what we accept across readers unless there's a good reason not to be. Hopefully @gfyoung can comment here.

@martindurant
Copy link
Contributor Author

isort is complaining, but passes for me locally. Can someone tell me what's wrong?

Check import format using isort
##[error]ERROR: /home/runner/work/pandas/pandas/pandas/conftest.py Imports are incorrectly sorted and/or formatted.
Check import format using isort DONE
##[error]Process completed with exit code 1.

code:

@pytest.fixture()
def fsspectest():
    pytest.importorskip("fsspec")
    from fsspec.implementations.memory import MemoryFileSystem
    from fsspec import register_implementation
    from fsspec.registry import _registry as registry

@TomAugspurger
Copy link
Contributor

Perhaps check that the version of isort matches?

Also some warnings in io.rst: https://github.com/pandas-dev/pandas/pull/35381/checks?check_run_id=954318397#step:6:2333.

@martindurant
Copy link
Contributor Author

@TomAugspurger you were right, newer isort solved it; also fixed the RST link.

@martindurant
Copy link
Contributor Author

ping

@TomAugspurger
Copy link
Contributor

My only outstanding question is #35381 (comment).

@martindurant
Copy link
Contributor Author

@gfyoung has not yet responded, and I would rather not require iterable files, nor change the utils code that other things might be depending on. I suppose I'm stuck for now! This particular case is only for stata, and the previous code didn't use is_file_like either.

(note that IOBase does guarantee iterability, but I'm not certain that all file-like implementations do - not that I can imagine why they wouldn't)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable a few comments

@@ -3126,6 +3156,16 @@ def to_csv(

.. versionadded:: 1.1.0

storage_options : dict, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if there is a way to share doc-strings components for all of these i/o methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of a way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think we can do this with our shared docs infra, but out of scope for now

@@ -54,6 +54,7 @@ def __init__(
doublequote: bool = True,
escapechar: Optional[str] = None,
decimal=".",
storage_options: Optional[Dict[str, Any]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing as we are doing this in lots of places I would add it

@@ -1931,7 +1937,9 @@ def read_stata(


def _open_file_binary_write(
fname: FilePathOrBuffer, compression: Union[str, Mapping[str, str], None],
fname: FilePathOrBuffer,
compression: Union[str, Mapping[str, str], None],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also add Compression to typing.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't me, but the styling; compression as a kwarg has different types in different places (most often str, but also Optional[Union[str, Mapping[str, Any]]], which is similar), so I don't think there's a useful common type to be made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm interesting, we ought to fix that. can you create an issue about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain it's wrong - sometimes the args is required, but could be None, other times it's not required, but the default value can be None or str (e.g., "infer").

Mainly defining the StorageOptions type
@@ -21,7 +21,8 @@ to pass a dictionary of parameters to the storage backend. This allows, for
example, for passing credentials to S3 and GCS storage. The details of what
parameters can be passed to which backends can be found in the documentation
of the individual storage backends (detailed from the fsspec docs for
`builtin implementations`_ and linked to `external ones`_).
`builtin implementations`_ and linked to `external ones`_). See
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these referenced in io.rst (more important that they are there), ok if they are here as well (but not really necessary)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I phrased it a bit differently (one general link, one specific instead of two specific) - I'll make the two places more similar.

@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

##[error]doc/source/user_guide/io.rst:1706:83:E501:line too long (86 > 79 characters)

lint issue

@martindurant
Copy link
Contributor Author

lint issue

@jreback : fixed I think. Even linting the docs...

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's to avoid horizontal scroll bars in the code snippets in the HTML rendered version.

@TomAugspurger TomAugspurger merged commit df1d440 into pandas-dev:master Aug 10, 2020
@TomAugspurger
Copy link
Contributor

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants