Storage options #35381


Merged: 33 commits, merged Aug 10, 2020.

Changes from 24 commits.
6 changes: 6 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -265,6 +265,12 @@ SSH, FTP, dropbox and github. For docs and capabilities, see the `fsspec docs`_.
The existing capability to interface with S3 and GCS will be unaffected by this
change, as ``fsspec`` will still bring in the same packages as before.

Many read/write functions have acquired the ``storage_options`` optional argument,
to pass a dictionary of parameters to the storage backend. This allows, for
example, passing credentials to S3 and GCS storage. The details of what
parameters can be passed to which backends can be found in the documentation
of the individual storage backends (linked from the fsspec docs).

Review comment (Contributor): is there a link from the fsspec docs to the storage backends?

.. _Azure Datalake and Blob: https://github.com/dask/adlfs

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/
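
For illustration, a minimal sketch of the new keyword in use (the bucket name and credentials are placeholders, and "s3://" URLs additionally require the optional s3fs package):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The storage_options dict is handed to the fsspec backend (s3fs here).
df.to_csv(
    "s3://my-bucket/frame.csv",
    storage_options={"key": "<ACCESS_KEY>", "secret": "<SECRET_KEY>"},
)

# The same keyword works on the reading side.
df2 = pd.read_csv(
    "s3://my-bucket/frame.csv",
    index_col=0,
    storage_options={"key": "<ACCESS_KEY>", "secret": "<SECRET_KEY>"},
)
```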
18 changes: 18 additions & 0 deletions pandas/conftest.py
@@ -1224,3 +1224,21 @@ def sort_by_key(request):
Tests None (no key) and the identity key.
"""
return request.param


@pytest.fixture()
def fsspectest():
pytest.importorskip("fsspec")
from fsspec.implementations.memory import MemoryFileSystem
from fsspec import register_implementation

class TestMemoryFS(MemoryFileSystem):
protocol = "testmem"
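        # Class-level, single-element list: ``__init__`` records the most
        # recent "test" storage option here, so tests can observe it even
        # though fsspec may cache and reuse filesystem instances.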
test = [None]

def __init__(self, **kwargs):
self.test[0] = kwargs.pop("test", None)
super().__init__(**kwargs)

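    # The positional ``True`` is fsspec's ``clobber`` flag (see the review
    # comment below): overwrite any existing registration for "testmem".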
register_implementation("testmem", TestMemoryFS, True)

Review comment (Contributor): Keyword for ``True``?

Review comment (Contributor): Do you need to deregister the implementation as a teardown? Does any state leak between tests?

Reply (Author): Not really, but done now anyway. The tests only check that ``.test[0]`` changes, always to a new value.

return TestMemoryFS()
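
For context, a sketch of the kind of test this fixture enables (the test name and file path are illustrative):

```python
import pandas as pd

def test_csv_options(fsspectest):
    df = pd.DataFrame({"a": [0]})
    # TestMemoryFS.__init__ pops the "test" entry and records it.
    df.to_csv("testmem://test/test.csv", storage_options={"test": "csv_write"})
    assert fsspectest.test[0] == "csv_write"
    pd.read_csv("testmem://test/test.csv", storage_options={"test": "csv_read"})
    assert fsspectest.test[0] == "csv_read"
```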
36 changes: 33 additions & 3 deletions pandas/core/frame.py
@@ -2055,6 +2055,7 @@ def to_stata(
version: Optional[int] = 114,
convert_strl: Optional[Sequence[Label]] = None,
compression: Union[str, Mapping[str, str], None] = "infer",
storage_options: Optional[Dict[str, Any]] = None,
) -> None:
"""
Export DataFrame object to Stata dta format.
@@ -2131,6 +2132,16 @@ def to_stata(

.. versionadded:: 1.1.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
            docs for the set of allowed keys and values.

.. versionadded:: 1.1.0

Review comment (Member): 1.2.0 (i.e. the ``versionadded`` above should read 1.2.0).


Raises
------
NotImplementedError
@@ -2187,6 +2198,7 @@ def to_stata(
write_index=write_index,
variable_labels=variable_labels,
compression=compression,
storage_options=storage_options,
**kwargs,
)
writer.write_file()
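
With the new keyword threaded through, writing a Stata file to an fsspec URL looks like this (a sketch: bucket and credentials are placeholders, and s3fs must be installed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0]})
df.to_stata(
    "s3://my-bucket/frame.dta",
    storage_options={"key": "<ACCESS_KEY>", "secret": "<SECRET_KEY>"},
)
```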
@@ -2239,9 +2251,10 @@ def to_feather(self, path, **kwargs) -> None:
)
def to_markdown(
self,
-        buf: Optional[IO[str]] = None,
-        mode: Optional[str] = None,
+        buf: Optional[Union[IO[str], str]] = None,
+        mode: str = "wt",
index: bool = True,
storage_options: Optional[Dict[str, Any]] = None,
**kwargs,
) -> Optional[str]:
if "showindex" in kwargs:
@@ -2259,9 +2272,14 @@ def to_markdown(
result = tabulate.tabulate(self, **kwargs)
if buf is None:
return result
-        buf, _, _, _ = get_filepath_or_buffer(buf, mode=mode)
+        buf, _, _, should_close = get_filepath_or_buffer(
+            buf, mode=mode, storage_options=storage_options
+        )
assert buf is not None # Help mypy.
assert not isinstance(buf, str)
buf.writelines(result)
if should_close:
buf.close()
return None
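
The two code paths above can be exercised as follows (a sketch: ``to_markdown`` needs the optional tabulate package, and the bucket is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# No buffer: the markdown table is returned as a string.
text = df.to_markdown()

# fsspec URL: the table is written out, and the buffer opened by
# get_filepath_or_buffer is closed because should_close is True.
df.to_markdown("s3://my-bucket/frame.md", storage_options={"anon": False})
```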

@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
@@ -2272,6 +2290,7 @@ def to_parquet(
compression: Optional[str] = "snappy",
index: Optional[bool] = None,
partition_cols: Optional[List[str]] = None,
storage_options: Optional[Dict[str, Any]] = None,
**kwargs,
) -> None:
"""
@@ -2320,6 +2339,16 @@

.. versionadded:: 0.24.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
            docs for the set of allowed keys and values.

.. versionadded:: 1.1.0

**kwargs
Additional arguments passed to the parquet library. See
:ref:`pandas io <io.parquet>` for more details.
@@ -2366,6 +2395,7 @@ def to_parquet(
compression=compression,
index=index,
partition_cols=partition_cols,
storage_options=storage_options,
**kwargs,
)

41 changes: 40 additions & 1 deletion pandas/core/generic.py
@@ -2042,6 +2042,7 @@ def to_json(
compression: Optional[str] = "infer",
index: bool_t = True,
indent: Optional[int] = None,
storage_options: Optional[Dict[str, Any]] = None,
) -> Optional[str]:
"""
Convert the object to a JSON string.
@@ -2125,6 +2126,16 @@ def to_json(

.. versionadded:: 1.0.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
            docs for the set of allowed keys and values.

.. versionadded:: 1.1.0

Returns
-------
None or str
@@ -2303,6 +2314,7 @@ def to_json(
compression=compression,
index=index,
indent=indent,
storage_options=storage_options,
)

def to_hdf(
@@ -2617,6 +2629,7 @@ def to_pickle(
path,
compression: Optional[str] = "infer",
protocol: int = pickle.HIGHEST_PROTOCOL,
storage_options: Optional[Dict[str, Any]] = None,
) -> None:
"""
Pickle (serialize) object to file.
@@ -2637,6 +2650,16 @@ def to_pickle(

.. [1] https://docs.python.org/3/library/pickle.html.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
            docs for the set of allowed keys and values.

.. versionadded:: 1.1.0

See Also
--------
read_pickle : Load pickled pandas object (or any object) from file.
@@ -2670,7 +2693,13 @@ def to_pickle(
"""
from pandas.io.pickle import to_pickle

-        to_pickle(self, path, compression=compression, protocol=protocol)
+        to_pickle(
+            self,
+            path,
+            compression=compression,
+            protocol=protocol,
+            storage_options=storage_options,
+        )

def to_clipboard(
self, excel: bool_t = True, sep: Optional[str] = None, **kwargs
@@ -3010,6 +3039,7 @@ def to_csv(
escapechar: Optional[str] = None,
decimal: Optional[str] = ".",
errors: str = "strict",
storage_options: Optional[Dict[str, Any]] = None,
) -> Optional[str]:
r"""
Write object to a comma-separated values (csv) file.
@@ -3109,6 +3139,14 @@ def to_csv(
See the errors argument for :func:`open` for a full list
of options.

storage_options : dict, optional

Review comment (Contributor): I wonder if there is a way to share doc-string components for all of these I/O methods.

Reply (Author): I am not aware of a way.

Review comment (Contributor): Yeah, I think we can do this with our shared docs infra, but out of scope for now.

Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
            docs for the set of allowed keys and values.

.. versionadded:: 1.1.0

Returns
@@ -3163,6 +3201,7 @@ def to_csv(
doublequote=doublequote,
escapechar=escapechar,
decimal=decimal,
storage_options=storage_options,
)
formatter.save()

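Taken together, the generic.py changes let JSON, pickle, and CSV round-trip through fsspec URLs; for example with GCS (a sketch: requires gcsfs, and the bucket and token are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

token = "/path/to/credentials.json"  # gcsfs also accepts e.g. "anon"
df.to_json("gcs://my-bucket/frame.json", storage_options={"token": token})
df2 = pd.read_json("gcs://my-bucket/frame.json", storage_options={"token": token})
```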
4 changes: 2 additions & 2 deletions pandas/core/series.py
@@ -1421,7 +1421,7 @@ def to_string(
def to_markdown(
self,
buf: Optional[IO[str]] = None,
-        mode: Optional[str] = None,
+        mode: str = "wt",
index: bool = True,
**kwargs,
) -> Optional[str]:
@@ -1435,7 +1435,7 @@ def to_markdown(
buf : str, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
mode : str, optional
-            Mode in which file is opened.
+            Mode in which file is opened, "wt" by default.
index : bool, optional, default True
Add index (row) labels.

20 changes: 18 additions & 2 deletions pandas/io/common.py
@@ -167,8 +167,16 @@ def get_filepath_or_buffer(
compression : {{'gzip', 'bz2', 'zip', 'xz', None}}, optional
encoding : the encoding to use to decode bytes, default is 'utf-8'
mode : str, optional
-    storage_options: dict, optional
-        passed on to fsspec, if using it; this is not yet accessed by the public API
+    storage_options : dict, optional
+        Extra options that make sense for a particular storage connection, e.g.
+        host, port, username, password, etc., if using a URL that will
+        be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
+        will be raised if providing this argument with a local path or
+        a file-like buffer. See the fsspec and backend storage implementation
+        docs for the set of allowed keys and values.
+
+        .. versionadded:: 1.1.0

Returns
-------
@@ -180,6 +188,10 @@

if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
# TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
if storage_options:
raise ValueError(
"storage_options passed with file object or non-fsspec file path"
)
req = urlopen(filepath_or_buffer)
content_encoding = req.headers.get("Content-Encoding", None)
if content_encoding == "gzip":
@@ -234,6 +246,10 @@
).open()

return file_obj, encoding, compression, True
elif storage_options:
raise ValueError(
"storage_options passed with file object or non-fsspec file path"
)

if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
return _expand_user(filepath_or_buffer), None, compression, False
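The two new branches make the keyword fail loudly rather than being silently ignored; a sketch of the resulting behaviour:

```python
import pandas as pd

df = pd.DataFrame({"a": [1]})

# A plain local path never reaches fsspec, so storage_options is rejected.
try:
    df.to_csv("frame.csv", storage_options={"anon": True})
except ValueError as err:
    print(err)  # storage_options passed with file object or non-fsspec file path
```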
24 changes: 19 additions & 5 deletions pandas/io/feather_format.py
@@ -4,17 +4,27 @@

from pandas import DataFrame, Int64Index, RangeIndex

-from pandas.io.common import get_filepath_or_buffer, stringify_path
+from pandas.io.common import get_filepath_or_buffer


-def to_feather(df: DataFrame, path, **kwargs):
+def to_feather(df: DataFrame, path, storage_options=None, **kwargs):
"""
Write a DataFrame to the binary Feather format.

Parameters
----------
df : DataFrame
path : string file path, or file-like object

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
        docs for the set of allowed keys and values.

.. versionadded:: 1.1.0
**kwargs :
Additional keywords passed to `pyarrow.feather.write_feather`.

@@ -23,7 +33,9 @@ def to_feather(df: DataFrame, path, **kwargs):
import_optional_dependency("pyarrow")
from pyarrow import feather

-    path = stringify_path(path)
+    path, _, _, should_close = get_filepath_or_buffer(
+        path, mode="wb", storage_options=storage_options
+    )

if not isinstance(df, DataFrame):
raise ValueError("feather only support IO with DataFrames")
@@ -64,7 +76,7 @@ def to_feather(df: DataFrame, path, **kwargs):
feather.write_feather(df, path, **kwargs)


-def read_feather(path, columns=None, use_threads: bool = True):
+def read_feather(path, columns=None, use_threads: bool = True, storage_options=None):
"""
Load a feather-format object from the file path.

@@ -98,7 +110,9 @@ def read_feather(path, columns=None, use_threads: bool = True):
import_optional_dependency("pyarrow")
from pyarrow import feather

-    path, _, _, should_close = get_filepath_or_buffer(path)
+    path, _, _, should_close = get_filepath_or_buffer(
+        path, storage_options=storage_options
+    )

df = feather.read_feather(path, columns=columns, use_threads=bool(use_threads))

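As with the other writers, feather can now target fsspec URLs directly (a sketch: requires pyarrow plus an fsspec backend such as s3fs; the bucket is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

df.to_feather("s3://my-bucket/frame.feather", storage_options={"anon": False})
df2 = pd.read_feather("s3://my-bucket/frame.feather", storage_options={"anon": False})
```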
9 changes: 7 additions & 2 deletions pandas/io/formats/csvs.py
@@ -5,7 +5,7 @@
import csv as csvlib
from io import StringIO
import os
-from typing import Hashable, List, Mapping, Optional, Sequence, Union
+from typing import Any, Dict, Hashable, List, Mapping, Optional, Sequence, Union
import warnings
from zipfile import ZipFile

@@ -54,6 +54,7 @@ def __init__(
doublequote: bool = True,
escapechar: Optional[str] = None,
decimal=".",
storage_options: Optional[Dict[str, Any]] = None,

Review comment (Contributor): Maybe add this to typing.py, e.g. ``StorageOptions = ...``.

Reply (Author): Up to you.

Reply (Author): I am leaving it for now, but can do as you suggest if you request it.

Review comment (Contributor): Seeing as we are doing this in lots of places, I would add it.

):
self.obj = obj

@@ -64,7 +65,11 @@ def __init__(
compression, self.compression_args = get_compression_method(compression)

self.path_or_buf, _, _, self.should_close = get_filepath_or_buffer(
-            path_or_buf, encoding=encoding, compression=compression, mode=mode
+            path_or_buf,
+            encoding=encoding,
+            compression=compression,
+            mode=mode,
+            storage_options=storage_options,
)
self.sep = sep
self.na_rep = na_rep
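
Following up on the review thread above, a sketch of what the suggested alias might look like (the name and a home in pandas._typing are assumptions; pandas later added something similar):

```python
# pandas/_typing.py (hypothetical location for the alias)
from typing import Any, Dict, Optional

StorageOptions = Optional[Dict[str, Any]]
```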