Storage options #35381
Merged 33 commits on Aug 10, 2020
61 changes: 52 additions & 9 deletions doc/source/user_guide/io.rst
@@ -1649,29 +1649,72 @@ options include:
Specifying any of the above options will produce a ``ParserWarning`` unless the
python engine is selected explicitly using ``engine='python'``.
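For example, a sketch using ``skipfooter``, which is one of the options only
supported by the python engine (the file name is hypothetical):

.. code-block:: python

   # No ParserWarning: the python engine is requested explicitly
   df = pd.read_csv('data.csv', skipfooter=1, engine='python')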

.. _io.remote:

Reading/writing remote files
''''''''''''''''''''''''''''

You can pass in a URL to read or write remote files to many of pandas' IO
functions; the following example shows reading a CSV file:

.. code-block:: python

   df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
                    sep='\t')

All URLs that are not local files or HTTP(S) are handled by
`fsspec`_, if installed, and its various filesystem implementations
(including Amazon S3, Google Cloud, SSH, FTP, webHDFS, and more).
Some of these implementations require additional packages to be
installed; for example, S3 URLs require the `s3fs
<https://pypi.org/project/s3fs/>`_ library:

.. code-block:: python

   df = pd.read_csv('s3://pandas-test/tips.csv')
   df = pd.read_json('s3://pandas-test/adatafile.json')

When dealing with remote storage systems, you might need
extra configuration with environment variables or config files in
special locations. For example, to access data in your S3 bucket,
you will need to define credentials in one of the several ways listed in
the `S3Fs documentation
<https://s3fs.readthedocs.io/en/latest/#credentials>`_. The same is true
for several of the storage backends, and you should follow the links
at `fsimpl1`_ for implementations built into ``fsspec`` and `fsimpl2`_
for those not included in the main ``fsspec``
distribution.
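For example, a minimal sketch with ``s3fs``: the credentials are read from the
standard AWS environment variables, so none need to be passed in the call
itself (the bucket name and key values are hypothetical):

.. code-block:: python

   import os

   # Placeholder credentials, picked up automatically by botocore/s3fs
   os.environ["AWS_ACCESS_KEY_ID"] = "<access-key>"
   os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-key>"

   df = pd.read_csv("s3://my-private-bucket/data.csv")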

You can also pass parameters directly to the backend driver. For example,
if you do *not* have S3 credentials, you can still access public data by
specifying an anonymous connection, such as

.. versionadded:: 1.2.0

.. code-block:: python

pd.read_csv("s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
"-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
storage_options={"anon": True})

``fsspec`` also allows complex URLs, for accessing data in compressed
archives, local caching of files, and more. To locally cache the above
example, you would modify the call to

.. code-block:: python

   pd.read_csv("simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
               "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
               storage_options={"s3": {"anon": True}})

where we specify that the "anon" parameter is meant for the "s3" part of
the URL chain, not for the caching implementation. Note that this caches to a
temporary directory for the duration of the session only, but you can also
specify a permanent store.
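As a sketch of such a permanent store, fsspec's ``filecache`` protocol keeps
the downloaded file in a local directory of your choosing (``cache_storage``
is a parameter of fsspec's caching filesystems):

.. code-block:: python

   pd.read_csv("filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
               "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
               storage_options={"s3": {"anon": True},
                                "filecache": {"cache_storage": "data-cache"}})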

.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/
.. _fsimpl1: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
.. _fsimpl2: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations

Writing out data
''''''''''''''''
14 changes: 14 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -13,6 +13,20 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

Passing arguments to fsspec backends
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many read/write functions have acquired the ``storage_options`` optional argument,
to pass a dictionary of parameters to the storage backend. This allows, for
example, passing credentials to S3 and GCS storage. The details of what
parameters can be passed to which backends can be found in the documentation
of the individual storage backends (detailed from the fsspec docs for
`builtin implementations`_ and linked to `external ones`_). See
Contributor: are these referenced in io.rst (more important that they are there), ok if they are here as well (but not really necessary)

Contributor Author: I phrased it a bit differently (one general link, one specific instead of two specific) - I'll make the two places more similar.

Section :ref:`io.remote`.

.. _builtin implementations: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
.. _external ones: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations
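A minimal sketch of the new argument in use (the bucket is hypothetical):

.. code-block:: python

   df = pd.read_csv("s3://my-bucket/data.csv", storage_options={"anon": True})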

.. _whatsnew_120.binary_handle_to_csv:

Support for binary file handles in ``to_csv``
3 changes: 3 additions & 0 deletions pandas/_typing.py
@@ -106,3 +106,6 @@
List[AggFuncTypeBase],
Dict[Label, Union[AggFuncTypeBase, List[AggFuncTypeBase]]],
]

# for arbitrary kwargs passed during reading/writing files
StorageOptions = Optional[Dict[str, Any]]
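For illustration, a hypothetical reader annotated with the new alias might
look like:

from pandas._typing import StorageOptions

def read_foo(path: str, storage_options: StorageOptions = None) -> None:
    # Hypothetical function; would forward storage_options to the backend
    ...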
22 changes: 22 additions & 0 deletions pandas/conftest.py
@@ -1224,3 +1224,25 @@ def sort_by_key(request):
Tests None (no key) and the identity key.
"""
return request.param


@pytest.fixture()
def fsspectest():
    pytest.importorskip("fsspec")
    from fsspec import register_implementation
    from fsspec.implementations.memory import MemoryFileSystem
    from fsspec.registry import _registry as registry

    class TestMemoryFS(MemoryFileSystem):
        protocol = "testmem"
        test = [None]

        def __init__(self, **kwargs):
            self.test[0] = kwargs.pop("test", None)
            super().__init__(**kwargs)

    register_implementation("testmem", TestMemoryFS, clobber=True)
    yield TestMemoryFS()
    registry.pop("testmem", None)
    TestMemoryFS.test[0] = None
    TestMemoryFS.store.clear()
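A test can then assert that ``storage_options`` reached the backend by
inspecting the filesystem the fixture yields (a sketch modeled on the fixture
above; the test name is hypothetical):

import pandas as pd

def test_to_csv_storage_options(fsspectest):
    df = pd.DataFrame({"a": [0]})
    # TestMemoryFS.__init__ pops the "test" kwarg passed via storage_options
    df.to_csv("testmem://afile.csv", index=False, storage_options={"test": "csv_write"})
    assert fsspectest.test[0] == "csv_write"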
37 changes: 34 additions & 3 deletions pandas/core/frame.py
@@ -55,6 +55,7 @@
Label,
Level,
Renamer,
StorageOptions,
ValueKeyFunc,
)
from pandas.compat import PY37
@@ -2056,6 +2057,7 @@ def to_stata(
version: Optional[int] = 114,
convert_strl: Optional[Sequence[Label]] = None,
compression: Union[str, Mapping[str, str], None] = "infer",
storage_options: StorageOptions = None,
) -> None:
"""
Export DataFrame object to Stata dta format.
@@ -2132,6 +2134,16 @@

.. versionadded:: 1.1.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values.

.. versionadded:: 1.2.0

Raises
------
NotImplementedError
@@ -2192,6 +2204,7 @@ def to_stata(
write_index=write_index,
variable_labels=variable_labels,
compression=compression,
storage_options=storage_options,
**kwargs,
)
writer.write_file()
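As a sketch of what this enables (hypothetical bucket, placeholder
credentials; requires ``s3fs``):

import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
df.to_stata(
    "s3://my-bucket/data.dta",
    storage_options={"key": "<access-key>", "secret": "<secret-key>"},
)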
@@ -2244,9 +2257,10 @@ def to_feather(self, path, **kwargs) -> None:
)
def to_markdown(
self,
buf: Optional[IO[str]] = None,
mode: Optional[str] = None,
buf: Optional[Union[IO[str], str]] = None,
mode: str = "wt",
index: bool = True,
storage_options: StorageOptions = None,
**kwargs,
) -> Optional[str]:
if "showindex" in kwargs:
@@ -2264,9 +2278,14 @@ def to_markdown(
result = tabulate.tabulate(self, **kwargs)
if buf is None:
return result
buf, _, _, _ = get_filepath_or_buffer(buf, mode=mode)
buf, _, _, should_close = get_filepath_or_buffer(
buf, mode=mode, storage_options=storage_options
)
assert buf is not None # Help mypy.
assert not isinstance(buf, str)
buf.writelines(result)
if should_close:
buf.close()
return None
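With ``buf`` now accepting a path as well as a handle, both of the following
work (a sketch; requires the ``tabulate`` package, and remote URLs go through
``get_filepath_or_buffer`` as above):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
text = df.to_markdown()     # buf=None still returns the table as a string
df.to_markdown("frame.md")  # a path is opened (and closed) for writing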

@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
@@ -2277,6 +2296,7 @@ def to_parquet(
compression: Optional[str] = "snappy",
index: Optional[bool] = None,
partition_cols: Optional[List[str]] = None,
storage_options: StorageOptions = None,
**kwargs,
) -> None:
"""
@@ -2325,6 +2345,16 @@

.. versionadded:: 0.24.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values.

.. versionadded:: 1.2.0

**kwargs
Additional arguments passed to the parquet library. See
:ref:`pandas io <io.parquet>` for more details.
@@ -2371,6 +2401,7 @@
compression=compression,
index=index,
partition_cols=partition_cols,
storage_options=storage_options,
**kwargs,
)
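For example, with ``gcsfs`` and a parquet engine installed, a GCS target can
be written directly (hypothetical bucket; ``token`` is a gcsfs option and may
point at a service-account key file):

import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
df.to_parquet(
    "gcs://my-bucket/data.parquet",
    storage_options={"token": "service-account.json"},
)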

44 changes: 43 additions & 1 deletion pandas/core/generic.py
@@ -40,6 +40,7 @@
Label,
Level,
Renamer,
StorageOptions,
TimedeltaConvertibleTypes,
TimestampConvertibleTypes,
ValueKeyFunc,
@@ -2042,6 +2043,7 @@ def to_json(
compression: Optional[str] = "infer",
index: bool_t = True,
indent: Optional[int] = None,
storage_options: StorageOptions = None,
) -> Optional[str]:
"""
Convert the object to a JSON string.
@@ -2125,6 +2127,16 @@

.. versionadded:: 1.0.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values.

.. versionadded:: 1.2.0

Returns
-------
None or str
@@ -2303,6 +2315,7 @@ def to_json(
compression=compression,
index=index,
indent=indent,
storage_options=storage_options,
)
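A sketch of the result, pointing ``to_json`` at an S3-compatible service by
forwarding botocore client options (the endpoint is hypothetical, e.g. a
local MinIO instance):

import pandas as pd

df = pd.DataFrame({"x": [1]})
df.to_json(
    "s3://my-bucket/data.json",
    storage_options={"client_kwargs": {"endpoint_url": "http://localhost:9000"}},
)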

def to_hdf(
@@ -2617,6 +2630,7 @@ def to_pickle(
path,
compression: Optional[str] = "infer",
protocol: int = pickle.HIGHEST_PROTOCOL,
storage_options: StorageOptions = None,
) -> None:
"""
Pickle (serialize) object to file.
@@ -2637,6 +2651,16 @@

.. [1] https://docs.python.org/3/library/pickle.html.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values.

.. versionadded:: 1.2.0

See Also
--------
read_pickle : Load pickled pandas object (or any object) from file.
@@ -2670,7 +2694,13 @@ def to_pickle(
"""
from pandas.io.pickle import to_pickle

to_pickle(self, path, compression=compression, protocol=protocol)
to_pickle(
self,
path,
compression=compression,
protocol=protocol,
storage_options=storage_options,
)
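Combined with the ``fsspectest`` fixture above, the new plumbing can be
exercised end to end (a sketch; ``testmem://`` only resolves while that
fixture's filesystem is registered):

import pandas as pd

df = pd.DataFrame({"a": [0]})
# "test" is consumed by the fixture's TestMemoryFS.__init__
df.to_pickle("testmem://df.pkl", storage_options={"test": "pickle_write"})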

def to_clipboard(
self, excel: bool_t = True, sep: Optional[str] = None, **kwargs
@@ -3015,6 +3045,7 @@ def to_csv(
escapechar: Optional[str] = None,
decimal: Optional[str] = ".",
errors: str = "strict",
storage_options: StorageOptions = None,
) -> Optional[str]:
r"""
Write object to a comma-separated values (csv) file.
@@ -3126,6 +3157,16 @@

.. versionadded:: 1.1.0

storage_options : dict, optional
Contributor: i wonder if there is a way to share doc-strings components for all of these i/o methods

Contributor Author: I am not aware of a way

Contributor: yeah i think we can do this with our shared docs infra, but out of scope for now

Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a local path or
a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values.

.. versionadded:: 1.2.0

Returns
-------
None or str
Expand Down Expand Up @@ -3178,6 +3219,7 @@ def to_csv(
doublequote=doublequote,
escapechar=escapechar,
decimal=decimal,
storage_options=storage_options,
)
formatter.save()
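A round trip through the new argument might look like (hypothetical bucket,
placeholder credentials):

import pandas as pd

opts = {"key": "<access-key>", "secret": "<secret-key>"}
df = pd.DataFrame({"a": [1, 2]})
df.to_csv("s3://my-bucket/out.csv", index=False, storage_options=opts)
back = pd.read_csv("s3://my-bucket/out.csv", storage_options=opts)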
