
Read csv headers #37966

Merged: 44 commits, Dec 15, 2020
Changes from 36 commits
Commits (44)
bb3e8e6
storage_options as headers and tests added
Nov 14, 2020
db51474
additional tests - gzip, test additional headers receipt
Nov 15, 2020
6f901b8
bailed on using threading for testing
Nov 19, 2020
3af6a3d
clean up comments add json http tests
Nov 19, 2020
bad5739
Merge branch 'master' into read_csv_headers to update
Nov 19, 2020
8f5a0f1
added documentation on storage_options for headers
Nov 19, 2020
9fcc72a
DOC:Added doc for custom HTTP headers in read_csv and read_json
Nov 19, 2020
df6e539
DOC:Corrected versionadded tag and added issue number for reference
Nov 21, 2020
98db1c4
DOC:updated storage_options documentation
Nov 21, 2020
f28f36c
TST:updated with tm.assert_frame_equal
Nov 21, 2020
dd3265f
TST:fixed incorrect usage of tm.assert_frame_equal
Nov 21, 2020
02fc840
CLN:reordered imports to fix pre-commit error
Nov 21, 2020
da97f0a
DOC:changed whatsnew and added to shared_docs.py GH36688
Nov 22, 2020
fce4b17
ENH: read nonfsspec URL with headers built from storage_options GH36688
Nov 22, 2020
e0cfcb6
TST:Added additional tests parquet and other read methods GH36688
Nov 22, 2020
33115b7
TST:removed mocking in favor of threaded http server
Dec 3, 2020
5a1c64e
DOC:refined storage_options docstring
Dec 3, 2020
018a399
Merge branch 'master' into read_csv_headers
cdknox Dec 3, 2020
87d7dc6
CLN:used the github editor and had pep8 issues
Dec 3, 2020
64a0d19
CLN: leftover comment removed
Dec 3, 2020
1724e9b
TST:attempted to address test warning of unclosed socket GH36688
Dec 3, 2020
f8b8c43
TST:added pytest.importorskip to handle the two main parquet engines …
Dec 3, 2020
a17d574
CLN: imports moved to correct order GH36688
Dec 3, 2020
eed8915
TST:fix fastparquet tests GH36688
Dec 3, 2020
75573a4
CLN:removed blank line at end of docstring GH36688
Dec 3, 2020
dc596c6
CLN:removed excess newlines GH36688
Dec 3, 2020
e27e3a9
CLN:fixed flake8 issues GH36688
Dec 4, 2020
734c9d3
TST:renamed a test that was getting clobbered and fixed the logic GH3…
Dec 4, 2020
8a5c5a3
CLN:try to silence mypy error via renaming GH36688
Dec 4, 2020
978d94a
TST:pytest.importorfail replaced with pytest.skip GH36688
Dec 4, 2020
807eb25
TST:content of dataframe on error made more useful GH36688
Dec 4, 2020
44c2869
CLN:fixed flake8 error GH36688
Dec 4, 2020
01ce3ae
TST: windows fastparquet error needs raised for troubleshooting GH36688
Dec 4, 2020
13bc775
CLN:fix for flake8 GH36688
Dec 4, 2020
6915517
TST:changed compression used in to_parquet from 'snappy' to None GH36688
Dec 4, 2020
186b0a4
TST:allowed exceptions to be raised via removing a try except block G…
Dec 4, 2020
88e9600
TST:replaced try except with pytest.importorskip GH36688
Dec 4, 2020
2a05d0f
CLN:removed dict() in favor of {} GH36688
Dec 13, 2020
d38a813
Merge branch 'master' into read_csv_headers
Dec 13, 2020
268e06a
DOC: changed potentially included version from 1.2.0 to 1.3.0 GH36688
Dec 13, 2020
565197f
TST:user agent tests moved from test_common to their own file GH36688
Dec 13, 2020
842e594
TST: used fsspec instead of patching bytesio GH36688
Dec 13, 2020
c0c3d34
TST: added importorskip for fsspec on FastParquet test GH36688
Dec 13, 2020
7025abb
TST:added missing importorskip to fsspec in another test GH36688
Dec 13, 2020
14 changes: 14 additions & 0 deletions doc/source/user_guide/io.rst
@@ -1625,6 +1625,20 @@ functions - the following example shows reading a CSV file:

df = pd.read_csv("https://download.bls.gov/pub/time.series/cu/cu.item", sep="\t")

.. versionadded:: 1.2.0

A custom header can be sent alongside HTTP(S) requests by passing a dictionary
of header key-value mappings to the ``storage_options`` keyword argument as shown below:

.. code-block:: python

headers = {"User-Agent": "pandas"}
df = pd.read_csv(
"https://download.bls.gov/pub/time.series/cu/cu.item",
sep="\t",
storage_options=headers
)

All URLs which are not local files or HTTP(s) are handled by
`fsspec`_, if installed, and its various filesystem implementations
(including Amazon S3, Google Cloud, SSH, FTP, webHDFS...).
20 changes: 20 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -251,6 +251,26 @@ Additionally ``mean`` supports execution via `Numba <https://numba.pydata.org/>`
the ``engine`` and ``engine_kwargs`` arguments. Numba must be installed as an optional dependency
to use this feature.

.. _whatsnew_120.read_csv_json_http_headers:

Custom HTTP(S) headers when reading csv or json files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When reading from a remote URL that is not handled by fsspec (i.e. HTTP and
HTTPS), the dictionary passed to ``storage_options`` will be used to create the
headers included in the request. This can be used to control the User-Agent
header or send other custom headers (:issue:`36688`).
For example:

.. ipython:: python

headers = {"User-Agent": "pandas"}
df = pd.read_csv(
"https://download.bls.gov/pub/time.series/cu/cu.item",
sep="\t",
storage_options=headers
)

.. _whatsnew_120.enhancements.other:

Other enhancements
9 changes: 4 additions & 5 deletions pandas/core/shared_docs.py
@@ -383,8 +383,7 @@
     "storage_options"
 ] = """storage_options : dict, optional
     Extra options that make sense for a particular storage connection, e.g.
-    host, port, username, password, etc., if using a URL that will
-    be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
-    will be raised if providing this argument with a non-fsspec URL.
-    See the fsspec and backend storage implementation docs for the set of
-    allowed keys and values."""
+    host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
+    are forwarded to ``urllib`` as header options. For other URLs (e.g.
+    starting with "s3://", and "gcs://") the key-value pairs are forwarded to
+    ``fsspec``. Please see ``fsspec`` and ``urllib`` for more details."""
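
To make the two code paths described in the reworded docstring concrete, here is a minimal sketch; the URLs and token are placeholders, and the ``s3://`` call assumes ``fsspec`` and ``s3fs`` are installed:

import pandas as pd

# HTTP(S) URL: the storage_options dict is sent as-is as request headers
# via urllib (placeholder URL and token).
df_http = pd.read_csv(
    "https://example.com/data.csv",
    storage_options={"User-Agent": "pandas", "Authorization": "Bearer <token>"},
)

# fsspec-handled URL: the same keyword is forwarded to the filesystem
# backend instead (placeholder bucket; "anon" is s3fs's anonymous-access
# option).
df_s3 = pd.read_csv(
    "s3://my-bucket/data.csv",
    storage_options={"anon": True},
)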
17 changes: 11 additions & 6 deletions pandas/io/common.py
@@ -276,12 +276,17 @@ def _get_filepath_or_buffer(
         fsspec_mode += "b"
 
     if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
-        # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
-        if storage_options:
-            raise ValueError(
-                "storage_options passed with file object or non-fsspec file path"
-            )
-        req = urlopen(filepath_or_buffer)
+        # TODO: fsspec can also handle HTTP via requests, but leaving this
+        # unchanged. using fsspec appears to break the ability to infer if the
+        # server responded with gzipped data
+        storage_options = storage_options or dict()
+        # waiting until now for importing to match intended lazy logic of
+        # urlopen function defined elsewhere in this module
+        import urllib.request
+
+        # assuming storage_options is to be interpreted as headers
+        req_info = urllib.request.Request(filepath_or_buffer, headers=storage_options)
+        req = urlopen(req_info)
         content_encoding = req.headers.get("Content-Encoding", None)
         if content_encoding == "gzip":
             # Override compression based on Content-Encoding header
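
Since the diff above is dense, here is a standalone sketch of the same header-forwarding and gzip-detection logic; ``fetch_with_headers`` and the URL are hypothetical names, and ``storage_options`` is assumed to be a plain dict of header names to values:

from typing import Optional
import urllib.request

def fetch_with_headers(url: str, storage_options: Optional[dict] = None):
    """Open url, forwarding storage_options entries as HTTP headers."""
    # Build the request with custom headers, as in the code above.
    req_info = urllib.request.Request(url, headers=storage_options or {})
    req = urllib.request.urlopen(req_info)
    # Mirror the Content-Encoding check: a gzipped response lets the caller
    # override whatever compression was inferred from the file extension.
    compression = "gzip" if req.headers.get("Content-Encoding") == "gzip" else None
    return req, compression

# Example (placeholder URL):
# resp, compression = fetch_with_headers(
#     "https://example.com/data.csv", {"User-Agent": "pandas"}
# )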
39 changes: 32 additions & 7 deletions pandas/io/parquet.py
Expand Up @@ -14,7 +14,13 @@
from pandas import DataFrame, MultiIndex, get_option
from pandas.core import generic

from pandas.io.common import IOHandles, get_handle, is_fsspec_url, stringify_path
from pandas.io.common import (
IOHandles,
get_handle,
is_fsspec_url,
is_url,
stringify_path,
)


def get_engine(engine: str) -> "BaseImpl":
@@ -66,8 +72,10 @@ def _get_path_or_handle(
         fs, path_or_handle = fsspec.core.url_to_fs(
             path_or_handle, **(storage_options or {})
         )
-    elif storage_options:
-        raise ValueError("storage_options passed with buffer or non-fsspec filepath")
+    elif storage_options and (not is_url(path_or_handle) or mode != "rb"):
+        # can't write to a remote url
+        # without making use of fsspec at the moment
+        raise ValueError("storage_options passed with buffer, or non-supported URL")
 
     handles = None
     if (
@@ -79,7 +87,9 @@
         # use get_handle only when we are very certain that it is not a directory
         # fsspec resources can also point to directories
         # this branch is used for example when reading from non-fsspec URLs
-        handles = get_handle(path_or_handle, mode, is_text=False)
+        handles = get_handle(
+            path_or_handle, mode, is_text=False, storage_options=storage_options
+        )
         fs = None
         path_or_handle = handles.handle
     return path_or_handle, handles, fs
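
From the caller's side, the relaxed guard behaves roughly as sketched below; the URLs are placeholders and the outcomes follow from the condition above rather than from documented guarantees:

import pandas as pd

# Reading ("rb") from an HTTP(S) URL with storage_options is now allowed:
# the dict falls through to get_handle() and becomes request headers.
df = pd.read_parquet(
    "https://example.com/data.parquet",  # placeholder URL
    storage_options={"User-Agent": "pandas"},
)

# Writing with storage_options to a non-fsspec target still raises, since
# uploading over plain HTTP(S) would require fsspec:
df.to_parquet(
    "https://example.com/out.parquet",  # placeholder URL
    storage_options={"User-Agent": "pandas"},
)  # ValueError: storage_options passed with buffer, or non-supported URL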
@@ -307,7 +317,9 @@ def read(
             # use get_handle only when we are very certain that it is not a directory
             # fsspec resources can also point to directories
             # this branch is used for example when reading from non-fsspec URLs
-            handles = get_handle(path, "rb", is_text=False)
+            handles = get_handle(
+                path, "rb", is_text=False, storage_options=storage_options
+            )
             path = handles.handle
             parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
 
@@ -404,10 +416,12 @@ def to_parquet(
         return None
 
 
+@doc(storage_options=generic._shared_docs["storage_options"])
 def read_parquet(
     path,
     engine: str = "auto",
     columns=None,
+    storage_options: StorageOptions = None,
     use_nullable_dtypes: bool = False,
     **kwargs,
 ):

Review comment from a project member on the new ``storage_options`` line: "This new parameter should maybe be added after use_nullable_dtypes."
@@ -432,13 +446,18 @@ def read_parquet(
     By file-like object, we refer to objects with a ``read()`` method,
     such as a file handle (e.g. via builtin ``open`` function)
     or ``StringIO``.
-    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
+    engine : {{'auto', 'pyarrow', 'fastparquet'}}, default 'auto'
         Parquet library to use. If 'auto', then the option
         ``io.parquet.engine`` is used. The default ``io.parquet.engine``
         behavior is to try 'pyarrow', falling back to 'fastparquet' if
         'pyarrow' is unavailable.
     columns : list, default=None
         If not None, only these columns will be read from the file.
+
+    {storage_options}
+
+        .. versionadded:: 1.2.0
+
     use_nullable_dtypes : bool, default False
         If True, use dtypes that use ``pd.NA`` as missing value indicator
         for the resulting DataFrame (only applicable for ``engine="pyarrow"``).
@@ -448,6 +467,7 @@
         support dtypes) may change without notice.
 
         .. versionadded:: 1.2.0
+
     **kwargs
         Any additional kwargs are passed to the engine.
 
@@ -456,6 +476,11 @@
     DataFrame
     """
     impl = get_engine(engine)
+
     return impl.read(
-        path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
+        path,
+        columns=columns,
+        storage_options=storage_options,
+        use_nullable_dtypes=use_nullable_dtypes,
+        **kwargs,
     )
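
The ``@doc`` decorator substitutes the shared ``storage_options`` text at the ``{storage_options}`` placeholder, which is also why the literal braces around the ``engine`` options are doubled in the diff above. Below is a simplified stand-in for the mechanism; the real decorator lives in ``pandas.util._decorators`` and is more elaborate, and ``read_parquet_stub`` is a hypothetical name:

def doc(**kwargs):
    """Toy version of pandas' @doc: run the docstring through str.format."""
    def decorator(func):
        func.__doc__ = func.__doc__.format(**kwargs)
        return func
    return decorator

shared_storage_options = (
    "storage_options : dict, optional\n"
    "        Extra options forwarded to urllib or fsspec."
)

@doc(storage_options=shared_storage_options)
def read_parquet_stub(path, storage_options=None):
    """Read a parquet file.

    {storage_options}

    engine : {{'auto', 'pyarrow', 'fastparquet'}}, default 'auto'
        Doubled braces survive str.format() as literal braces.
    """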