Read csv headers #37966

Merged: 44 commits, Dec 15, 2020
Changes shown below are from 7 of the 44 commits.

Commits (44)
bb3e8e6
storage_options as headers and tests added
Nov 14, 2020
db51474
additional tests - gzip, test additional headers receipt
Nov 15, 2020
6f901b8
bailed on using threading for testing
Nov 19, 2020
3af6a3d
clean up comments add json http tests
Nov 19, 2020
bad5739
Merge branch 'master' into read_csv_headers to update
Nov 19, 2020
8f5a0f1
added documentation on storage_options for headers
Nov 19, 2020
9fcc72a
DOC:Added doc for custom HTTP headers in read_csv and read_json
Nov 19, 2020
df6e539
DOC:Corrected versionadded tag and added issue number for reference
Nov 21, 2020
98db1c4
DOC:updated storage_options documentation
Nov 21, 2020
f28f36c
TST:updated with tm.assert_frame_equal
Nov 21, 2020
dd3265f
TST:fixed incorrect usage of tm.assert_frame_equal
Nov 21, 2020
02fc840
CLN:reordered imports to fix pre-commit error
Nov 21, 2020
da97f0a
DOC:changed whatsnew and added to shared_docs.py GH36688
Nov 22, 2020
fce4b17
ENH: read nonfsspec URL with headers built from storage_options GH36688
Nov 22, 2020
e0cfcb6
TST:Added additional tests parquet and other read methods GH36688
Nov 22, 2020
33115b7
TST:removed mocking in favor of threaded http server
Dec 3, 2020
5a1c64e
DOC:refined storage_options docstring
Dec 3, 2020
018a399
Merge branch 'master' into read_csv_headers
cdknox Dec 3, 2020
87d7dc6
CLN:used the github editor and had pep8 issues
Dec 3, 2020
64a0d19
CLN: leftover comment removed
Dec 3, 2020
1724e9b
TST:attempted to address test warning of unclosed socket GH36688
Dec 3, 2020
f8b8c43
TST:added pytest.importorskip to handle the two main parquet engines …
Dec 3, 2020
a17d574
CLN: imports moved to correct order GH36688
Dec 3, 2020
eed8915
TST:fix fastparquet tests GH36688
Dec 3, 2020
75573a4
CLN:removed blank line at end of docstring GH36688
Dec 3, 2020
dc596c6
CLN:removed excess newlines GH36688
Dec 3, 2020
e27e3a9
CLN:fixed flake8 issues GH36688
Dec 4, 2020
734c9d3
TST:renamed a test that was getting clobbered and fixed the logic GH3…
Dec 4, 2020
8a5c5a3
CLN:try to silence mypy error via renaming GH36688
Dec 4, 2020
978d94a
TST:pytest.importorfail replaced with pytest.skip GH36688
Dec 4, 2020
807eb25
TST:content of dataframe on error made more useful GH36688
Dec 4, 2020
44c2869
CLN:fixed flake8 error GH36688
Dec 4, 2020
01ce3ae
TST: windows fastparquet error needs raised for troubleshooting GH36688
Dec 4, 2020
13bc775
CLN:fix for flake8 GH36688
Dec 4, 2020
6915517
TST:changed compression used in to_parquet from 'snappy' to None GH36688
Dec 4, 2020
186b0a4
TST:allowed exceptions to be raised via removing a try except block G…
Dec 4, 2020
88e9600
TST:replaced try except with pytest.importorskip GH36688
Dec 4, 2020
2a05d0f
CLN:removed dict() in favor of {} GH36688
Dec 13, 2020
d38a813
Merge branch 'master' into read_csv_headers
Dec 13, 2020
268e06a
DOC: changed potentially included version from 1.2.0 to 1.3.0 GH36688
Dec 13, 2020
565197f
TST:user agent tests moved from test_common to their own file GH36688
Dec 13, 2020
842e594
TST: used fsspec instead of patching bytesio GH36688
Dec 13, 2020
c0c3d34
TST: added importorskip for fsspec on FastParquet test GH36688
Dec 13, 2020
7025abb
TST:added missing importorskip to fsspec in another test GH36688
Dec 13, 2020
12 changes: 12 additions & 0 deletions doc/source/user_guide/io.rst
@@ -1625,6 +1625,18 @@ functions - the following example shows reading a CSV file:

df = pd.read_csv("https://download.bls.gov/pub/time.series/cu/cu.item", sep="\t")

Custom headers can be sent alongside HTTP(s) requests by passing a dictionary
of header key-value mappings to the ``storage_options`` keyword argument as shown below:

.. code-block:: python

headers = {"User-Agent": "pandas"}
df = pd.read_csv(
"https://download.bls.gov/pub/time.series/cu/cu.item",
sep="\t",
storage_options=headers
)

All URLs which are not local files or HTTP(s) are handled by
`fsspec`_, if installed, and its various filesystem implementations
(including Amazon S3, Google Cloud, SSH, FTP, webHDFS...).
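
For these fsspec-handled URLs, the contents of ``storage_options`` are forwarded to the
filesystem backend rather than sent as HTTP headers. A minimal sketch, assuming ``s3fs``
is installed (the bucket name is illustrative only):

.. code-block:: python

    import pandas as pd

    # "anon" is an s3fs/fsspec option requesting anonymous access; it is passed
    # through to the filesystem, not sent as an HTTP header.
    df = pd.read_csv(
        "s3://example-bucket/data.csv",
        storage_options={"anon": True},
    )
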
17 changes: 17 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -221,6 +221,23 @@ Additionally ``mean`` supports execution via `Numba <https://numba.pydata.org/>`
the ``engine`` and ``engine_kwargs`` arguments. Numba must be installed as an optional dependency
to use this feature.

.. _whatsnew_120.read_csv_json_http_headers:

Custom HTTP(s) headers when reading csv or json files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`read_csv` and :meth:`read_json` send the key-value pairs in the dictionary passed to ``storage_options`` as custom HTTP(s) headers when reading from a URL.
For example:

.. ipython:: python

headers = {"User-Agent": "pandas"}
df = pd.read_csv(
"https://download.bls.gov/pub/time.series/cu/cu.item",
sep="\t",
storage_options=headers
)
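
The same ``storage_options`` dictionary works with :meth:`read_json`; a sketch of the
equivalent call, with a hypothetical endpoint:

.. code-block:: python

    import pandas as pd

    headers = {"User-Agent": "pandas"}
    # Hypothetical JSON endpoint; any HTTP(s) JSON source is handled the same way.
    df = pd.read_json(
        "https://example.com/data.json",
        storage_options=headers,
    )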

.. _whatsnew_120.enhancements.other:

Other enhancements
17 changes: 11 additions & 6 deletions pandas/io/common.py
@@ -288,12 +288,17 @@ def _get_filepath_or_buffer(
fsspec_mode += "b"

if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
-        # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
-        if storage_options:
-            raise ValueError(
-                "storage_options passed with file object or non-fsspec file path"
-            )
-        req = urlopen(filepath_or_buffer)
+        # TODO: fsspec can also handle HTTP via requests, but leaving this
+        # unchanged. using fsspec appears to break the ability to infer if the
+        # server responded with gzipped data
+        storage_options = storage_options or dict()
+        # waiting until now to import, matching the intended lazy logic of the
+        # urlopen function defined elsewhere in this module
+        import urllib.request
+
+        # assuming storage_options is to be interpreted as headers
+        req = urllib.request.Request(filepath_or_buffer, headers=storage_options)
+        req = urlopen(req)
content_encoding = req.headers.get("Content-Encoding", None)
if content_encoding == "gzip":
# Override compression based on Content-Encoding header
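
The hunk above treats the storage_options dict as HTTP request headers and inspects the
Content-Encoding response header to detect gzipped payloads. A standalone sketch of that
flow using only the standard library (pandas itself flags compression="gzip" for its
downstream readers rather than decompressing; the helper name and inline decompression
here are illustrative only):

import gzip
import io
import urllib.request


def fetch_url_bytes(url, storage_options=None):
    # Mirror of the logic above: storage_options becomes the request headers.
    storage_options = storage_options or {}
    req = urllib.request.Request(url, headers=storage_options)
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
        content_encoding = resp.headers.get("Content-Encoding", None)
    if content_encoding == "gzip":
        # pandas keeps the raw bytes and sets compression="gzip" instead; this
        # sketch simply decompresses so the caller gets plain bytes.
        data = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data
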
153 changes: 152 additions & 1 deletion pandas/tests/io/test_common.py
@@ -1,7 +1,8 @@
"""
Tests for the pandas.io.common functionalities
"""
-from io import StringIO
+import gzip
+from io import StringIO, BytesIO
import mmap
import os
from pathlib import Path
@@ -16,6 +17,8 @@

import pandas.io.common as icom

from unittest.mock import MagicMock, patch


class CustomFSPath:
"""For testing fspath on unknown objects"""
@@ -411,3 +414,151 @@ def test_is_fsspec_url():
assert not icom.is_fsspec_url("random:pandas/somethingelse.com")
assert not icom.is_fsspec_url("/local/path")
assert not icom.is_fsspec_url("relative/local/path")


def test_plain_text_read_csv_http_custom_headers():
true_df = pd.DataFrame({"column_name": ["column_value"]})
df_csv_bytes = true_df.to_csv(index=False).encode("utf-8")
headers = {
"User-Agent": "custom",
"Auth": "other_custom",
}

class DummyResponse:
headers = {
"Content-Type": "text/csv",
}

@staticmethod
def read():
return df_csv_bytes

@staticmethod
def close():
pass

def dummy_response_getter(url):
return DummyResponse()

dummy_request = MagicMock()
with patch("urllib.request.Request", new=dummy_request):
with patch("urllib.request.urlopen", new=dummy_response_getter):
received_df = pd.read_csv(
"http://localhost:80/test.csv", storage_options=headers
)
    dummy_request.assert_called_with("http://localhost:80/test.csv", headers=headers)
assert (received_df == true_df).all(axis=None)


def test_gzip_read_csv_http_custom_headers():
true_df = pd.DataFrame({"column_name": ["column_value"]})
df_csv_bytes = true_df.to_csv(index=False).encode("utf-8")
headers = {
"User-Agent": "custom",
"Auth": "other_custom",
}

class DummyResponse:
headers = {
"Content-Type": "text/csv",
"Content-Encoding": "gzip",
}

@staticmethod
def read():
bio = BytesIO()
zipper = gzip.GzipFile(fileobj=bio, mode="w")
zipper.write(df_csv_bytes)
zipper.close()
gzipped_response = bio.getvalue()
return gzipped_response

@staticmethod
def close():
pass

def dummy_response_getter(url):
return DummyResponse()

dummy_request = MagicMock()
with patch("urllib.request.Request", new=dummy_request):
with patch("urllib.request.urlopen", new=dummy_response_getter):
received_df = pd.read_csv(
"http://localhost:80/test.csv", storage_options=headers
)
    dummy_request.assert_called_with("http://localhost:80/test.csv", headers=headers)
assert (received_df == true_df).all(axis=None)


def test_plain_text_read_json_http_custom_headers():
true_df = pd.DataFrame({"column_name": ["column_value"]})
df_json_bytes = true_df.to_json().encode("utf-8")
headers = {
"User-Agent": "custom",
"Auth": "other_custom",
}

class DummyResponse:
headers = {
"Content-Type": "application/json",
}

@staticmethod
def read():
return df_json_bytes

@staticmethod
def close():
pass

def dummy_response_getter(url):
return DummyResponse()

dummy_request = MagicMock()
with patch("urllib.request.Request", new=dummy_request):
with patch("urllib.request.urlopen", new=dummy_response_getter):
received_df = pd.read_json(
"http://localhost:80/test.json", storage_options=headers
)
    dummy_request.assert_called_with("http://localhost:80/test.json", headers=headers)
assert (received_df == true_df).all(axis=None)


def test_gzip_read_json_http_custom_headers():
true_df = pd.DataFrame({"column_name": ["column_value"]})
df_json_bytes = true_df.to_json().encode("utf-8")
headers = {
"User-Agent": "custom",
"Auth": "other_custom",
}

class DummyResponse:
headers = {
"Content-Type": "application/json",
"Content-Encoding": "gzip",
}

@staticmethod
def read():
bio = BytesIO()
zipper = gzip.GzipFile(fileobj=bio, mode="w")
zipper.write(df_json_bytes)
zipper.close()
gzipped_response = bio.getvalue()
return gzipped_response

@staticmethod
def close():
pass

def dummy_response_getter(url):
return DummyResponse()

dummy_request = MagicMock()
with patch("urllib.request.Request", new=dummy_request):
with patch("urllib.request.urlopen", new=dummy_response_getter):
received_df = pd.read_json(
"http://localhost:80/test.json", storage_options=headers
)
    dummy_request.assert_called_with("http://localhost:80/test.json", headers=headers)
assert (received_df == true_df).all(axis=None)
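
Later commits in this pull request (33115b7, 565197f) drop the unittest.mock patching
above in favor of a threaded HTTP server and move the user-agent tests into their own
file. A sketch of that style of test against this branch of pandas; the handler class,
port handling, and frame contents below are assumptions rather than the PR's actual
test code:

import http.server
import threading

import pandas as pd
import pandas._testing as tm


class UserAgentResponder(http.server.BaseHTTPRequestHandler):
    """Serve a one-row CSV containing the User-Agent header that was received."""

    def do_GET(self):
        body = (
            pd.DataFrame({"header": [self.headers["User-Agent"]]})
            .to_csv(index=False)
            .encode("utf-8")
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.end_headers()
        self.wfile.write(body)


def test_read_csv_sends_custom_user_agent():
    # Port 0 asks the OS for any free port; serve requests from a background thread.
    server = http.server.HTTPServer(("localhost", 0), UserAgentResponder)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever)
    thread.start()
    try:
        df = pd.read_csv(
            f"http://localhost:{port}/test.csv",
            storage_options={"User-Agent": "custom-agent"},
        )
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
    tm.assert_frame_equal(df, pd.DataFrame({"header": ["custom-agent"]}))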