IO: Fix S3 Error Handling #33645


Merged: 8 commits, Apr 21, 2020
Changes from 3 commits
4 changes: 3 additions & 1 deletion pandas/io/formats/csvs.py
@@ -62,7 +62,7 @@ def __init__(
# Extract compression mode as given, if dict
compression, self.compression_args = get_compression_method(compression)

self.path_or_buf, _, _, _ = get_filepath_or_buffer(
self.path_or_buf, _, _, self.should_close = get_filepath_or_buffer(
Member Author:
@TomAugspurger - your pointer here was the exact problem #32486 (comment)

path_or_buf, encoding=encoding, compression=compression, mode=mode
)
self.sep = sep
@@ -223,6 +223,8 @@ def save(self) -> None:
f.close()
for _fh in handles:
_fh.close()
elif self.should_close:
f.close()

def _save_header(self):
writer = self.writer
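For context, a minimal sketch of the contract this change relies on, assuming the pandas-1.0-era get_filepath_or_buffer signature visible in this diff (write_and_close is a hypothetical helper for illustration, not pandas API):

```python
from pandas.io.common import get_filepath_or_buffer

def write_and_close(path_or_buf, data: str) -> None:
    # get_filepath_or_buffer opens s3:// (and similar) paths itself and
    # signals ownership of the resulting handle via should_close.
    f, _encoding, _compression, should_close = get_filepath_or_buffer(
        path_or_buf, mode="w"
    )
    try:
        f.write(data)
    finally:
        # Close only handles we opened; caller-provided buffers stay open.
        if should_close:
            f.close()
```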
4 changes: 3 additions & 1 deletion pandas/io/parquet.py
@@ -92,7 +92,7 @@ def write(
**kwargs,
):
self.validate_dataframe(df)
path, _, _, _ = get_filepath_or_buffer(path, mode="wb")
path, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
Member Author:
For some reason we only have this logic in read. Should add to write too

Contributor:
hmm, can you open an issue for this?

Member Author:
For which bit? The file basically needs to be closed post-write for s3fs to throw as expected. @TomAugspurger mentioned it here: #32486 (comment)

Contributor:
> For some reason we only have this logic in read. Should add to write too

your comment here.


from_pandas_kwargs: Dict[str, Any] = {"schema": kwargs.pop("schema", None)}
if index is not None:
@@ -109,6 +109,8 @@
)
else:
self.api.parquet.write_table(table, path, compression=compression, **kwargs)
if should_close:
path.close()

def read(self, path, columns=None, **kwargs):
path, _, _, should_close = get_filepath_or_buffer(path)
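Why the close matters, sketched with s3fs directly (illustrative only; the bucket name is borrowed from the tests below, and the exact exception depends on the s3fs/botocore versions installed):

```python
import s3fs

fs = s3fs.S3FileSystem()
f = fs.open("s3://an_s3_bucket_data_doesnt_exit/not_real.parquet", "wb")
f.write(b"not real parquet bytes")  # buffered locally; nothing has reached S3 yet
f.close()  # the upload happens here, so a missing bucket only raises now
```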
16 changes: 15 additions & 1 deletion pandas/tests/io/parser/test_network.py
@@ -159,7 +159,7 @@ def test_parse_public_s3_bucket_nrows_python(self, tips_df):
assert not df.empty
tm.assert_frame_equal(tips_df.iloc[:10], df)

def test_s3_fails(self):
def test_read_s3_fails(self):
with pytest.raises(IOError):
read_csv("s3://nyqpug/asdf.csv")

@@ -168,6 +168,20 @@ def test_s3_fails(self):
with pytest.raises(IOError):
read_csv("s3://cant_get_it/file.csv")

def test_write_s3_csv_fails(self, tips_df):
# GH 32486
with pytest.raises(
FileNotFoundError, match="The specified bucket does not exist"
):
tips_df.to_csv("s3://an_s3_bucket_data_doesnt_exit/not_real.csv")

def test_write_s3_parquet_fails(self, tips_df):
# GH 27679
Contributor:
Does this need an importorskip, or do we raise prior to importing the engine?

Member Author:
Added an importorskip. Without it, the test would raise ImportError if PyArrow or fastparquet isn’t installed.

Contributor:
Can you add the decorator version instead?

Member Author:
Done - also changed the other importorskip usages in this file.

with pytest.raises(
FileNotFoundError, match="The specified bucket does not exist"
):
tips_df.to_parquet("s3://an_s3_bucket_data_doesnt_exit/not_real.parquet")

def test_read_csv_handles_boto_s3_object(self, s3_resource, tips_file):
# see gh-16135

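The two skip styles discussed above, side by side (a sketch; td.skip_if_no is the pandas test-decorator helper already used elsewhere in this test suite):

```python
import pytest

import pandas.util._test_decorators as td

def test_inline_skip():
    pytest.importorskip("pyarrow")  # skips at runtime if pyarrow is missing
    ...

@td.skip_if_no("pyarrow")  # decorator version requested by the reviewer
def test_decorator_skip():
    ...
```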
32 changes: 9 additions & 23 deletions pandas/tests/io/test_gcs.py
@@ -56,7 +56,15 @@ def open(*args):

monkeypatch.setattr("gcsfs.GCSFileSystem", MockGCSFileSystem)
df1.to_csv("gs://test/test.csv", index=True)
df2 = read_csv(StringIO(s.getvalue()), parse_dates=["dt"], index_col=0)
Member Author:
This no longer works since the file needs to be closed to validate credentials. #32486 (comment)


def mock_get_filepath_or_buffer(*args, **kwargs):
return StringIO(df1.to_csv()), None, None, False

monkeypatch.setattr(
"pandas.io.gcs.get_filepath_or_buffer", mock_get_filepath_or_buffer
)

df2 = read_csv("gs://test/test.csv", parse_dates=["dt"], index_col=0)

tm.assert_frame_equal(df1, df2)
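A miniature of the failure mode described in the comment above, assuming the mock's buffer is the handle that now gets closed (illustrative; uses only the stdlib):

```python
from io import StringIO

s = StringIO()
s.write("a,b\n1,2\n")
s.close()      # to_csv now closes handles it owns (should_close)
s.getvalue()   # ValueError: I/O operation on closed file
```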

@@ -86,28 +94,6 @@ def open(self, path, mode="r", *args):
)


@td.skip_if_no("gcsfs")
def test_gcs_get_filepath_or_buffer(monkeypatch):
Member Author:
Removed, since test_to_csv_gcs now tests the same functionality and uses an identical monkeypatch.

df1 = DataFrame(
{
"int": [1, 3],
"float": [2.0, np.nan],
"str": ["t", "s"],
"dt": date_range("2018-06-18", periods=2),
}
)

def mock_get_filepath_or_buffer(*args, **kwargs):
return (StringIO(df1.to_csv(index=False)), None, None, False)

monkeypatch.setattr(
"pandas.io.gcs.get_filepath_or_buffer", mock_get_filepath_or_buffer
)
df2 = read_csv("gs://test/test.csv", parse_dates=["dt"])

tm.assert_frame_equal(df1, df2)


@td.skip_if_installed("gcsfs")
def test_gcs_not_present_exception():
with pytest.raises(ImportError) as e: