
Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow #34842


Closed
wants to merge 12 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -991,6 +991,7 @@ I/O
- Bug in :meth:`~SQLDatabase.execute` was raising a ``ProgrammingError`` for some DB-API drivers when the SQL statement contained the `%` character and no parameters were present (:issue:`34211`)
- Bug in :meth:`~pandas.io.stata.StataReader` which resulted in categorical variables with difference dtypes when reading data using an iterator. (:issue:`31544`)
- :meth:`HDFStore.keys` has now an optional `include` parameter that allows the retrieval of all native HDF5 table names (:issue:`29916`)
- ``pandas.io.parquet.PyArrowImpl`` now infers ``filesystem`` from the provided ``path`` if ``filesystem`` is not supplied via ``kwargs`` (:issue:`34841`)

Plotting
^^^^^^^^
6 changes: 5 additions & 1 deletion pandas/io/parquet.py
@@ -104,19 +104,23 @@ def write(
        from_pandas_kwargs["preserve_index"] = index

        table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

        fs = kwargs.pop("filesystem", get_fs_for_path(path))
Member:
Perhaps we should wait until https://github.com/pandas-dev/pandas/pull/34266/files#diff-0d7b5a2c72b4dfc11d80afe159d45ff8L153 is merged, since this method is changing. @TomAugspurger might have thoughts.

Author (@kylase, Jun 20, 2020):
I agree that #34266 is a more robust solution, which should land in v1.1.

This PR could be a patch for 1.0.x, as to_parquet with partition_cols specified does not work as designed. Without partition_cols it works as intended, with no need to specify the file system explicitly.

Member:
We originally backported this functionality to 1.0.x, but it was then reverted in #34632, so there is currently no S3 directory functionality on 1.0.x. See the whatsnew: https://pandas.pydata.org/docs/whatsnew/v1.0.5.html#fixed-regressions. So I think this will have to wait for 1.1.

Author:
OK, sounds good. I will update this after #34266 is merged.


        # write_to_dataset does not support a file-like object when
        # a directory path is used, so just pass the path string.
        if partition_cols is not None:
            self.api.parquet.write_to_dataset(
                table,
                path,
                compression=compression,
                filesystem=fs,
Contributor:
What if the user provides a filesystem?

Author:
Thanks for catching this. I realised it after looking at the failing tests.

Perhaps filesystem should be a named keyword argument instead of being taken from kwargs?

I think we also need to update the docs to mention that filesystem is needed if the system can't obtain the underlying credentials from e.g. ~/.aws/credentials.

                partition_cols=partition_cols,
                **kwargs,
            )
        else:
            self.api.parquet.write_table(
-                table, file_obj_or_path, compression=compression, **kwargs
+                table, file_obj_or_path, compression=compression, **kwargs,
            )
        if should_close:
            file_obj_or_path.close()
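The filesystem-resolution logic under discussion can be sketched roughly as follows. This is a simplified standalone sketch, not the pandas implementation: `infer_fs_from_path` and `resolve_filesystem` are hypothetical helpers standing in for pandas' internal `get_fs_for_path` and the `kwargs.pop("filesystem", ...)` line in the diff, and it addresses the reviewer's question by letting a user-supplied filesystem take precedence over inference.

```python
from urllib.parse import urlparse


def infer_fs_from_path(path):
    # Hypothetical stand-in for pandas' internal get_fs_for_path:
    # return an s3fs filesystem for "s3://" URLs, otherwise None
    # (None lets pyarrow fall back to the local filesystem).
    if urlparse(path).scheme == "s3":
        import s3fs  # optional dependency, only needed for S3 paths

        return s3fs.S3FileSystem()
    return None


def resolve_filesystem(path, kwargs):
    # A user-supplied ``filesystem`` kwarg wins over inference, mirroring
    # fs = kwargs.pop("filesystem", get_fs_for_path(path)) in the diff,
    # but evaluating the inference lazily.
    fs = kwargs.pop("filesystem", None)
    return fs if fs is not None else infer_fs_from_path(path)
```

Resolving lazily also avoids importing s3fs (or hitting credential lookup) when the caller has already constructed a filesystem object.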
18 changes: 18 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -568,6 +568,24 @@ def test_s3_roundtrip_for_dir(self, df_compat, s3_resource, pa, partition_col):
            repeat=1,
        )

    @td.skip_if_no("s3fs")
    @pytest.mark.parametrize("partition_col", [["A"], []])
    def test_s3_roundtrip_for_dir_infer_fs(
Member:
This looks identical to https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py - could we parametrize instead here?
        self, df_compat, s3_resource, pa, partition_col
    ):
        expected_df = df_compat.copy()
        if partition_col:
            expected_df[partition_col] = expected_df[partition_col].astype("category")
        check_round_trip(
            df_compat,
            pa,
            expected=expected_df,
            path="s3://pandas-test/parquet_dir",
            write_kwargs={"partition_cols": partition_col, "compression": None},
            check_like=True,
            repeat=1,
        )
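The parametrization the reviewer suggests could look roughly like this. It is a generic sketch, not the actual pandas test suite: `write_kwargs_for` is a hypothetical helper, and the plain loop stands in for stacking a second `@pytest.mark.parametrize` axis onto the existing test rather than duplicating its body.

```python
# Sketch: fold the "explicit filesystem" and "inferred filesystem"
# cases into one test via an extra parametrize axis. With pytest this
# would be driven by @pytest.mark.parametrize("infer_fs", [True, False]).
def write_kwargs_for(partition_col, infer_fs):
    kwargs = {"partition_cols": partition_col, "compression": None}
    if not infer_fs:
        # placeholder for an explicitly constructed filesystem object
        kwargs["filesystem"] = "explicit-fs"
    return kwargs


# One nested loop stands in for the two parametrize axes:
for partition_col in (["A"], []):
    for infer_fs in (True, False):
        kwargs = write_kwargs_for(partition_col, infer_fs)
        # only the explicit-filesystem cases carry a filesystem kwarg
        assert ("filesystem" in kwargs) is (not infer_fs)
```

Collapsing the two tests this way keeps a single round-trip body, so a future change to the write path only needs updating in one place.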

    @tm.network
    @td.skip_if_no("pyarrow")
    def test_parquet_read_from_url(self, df_compat):