Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow #34842
```diff
@@ -104,19 +104,23 @@ def write(
         from_pandas_kwargs["preserve_index"] = index

         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
+
+        fs = kwargs.pop("filesystem", get_fs_for_path(path))
+
         # write_to_dataset does not support a file-like object when
         # a directory path is used, so just pass the path string.
         if partition_cols is not None:
             self.api.parquet.write_to_dataset(
                 table,
                 path,
                 compression=compression,
+                filesystem=fs,
                 partition_cols=partition_cols,
                 **kwargs,
             )
         else:
             self.api.parquet.write_table(
-                table, file_obj_or_path, compression=compression, **kwargs
+                table, file_obj_or_path, compression=compression, **kwargs,
             )
         if should_close:
             file_obj_or_path.close()
```

**Review comment** (on `filesystem=fs`): What if the user provides a …

**Reply:** Thanks for catching this. I realised this after looking at the tests that failed. I think we need to update the doc to mention that …
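For context on the inference itself: `get_fs_for_path` dispatches on the URL scheme of the path. The sketch below shows the general idea of scheme-based dispatch; it is an illustration only, not pandas' actual implementation, which returns a filesystem object (e.g. an `s3fs.S3FileSystem`) rather than a string.

```python
from urllib.parse import urlparse


def infer_fs_scheme(path):
    """Return a remote filesystem scheme for ``path``, or None for local paths.

    Illustrative sketch: pandas' ``get_fs_for_path`` returns a filesystem
    object for remote schemes, not the scheme string itself.
    """
    scheme = urlparse(str(path)).scheme
    # Windows drive letters ("C:\\data") parse as a one-character scheme,
    # so treat those, like an empty scheme, as local paths.
    if len(scheme) <= 1:
        return None
    return scheme


print(infer_fs_scheme("s3://pandas-test/parquet_dir"))  # s3
print(infer_fs_scheme("data/local.parquet"))            # None
```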
```diff
@@ -568,6 +568,24 @@ def test_s3_roundtrip_for_dir(self, df_compat, s3_resource, pa, partition_col):
             repeat=1,
         )

+    @td.skip_if_no("s3fs")
+    @pytest.mark.parametrize("partition_col", [["A"], []])
+    def test_s3_roundtrip_for_dir_infer_fs(
+        self, df_compat, s3_resource, pa, partition_col
+    ):
+        expected_df = df_compat.copy()
+        if partition_col:
+            expected_df[partition_col] = expected_df[partition_col].astype("category")
+        check_round_trip(
+            df_compat,
+            pa,
+            expected=expected_df,
+            path="s3://pandas-test/parquet_dir",
+            write_kwargs={"partition_cols": partition_col, "compression": None},
+            check_like=True,
+            repeat=1,
+        )
+
     @tm.network
     @td.skip_if_no("pyarrow")
     def test_parquet_read_from_url(self, df_compat):
```

**Review comment** (on `test_s3_roundtrip_for_dir_infer_fs`): This looks identical to https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py - could we parametrize instead here?
**Review comment:** Perhaps we should wait till https://github.com/pandas-dev/pandas/pull/34266/files#diff-0d7b5a2c72b4dfc11d80afe159d45ff8L153 is merged, since this method is changing. @TomAugspurger might have thoughts.

**Reply:** I agree that #34266 is a more robust solution, which should be in v1.1? This PR could be a patch for 1.0.x, as `to_parquet` with `partition_cols` specified is not working as designed; without `partition_cols` it works as intended, with no need to specify the file system explicitly.

**Reply:** We originally backported this functionality to 1.0.x, but it was then reverted in #34632, so there is currently no S3 directory functionality on 1.0.x. See whatsnew: https://pandas.pydata.org/docs/whatsnew/v1.0.5.html#fixed-regressions. So I think this will have to wait for 1.1.

**Reply:** Ok, sounds good. I will alter this after #34266 is merged.