IO: Fix parquet read from s3 directory #33632
Changes from all commits
pandas/io/parquet.py

@@ -8,7 +8,12 @@
 from pandas import DataFrame, get_option

-from pandas.io.common import get_filepath_or_buffer, is_gcs_url, is_s3_url
+from pandas.io.common import (
+    get_filepath_or_buffer,
+    get_fs_for_path,
+    is_gcs_url,
+    is_s3_url,
+)


 def get_engine(engine: str) -> "BaseImpl":
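For context, ``get_fs_for_path`` (newly imported above) resolves a filesystem object for a path string so that pyarrow can list a directory on S3. Below is a minimal sketch of that idea, assuming ``s3fs`` is installed; the helper name ``resolve_filesystem`` and its logic are illustrative only, not the actual pandas implementation.

import s3fs


def resolve_filesystem(path):
    # Illustrative sketch: return an s3fs filesystem for "s3://" paths and
    # None otherwise, so pyarrow falls back to the local filesystem.
    if isinstance(path, str) and path.startswith("s3://"):
        return s3fs.S3FileSystem()
    return None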
@@ -92,13 +97,15 @@ def write(
         **kwargs,
     ):
         self.validate_dataframe(df)
-        path, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
+        file_obj_or_path, _, _, should_close = get_filepath_or_buffer(path, mode="wb")

         from_pandas_kwargs: Dict[str, Any] = {"schema": kwargs.pop("schema", None)}
         if index is not None:
             from_pandas_kwargs["preserve_index"] = index

         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
+        # write_to_dataset does not support a file-like object when
+        # a directory path is used, so just pass the path string.
         if partition_cols is not None:
             self.api.parquet.write_to_dataset(
                 table,
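For reference, the partitioned branch above backs calls like the following usage sketch (the bucket name is hypothetical). With ``partition_cols``, pyarrow's ``write_to_dataset`` writes one subdirectory per partition value, which is why a plain path string rather than a file object is passed.

import pandas as pd

df = pd.DataFrame({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Writes s3://my-bucket/partition_dir/part=a/... and .../part=b/...
# (requires the pyarrow engine and s3fs; the bucket name is made up).
df.to_parquet("s3://my-bucket/partition_dir", partition_cols=["part"])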
@@ -108,20 +115,18 @@ def write(
                 **kwargs,
             )
         else:
-            self.api.parquet.write_table(table, path, compression=compression, **kwargs)
+            self.api.parquet.write_table(
+                table, file_obj_or_path, compression=compression, **kwargs
+            )
         if should_close:
-            path.close()
+            file_obj_or_path.close()

     def read(self, path, columns=None, **kwargs):
-        path, _, _, should_close = get_filepath_or_buffer(path)
-
-        kwargs["use_pandas_metadata"] = True
-        result = self.api.parquet.read_table(
-            path, columns=columns, **kwargs
-        ).to_pandas()
-        if should_close:
-            path.close()
-
+        parquet_ds = self.api.parquet.ParquetDataset(
+            path, filesystem=get_fs_for_path(path), **kwargs
+        )
+        kwargs["columns"] = columns
+        result = parquet_ds.read_pandas(**kwargs).to_pandas()
         return result

Review thread on the ``filesystem=get_fs_for_path(path)`` line:

Reviewer: Is this
Author: pyarrow seems to only allow a file path as opposed to a dir path. Removing the filesystem arg here throws an error; to repro, see the test case.
Reviewer: Ah, OK. I see now in pyarrow that apparently string URIs with "s3://..." are not supported (while "hdfs://" is supported). That's something we should fix on the pyarrow side as well. But of course until then this is fine.
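The rewritten ``read`` above is what a directory read exercises end to end. A usage sketch follows (the bucket and prefix are hypothetical):

import pandas as pd

# ParquetDataset discovers all parquet files under the directory and
# read_pandas() concatenates them; an s3fs filesystem is supplied via
# get_fs_for_path because, per the review thread above, pyarrow does not
# accept "s3://" string URIs directly here.
df = pd.read_parquet("s3://my-bucket/partition_dir")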
@@ -273,7 +278,7 @@ def read_parquet(path, engine: str = "auto", columns=None, **kwargs):
     A file URL can also be a path to a directory that contains multiple
     partitioned parquet files. Both pyarrow and fastparquet support
     paths to directories as well as file URLs. A directory path could be:
-    ``file://localhost/path/to/tables``
+    ``file://localhost/path/to/tables`` or ``s3://bucket/partition_dir``

     If you want to pass in a path object, pandas accepts any
     ``os.PathLike``.
Review comment:

@alimcmaster1 This change breaks clients that pass a file-like object for ``path``. ``ParquetDataset`` doesn't provide the same file-like object handling that the original ``get_filepath_or_buffer`` did. Here's the call stack I'm seeing:

I filed bug report #34467.
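A minimal sketch of the kind of call the comment above describes as broken: ``read_parquet`` with an in-memory buffer instead of a path. Whether it actually raises depends on the pandas and pyarrow versions involved.

import io

import pandas as pd

buf = io.BytesIO()
pd.DataFrame({"a": [1, 2]}).to_parquet(buf)
buf.seek(0)

# With this change, the buffer is handed to ParquetDataset, which is
# reported above not to handle file-like objects the way
# get_filepath_or_buffer did, hence bug report #34467.
df = pd.read_parquet(buf)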