Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
IO: Fix parquet read from s3 directory #33632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IO: Fix parquet read from s3 directory #33632
Changes from 23 commits
aa94fe7
a30c71a
b2747eb
a51757a
968f3b6
789f4ca
040763e
40f5889
753d647
e4dcdc3
bb21431
fb38932
c29befd
4f78fc5
4b2828b
463c2ea
dabfe58
dea95f3
ae76e42
4b48326
211c36e
4897a32
bba4040
5bc6327
ca89c21
0df818e
2a1a85c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you review the doc-strings to see if they need updating (e.g. may need a versionadded tag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet Docs strings indicate we already supported this I think? I updated the whatsnew and added an example in docs strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alimcmaster1
This change breaks clients that pass a file-like object for
path
.ParquetDataset
doesn't provide the same file-like object handling that the originalget_filepath_or_buffer
did.Here's the call stack I'm seeing:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed bug report #34467
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this
filesystem=get_fs_for_path(path)
needed? What happens if you just pass the path? (which I assume has eg as3://..
in it?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrow seems to only allow a file path opposed to a dir path. Removing filesystem arg here throws:
To repo see the test case
test_s3_roundtrip_for_dir
I wrote belowThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK. I see now in pyarrow that apparently string URIs with "s3://..." are not supported (while "hdfs://" is supported). That's something we should fix on the pyarrow side as well. But of course until then this is fine.