Pass kwargs from read_parquet() to the underlying engines. #18216
Conversation
This allows e.g. to specify filters for predicate pushdown to fastparquet.
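The pass-through this PR implements can be sketched as follows. This is a hypothetical illustration, not pandas' actual implementation; `FakeFastParquetEngine` is a stand-in for the real engine wrapper:

```python
# Hypothetical sketch (not pandas' actual code) of the kwargs
# pass-through this PR adds: read_parquet forwards extra keyword
# arguments to the selected engine instead of silently dropping them.

class FakeFastParquetEngine:
    """Stand-in for the engine wrapper around fastparquet."""
    def read(self, path, columns=None, **kwargs):
        # kwargs (e.g. filters=...) reach the backend untouched
        return {"path": path, "columns": columns, "kwargs": kwargs}

def read_parquet(path, engine=None, columns=None, **kwargs):
    engine = engine or FakeFastParquetEngine()
    return engine.read(path, columns=columns, **kwargs)

result = read_parquet("data.parquet", filters=[("x", ">", 5)])
print(result["kwargs"])  # {'filters': [('x', '>', 5)]}
```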
Hello @Corni! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on November 14, 2017 at 13:47 UTC
@hoffmann Is this what you intended?
need tests
is there a reasonable API that we could expose here to users to facilitate some sort of row group filtering (w/o resorting to a full query language or multiple functions)?
@jreback https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 sounds reasonable. We should be able to implement that in
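The fastparquet API linked above accepts filters as a list of `(column, op, value)` tuples and uses row-group statistics to skip groups. A hypothetical, simplified illustration of that skip logic (the real implementation lives in fastparquet, not here):

```python
# Hypothetical illustration of fastparquet-style filters
# ([(column, op, value), ...]): the engine compares each predicate
# against a row group's min/max statistics and skips row groups
# that provably cannot contain a matching row.
def row_group_may_match(stats, filters):
    """stats maps column -> (min, max) for one row group.
    Return False only when the filters prove no row can match."""
    for col, op, val in filters:
        lo, hi = stats[col]
        if op == ">" and hi <= val:
            return False
        if op == ">=" and hi < val:
            return False
        if op == "<" and lo >= val:
            return False
        if op == "<=" and lo > val:
            return False
        if op == "==" and not (lo <= val <= hi):
            return False
    return True

# A row group whose x values span [0, 4] can be skipped for x > 5:
print(row_group_may_match({"x": (0, 4)}, [("x", ">", 5)]))   # False
print(row_group_may_match({"x": (0, 10)}, [("x", ">", 5)]))  # True
```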
Created https://issues.apache.org/jira/browse/ARROW-1796 and https://issues.apache.org/jira/browse/PARQUET-1158 to do this in |
…er (which only makes sense for writes).
@jreback How do you imagine tests for this feature, which would not rely on testing backend-specific parameters and their behaviour?
@Corni actually I would have a couple of tests that pass thru kwargs specifically for the backend (IOW separate tests), just to make sure things are passed thru. e.g. write a file with row_groups and exercise reading with row_groups. (The kwargs for this type of predicate pushdown hopefully will be synchronized at some point, but for now you can have tests for each engine separately.)
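A minimal sketch of the kind of per-engine pass-through test suggested here, using a fake engine in place of pyarrow/fastparquet (names and kwargs are illustrative only):

```python
# Hypothetical per-engine pass-through test in the spirit of the
# suggestion above: write with one backend-specific kwarg, read
# with another, and assert both reached the (fake) engine.
calls = {}

class FakeEngine:
    def write(self, df, path, **kwargs):
        calls["write"] = kwargs

    def read(self, path, **kwargs):
        calls["read"] = kwargs
        return "df"

def test_kwargs_passed_through():
    eng = FakeEngine()
    eng.write("df", "p.parquet", row_group_offsets=100)
    eng.read("p.parquet", filters=[("a", "==", 1)])
    assert calls["write"] == {"row_group_offsets": 100}
    assert calls["read"] == {"filters": [("a", "==", 1)]}

test_kwargs_passed_through()
```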
@@ -105,7 +105,7 @@ def test_options_py(df_compat, pa):
     with pd.option_context('io.parquet.engine', 'pyarrow'):
         df.to_parquet(path)
         result = read_parquet(path, compression=None)
is there a reason you are removing the kw?
Yes, compression only exists for writes: you specify the compression when writing the data, and on read you have to decompress with whatever algorithm was used when the file was written.
Before my patch, the kw was silently dropped; now it causes exceptions, because neither backend uses it.
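The behaviour change described here can be sketched like this (function names are illustrative stand-ins, not the pandas internals):

```python
# Sketch of the change: before the patch an unsupported read kwarg
# such as compression= vanished silently; once kwargs are forwarded,
# the backend's signature rejects it with a TypeError.
def backend_read(path, columns=None):  # accepts no extra kwargs
    return "df"

def read_parquet_old(path, **kwargs):
    return backend_read(path)           # kwargs silently dropped

def read_parquet_new(path, **kwargs):
    return backend_read(path, **kwargs)  # forwarded to the backend

read_parquet_old("p.parquet", compression=None)  # no error, kw ignored
try:
    read_parquet_new("p.parquet", compression=None)
except TypeError as exc:
    print("rejected:", exc)
```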
Codecov Report
@@ Coverage Diff @@
## master #18216 +/- ##
==========================================
- Coverage 91.42% 91.36% -0.07%
==========================================
Files 163 164 +1
Lines 50068 49884 -184
==========================================
- Hits 45777 45575 -202
- Misses 4291 4309 +18
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #18216 +/- ##
==========================================
- Coverage 91.42% 91.36% -0.07%
==========================================
Files 163 164 +1
Lines 50068 49880 -188
==========================================
- Hits 45777 45571 -206
- Misses 4291 4309 +18
Continue to review full report at Codecov.
I now included a test for filtering row groups with fastparquet, though there is no kwarg in pyarrow yet which I could include in a test (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html), so I'm not testing pyarrow yet.
@@ -188,18 +188,21 @@ def check_error_on_write(self, df, engine, exc):
     with tm.ensure_clean() as path:
         to_parquet(df, path, engine, compression=None)

-    def check_round_trip(self, df, engine, expected=None, **kwargs):
+    def check_round_trip(self, df, engine, expected=None,
I would refactor this helper function to have the following signature:
def check_round_trip(self, df, engine, expected=None, write_kwargs=None, read_kwargs=None)
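A minimal sketch of how that refactor could look, with in-memory stand-ins for `to_parquet`/`read_parquet` (the real helper uses `tm.ensure_clean()` and actual engines):

```python
# Sketch of the suggested refactor: separating write_kwargs from
# read_kwargs lets a test pass write-only options (compression=...)
# independently of read-only ones (filters=...). The parquet I/O
# functions here are fake in-memory stand-ins.
_store = {}

def to_parquet(df, path, engine, **kwargs):
    _store[path] = df

def read_parquet(path, engine, **kwargs):
    return _store[path]

def check_round_trip(df, engine, expected=None,
                     write_kwargs=None, read_kwargs=None):
    write_kwargs = write_kwargs or {}
    read_kwargs = read_kwargs or {}
    path = "tmp.parquet"
    to_parquet(df, path, engine, **write_kwargs)
    result = read_parquet(path, engine, **read_kwargs)
    assert result == (expected if expected is not None else df)

check_round_trip([1, 2, 3], "fastparquet",
                 write_kwargs={"compression": "snappy"})
```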
Yeah, this is definitely the way to go.
Tests until now only worked because pyarrow.parquet.write_table ignores extra kwargs and the fastparquet implementation did not pass kwargs through to the write method; otherwise the tests would have failed already on the (read-only) parameter columns.
@Corni thanks, just the minor test issue; otherwise it's like I intended.
lgtm. ping on green.
+1
ping :)
Can you update the docstring for this new functionality?
No matter, it already seems to (incorrectly) have been there.
@Corni Thanks!
…as-dev#18216) This allows e.g. to specify filters for predicate pushdown to fastparquet. (cherry picked from commit ef4e30b)
This is a followup to #18155/#18154
git diff upstream/master -u -- "*.py" | flake8 --diff