
Pass kwargs from read_parquet() to the underlying engines. #18216


Merged

Conversation

criemen

@criemen criemen commented Nov 10, 2017

This allows, e.g., specifying filters for predicate pushdown to fastparquet (see the sketch below).
This is a follow-up to #18155/#18154

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

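A minimal sketch of the pass-through this enables, assuming fastparquet's row_group_offsets write option and its list-of-tuples filters read option; the frame, path, and predicate are illustrative, not taken from this PR:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2]})

# Engine-specific write option: row_group_offsets=1 asks fastparquet to
# start a new row group at every row.
df.to_parquet("example.parquet", engine="fastparquet",
              compression=None, row_group_offsets=1)

# Engine-specific read option: with this change, "filters" is forwarded
# to fastparquet, which can skip whole row groups on read.
result = pd.read_parquet("example.parquet", engine="fastparquet",
                         filters=[("a", "==", 0)])
print(result)
```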
@pep8speaks

pep8speaks commented Nov 10, 2017

Hello @Corni! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 14, 2017 at 13:47 Hours UTC

@criemen
Author

criemen commented Nov 10, 2017

@hoffmann Is this what you intended?

@jreback
Contributor

jreback commented Nov 10, 2017

need tests

@jreback
Contributor

jreback commented Nov 10, 2017

@wesm @cpcloud @martindurant

is there a reasonable API that we could expose here to users to facilitate some sort of row group filtering (w/o resorting to a full query language or multiple functions)?

@jreback jreback added the IO Parquet label Nov 10, 2017
@xhochy
Contributor

xhochy commented Nov 10, 2017

@jreback https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 sounds reasonable. We should be able to implement that in pyarrow quite easily.

@xhochy
Contributor

xhochy commented Nov 10, 2017

@criemen
Author

criemen commented Nov 13, 2017

@jreback How do you imagine tests for this feature that would not rely on testing backend-specific parameters and their behaviour?

@jreback
Contributor

jreback commented Nov 13, 2017

@Corni actually I would have a couple of tests that pass thru kwargs specifically for the backend (IOW separate tests), just to make sure things are passed thru, e.g. write a file with row groups and exercise reading with row groups. (The kwargs for this type of predicate pushdown will hopefully be synchronized at some point, but for now you can have tests for each engine separately.)
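A rough sketch of such a per-engine test, assuming fastparquet accepts row_group_offsets on write and filters on read; the imports mirror the existing parquet test module, and the test actually added in this PR may differ:

```python
import pandas as pd
import pandas.util.testing as tm
from pandas.io.parquet import to_parquet, read_parquet


def test_filter_row_groups_fastparquet():
    # One row group per row, so the read-side predicate can prune groups.
    df = pd.DataFrame({'a': [0, 1, 2]})
    with tm.ensure_clean() as path:
        to_parquet(df, path, 'fastparquet', compression=None,
                   row_group_offsets=1)
        result = read_parquet(path, 'fastparquet',
                              filters=[('a', '==', 0)])
    assert len(result) == 1
```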

@@ -105,7 +105,7 @@ def test_options_py(df_compat, pa):
    with pd.option_context('io.parquet.engine', 'pyarrow'):
        df.to_parquet(path)

        result = read_parquet(path, compression=None)
Contributor

is there a reason you are removing the kw?

Author

Yes, compression only exists for writes: you specify the compression when writing the data, and on read you have to decompress with whatever algorithm was used when the file was written.
Before my patch the kw was silently dropped; now it causes exceptions, because neither backend accepts it on read.


@codecov

codecov bot commented Nov 13, 2017

Codecov Report

Merging #18216 into master will decrease coverage by 0.06%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18216      +/-   ##
==========================================
- Coverage   91.42%   91.36%   -0.07%     
==========================================
  Files         163      164       +1     
  Lines       50068    49880     -188     
==========================================
- Hits        45777    45571     -206     
- Misses       4291     4309      +18
Flag Coverage Δ
#multiple 89.16% <83.33%> (-0.06%) ⬇️
#single 39.42% <33.33%> (-1%) ⬇️
Impacted Files Coverage Δ
pandas/io/parquet.py 65.38% <83.33%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/clipboards.py 24.05% <0%> (-2.54%) ⬇️
pandas/tseries/frequencies.py 94.09% <0%> (-1.92%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.77%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/categorical.py 95.75% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91.14% <0%> (-0.04%) ⬇️
pandas/core/indexes/multi.py 96.38% <0%> (-0.02%) ⬇️
pandas/core/dtypes/concat.py 99.13% <0%> (-0.02%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9050e38...edbd937. Read the comment docs.

@criemen
Author

criemen commented Nov 13, 2017

I have now included a test for filtering row groups with fastparquet, though there is no kwarg in pyarrow yet which I could exercise in a test (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html), so I'm not testing the pyarrow engine yet.

@@ -188,18 +188,21 @@ def check_error_on_write(self, df, engine, exc):
        with tm.ensure_clean() as path:
            to_parquet(df, path, engine, compression=None)

    def check_round_trip(self, df, engine, expected=None, **kwargs):

    def check_round_trip(self, df, engine, expected=None,
Contributor

I would refactor this helper function to have the following signature:

def check_round_trip(self, df, engine, expected=None, write_kwargs=None, read_kwargs=None)

Author

Yeah, this is definitely the way to go.
Tests until now only worked because pyarrow.parquet.write_table ignores extra kwargs and the fastparquet implementation did not pass kwargs through to the write method; otherwise the tests would already have failed on the (read-only) parameter columns.
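A minimal sketch of the refactor suggested above, assuming the helper keeps using tm.ensure_clean and the module-level to_parquet/read_parquet already imported in the test module; the body is illustrative and the helper actually committed may differ:

```python
    def check_round_trip(self, df, engine, expected=None,
                         write_kwargs=None, read_kwargs=None):
        # Keep write-only options (e.g. compression) out of read_parquet
        # and read-only options (e.g. filters) out of to_parquet.
        write_kwargs = write_kwargs or {}
        read_kwargs = read_kwargs or {}
        if expected is None:
            expected = df

        with tm.ensure_clean() as path:
            to_parquet(df, path, engine, **write_kwargs)
            result = read_parquet(path, engine, **read_kwargs)
            tm.assert_frame_equal(result, expected)
```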

@hoffmann
Contributor

@Corni thanks, just the minor test issue; otherwise it's like I intended.

@jreback jreback added this to the 0.21.1 milestone Nov 14, 2017
@jreback
Contributor

jreback commented Nov 14, 2017

lgtm. ping on green.

@wesm @xhochy @martindurant

@martindurant
Contributor

+1

@criemen
Author

criemen commented Nov 14, 2017

ping :)

@jorisvandenbossche
Member

Can you update the docstring for this new functionality?

@jorisvandenbossche
Member

> Can you update the docstring for this new functionality?

Never mind, it already seems to (incorrectly) have been there.

@jorisvandenbossche jorisvandenbossche merged commit ef4e30b into pandas-dev:master Nov 14, 2017
@jorisvandenbossche
Member

@Corni Thanks!

@criemen criemen deleted the read-parquet-improvements branch November 14, 2017 15:43
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
…as-dev#18216)

This allows e.g. to specify filters for predicate pushdown to fastparquet.

(cherry picked from commit ef4e30b)
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
This allows e.g. to specify filters for predicate pushdown to fastparquet.

(cherry picked from commit ef4e30b)