
Pass kwargs from read_parquet() to the underlying engines. #18216


Merged

Conversation

criemen

@criemen criemen commented Nov 10, 2017

This allows, e.g., specifying filters for predicate pushdown to fastparquet (see the sketch below).
This is a follow-up to #18155/#18154

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

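A minimal sketch of the pass-through this enables, assuming fastparquet's row_group_offsets write option and its list-of-tuples filters read option; the frame, path, and predicate are illustrative, not taken from this PR:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2]})

# Engine-specific write option: row_group_offsets=1 asks fastparquet to
# start a new row group at every row.
df.to_parquet("example.parquet", engine="fastparquet",
              compression=None, row_group_offsets=1)

# Engine-specific read option: with this change, "filters" is forwarded
# to fastparquet, which can skip whole row groups on read.
result = pd.read_parquet("example.parquet", engine="fastparquet",
                         filters=[("a", "==", 0)])
print(result)
```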
@pep8speaks

pep8speaks commented Nov 10, 2017

Hello @Corni! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 14, 2017 at 13:47 Hours UTC

@criemen
Author

criemen commented Nov 10, 2017

@hoffmann Is this what you intended?

@jreback
Contributor

jreback commented Nov 10, 2017

need tests

@jreback
Contributor

jreback commented Nov 10, 2017

@wesm @cpcloud @martindurant

is there a reasonable API that we could expose here to users to facilitate some sort of row group filtering (w/o resorting to a full query language or multiple functions)?

@jreback jreback added the IO Parquet label Nov 10, 2017
@xhochy
Contributor

xhochy commented Nov 10, 2017

@jreback https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 sounds reasonable. We should be able to implement that in pyarrow quite easily.

@xhochy
Contributor

xhochy commented Nov 10, 2017

@criemen
Author

criemen commented Nov 13, 2017

@jreback How do you imagine tests for this feature that would not rely on testing backend-specific parameters and their behaviour?

@jreback
Contributor

jreback commented Nov 13, 2017

@Corni actually I would have a couple of tests that pass thru kwargs specifically for the backend (IOW separate tests), just to make sure things are passed thru, e.g. write a file with row groups and exercise reading with row groups. (The kwargs for this type of predicate pushdown will hopefully be synchronized at some point, but for now you can have tests for each engine separately.)
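A rough sketch of such a per-engine test, assuming fastparquet accepts row_group_offsets on write and filters on read; the imports mirror the existing parquet test module, and the test actually added in this PR may differ:

```python
import pandas as pd
import pandas.util.testing as tm
from pandas.io.parquet import to_parquet, read_parquet


def test_filter_row_groups_fastparquet():
    # One row group per row, so the read-side predicate can prune groups.
    df = pd.DataFrame({'a': [0, 1, 2]})
    with tm.ensure_clean() as path:
        to_parquet(df, path, 'fastparquet', compression=None,
                   row_group_offsets=1)
        result = read_parquet(path, 'fastparquet',
                              filters=[('a', '==', 0)])
    assert len(result) == 1
```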

@@ -105,7 +105,7 @@ def test_options_py(df_compat, pa):
    with pd.option_context('io.parquet.engine', 'pyarrow'):
        df.to_parquet(path)

        result = read_parquet(path, compression=None)
Contributor

is there a reason you are removing the kw?

Author

Yes, compression only exists for writes: you specify the compression when writing the data, and on read you have to decompress with whatever algorithm was used when the file was written.
Before my patch the kw was silently dropped; now it causes exceptions, because neither backend accepts it on read.


@codecov

codecov bot commented Nov 13, 2017

Codecov Report

Merging #18216 into master will decrease coverage by 0.06%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18216      +/-   ##
==========================================
- Coverage   91.42%   91.36%   -0.07%     
==========================================
  Files         163      164       +1     
  Lines       50068    49880     -188     
==========================================
- Hits        45777    45571     -206     
- Misses       4291     4309      +18
Flag Coverage Δ
#multiple 89.16% <83.33%> (-0.06%) ⬇️
#single 39.42% <33.33%> (-1%) ⬇️
Impacted Files Coverage Δ
pandas/io/parquet.py 65.38% <83.33%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/clipboards.py 24.05% <0%> (-2.54%) ⬇️
pandas/tseries/frequencies.py 94.09% <0%> (-1.92%) ⬇️
pandas/plotting/_converter.py 63.44% <0%> (-1.77%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/categorical.py 95.75% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91.14% <0%> (-0.04%) ⬇️
pandas/core/indexes/multi.py 96.38% <0%> (-0.02%) ⬇️
pandas/core/dtypes/concat.py 99.13% <0%> (-0.02%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9050e38...edbd937. Read the comment docs.

@criemen
Author

criemen commented Nov 13, 2017

I have now included a test for filtering row groups with fastparquet, though there is no kwarg in pyarrow yet which I could exercise in a test (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html), so I'm not testing the pyarrow engine yet.

@@ -188,18 +188,21 @@ def check_error_on_write(self, df, engine, exc):
        with tm.ensure_clean() as path:
            to_parquet(df, path, engine, compression=None)

    def check_round_trip(self, df, engine, expected=None, **kwargs):

    def check_round_trip(self, df, engine, expected=None,
Contributor

I would refactor this helper function to have the following signature:

def check_round_trip(self, df, engine, expected=None, write_kwargs=None, read_kwargs=None)

Author

Yeah, this is definitely the way to go.
Tests until now only worked because pyarrow.parquet.write_table ignores extra kwargs and the fastparquet implementation did not pass kwargs through to the write method; otherwise the tests would already have failed on the (read-only) parameter columns.
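A minimal sketch of the refactor suggested above, assuming the helper keeps using tm.ensure_clean and the module-level to_parquet/read_parquet already imported in the test module; the body is illustrative and the helper actually committed may differ:

```python
    def check_round_trip(self, df, engine, expected=None,
                         write_kwargs=None, read_kwargs=None):
        # Keep write-only options (e.g. compression) out of read_parquet
        # and read-only options (e.g. filters) out of to_parquet.
        write_kwargs = write_kwargs or {}
        read_kwargs = read_kwargs or {}
        if expected is None:
            expected = df

        with tm.ensure_clean() as path:
            to_parquet(df, path, engine, **write_kwargs)
            result = read_parquet(path, engine, **read_kwargs)
            tm.assert_frame_equal(result, expected)
```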

@hoffmann
Contributor

@Corni thanks, just the minor test issue; otherwise it's like I intended.

@jreback jreback added this to the 0.21.1 milestone Nov 14, 2017
@jreback
Contributor

jreback commented Nov 14, 2017

lgtm. ping on green.

@wesm @xhochy @martindurant

@martindurant
Contributor

+1

@criemen
Author

criemen commented Nov 14, 2017

ping :)

@jorisvandenbossche
Member

Can you update the docstring for this new functionality?

@jorisvandenbossche
Member

> Can you update the docstring for this new functionality?

Never mind, it already seems to (incorrectly) have been there.

@jorisvandenbossche jorisvandenbossche merged commit ef4e30b into pandas-dev:master Nov 14, 2017
@jorisvandenbossche
Member

@Corni Thanks!

@criemen criemen deleted the read-parquet-improvements branch November 14, 2017 15:43
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
…as-dev#18216)

This allows e.g. to specify filters for predicate pushdown to fastparquet.

(cherry picked from commit ef4e30b)
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
This allows e.g. to specify filters for predicate pushdown to fastparquet.

(cherry picked from commit ef4e30b)