restrict columns to read for pandas.read_parquet #18155

hoffmann · 2017-11-07T20:09:57Z

[ x ] closes #18154 (Enable to restrict columns for pandas.read_parquet)
[ x ] tests added / passed
[ x ] passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

gfyoung · 2017-11-07T20:14:22Z

pandas/tests/io/test_parquet.py

@@ -282,6 +282,17 @@ def test_compression(self, engine, compression):
        df = pd.DataFrame({'A': [1, 2, 3]})
        self.check_round_trip(df, engine, compression=compression)

+    def test_read_columns(self, engine, fp):
+        df = pd.DataFrame({'string': list('abc'),


Reference issue number above.

gfyoung · 2017-11-07T20:14:51Z

pandas/io/parquet.py

@@ -188,6 +188,8 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
+    columns: list


Write out the default too, i.e. "list, default None".

gfyoung · 2017-11-07T20:15:21Z

Will need a whatsnew note in 0.22.0

pep8speaks · 2017-11-07T20:21:51Z

Hello @hoffmann! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 08, 2017 at 13:04 Hours UTC

jreback

Doc comments; otherwise LGTM.

jreback · 2017-11-07T20:55:08Z

doc/source/whatsnew/v0.22.0.txt

@@ -109,7 +109,7 @@ I/O
 ^^^

 - :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
-
+- :func:`read_parquet` now allows to specify the columns to read from a parquet file (:issue:`18154`)


you can put this on 0.21.1

jreback · 2017-11-07T20:55:22Z

pandas/io/parquet.py

@@ -188,6 +188,8 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
+    columns: list, default=None


add a version added tag

jreback · 2017-11-07T20:56:21Z

doc/source/whatsnew/v0.22.0.txt

@@ -109,7 +109,7 @@ I/O
 ^^^

 - :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)


can you add a small example in the docs in io.rst as well.

Done 21c5f5e

codecov · 2017-11-08T01:12:41Z

Codecov Report

Merging #18155 into master will increase coverage by <.01%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master   #18155      +/-   ##
==========================================
+ Coverage   91.41%   91.41%   +<.01%     
==========================================
  Files         163      163              
  Lines       50132    50132              
==========================================
+ Hits        45827    45830       +3     
+ Misses       4305     4302       -3

Flag	Coverage Δ
#multiple	`89.23% <83.33%> (+0.02%)`	⬆️
#single	`40.33% <50%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parquet.py	`65.38% <83.33%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.8% <0%> (-0.1%)`	⬇️
pandas/plotting/_converter.py	`65.2% <0%> (+1.81%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 488db6f...21c5f5e.

codecov · 2017-11-08T01:12:47Z

Codecov Report

Merging #18155 into master will decrease coverage by <.01%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master   #18155      +/-   ##
==========================================
- Coverage   91.41%    91.4%   -0.01%     
==========================================
  Files         163      163              
  Lines       50132    50068      -64     
==========================================
- Hits        45827    45767      -60     
+ Misses       4305     4301       -4

Flag	Coverage Δ
#multiple	`89.21% <83.33%> (+0.01%)`	⬆️
#single	`40.33% <50%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parquet.py	`65.38% <83.33%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.8% <0%> (-0.1%)`	⬇️
pandas/tseries/offsets.py	`97.11% <0%> (-0.05%)`	⬇️
pandas/core/indexes/datetimes.py	`95.48% <0%> (-0.04%)`	⬇️
pandas/core/indexes/timedeltas.py	`91.17% <0%> (-0.02%)`	⬇️
pandas/core/nanops.py	`96.67% <0%> (ø)`	⬆️
pandas/core/indexes/datetimelike.py	`97.11% <0%> (ø)`	⬆️
pandas/core/indexes/period.py	`92.89% <0%> (+0.01%)`	⬆️
pandas/core/tools/timedeltas.py	`98.41% <0%> (+0.02%)`	⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 488db6f...4b22c88.

jreback · 2017-11-08T02:24:02Z

pandas/io/parquet.py

        path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns).to_pandas()


pass columns as a kwarg to read_table and to_pandas

jreback · 2017-11-08T02:24:26Z

pandas/io/parquet.py

@@ -188,6 +188,9 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
+    columns: list, default=None
+        If not None, only these columns will be read from the file.
+        .. versionadded 0.21.1


I think you need a blank line before the versionadded tag.
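To make the spacing concrete, here is a minimal sketch of the docstring entry with a blank line separating the description from the directive. The stub body and the short summary line are placeholders rather than the code in this PR, and the directive is written in Sphinx's `.. versionadded:: <version>` form, with the double colon.

```python
def read_parquet(path, engine='auto', columns=None):
    """
    Load a parquet object from the file path.

    Parameters
    ----------
    path : string
        File path
    columns : list, default None
        If not None, only these columns will be read from the file.

        .. versionadded:: 0.21.1
    """
    # Placeholder body: this sketch only illustrates the docstring layout.
    raise NotImplementedError


# Locate the directive and confirm that the line before it is blank.
doc_lines = read_parquet.__doc__.splitlines()
tag_index = next(i for i, line in enumerate(doc_lines)
                 if 'versionadded' in line)
blank_before_tag = doc_lines[tag_index - 1].strip() == ''
```

Without the blank line, the directive would most likely be folded into the preceding description paragraph instead of being rendered as a separate note.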

jreback · 2017-11-08T02:24:36Z

pandas/io/parquet.py

@@ -201,4 +204,4 @@ def read_parquet(path, engine='auto', **kwargs):
    """

    impl = get_engine(engine)
-    return impl.read(path)
+    return impl.read(path, columns)


jreback · 2017-11-08T02:29:16Z

pandas/tests/io/test_parquet.py

+        df = pd.DataFrame({'string': list('abc'),
+                           'int': list(range(1, 4))})
+
+        with tm.ensure_clean() as path:


You don't need the fp argument here: engine cycles through both engines.
Use check_round_trip; pass in the expected frame (and the columns kwarg).
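A dependency-free sketch of the round-trip pattern suggested here: write the data, read it back with extra read options, and compare against an explicit expected value. The names `write_table`, `read_table`, and this `check_round_trip` are hypothetical stand-ins that use plain dicts; they are not the helpers in pandas/tests/io/test_parquet.py.

```python
def write_table(data, store):
    """Hypothetical writer: keep a copy of every column in an in-memory store."""
    store['table'] = {name: list(values) for name, values in data.items()}


def read_table(store, columns=None):
    """Hypothetical reader: return all columns, or only the requested ones."""
    table = store['table']
    if columns is None:
        columns = list(table)
    return {name: list(table[name]) for name in columns}


def check_round_trip(data, expected=None, **read_kwargs):
    """Write ``data``, read it back, and compare the result with ``expected``.

    When ``expected`` is None, the round trip must reproduce ``data`` itself.
    """
    if expected is None:
        expected = data
    store = {}
    write_table(data, store)
    result = read_table(store, **read_kwargs)
    assert result == expected, (result, expected)
    return result


data = {'string': list('abc'), 'int': list(range(1, 4))}

# Full round trip: everything that was written comes back.
full = check_round_trip(data)

# Restricted read: only the 'string' column is expected back.
subset = check_round_trip(data, expected={'string': list('abc')},
                          columns=['string'])
```

Passing `expected` separately is what lets a single helper cover both the identity case and the case where the read deliberately returns less than was written.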

jreback

see comments

jreback · 2017-11-08T10:53:33Z

you have a linting issue

jreback · 2017-11-08T11:53:36Z

pandas/io/parquet.py

        path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns=columns).to_pandas()


I'd like to pass through kwargs as well; these won't be specific named args, just passed through to the engine to validate, for both fastparquet and pyarrow.
The test could just be a simple one with row_groups.

OK, I think it is good to pass explicit options like columns, which are supported by both backends, and also to pass the kwargs through so that additional engine-specific options can be provided.

Have to look at the test case.

ok that’s fine
really want row_group support :) (next PR!)
also if u want: #17102
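The design agreed on in this thread, an explicit `columns` argument plus pass-through of engine-specific options, might look roughly like the sketch below. The two reader functions and their extra parameters (`nthreads`, `categories`) are hypothetical stand-ins chosen for illustration; the real engines define their own options. Unknown keywords are rejected by the engine function itself, so the wrapper needs no validation of its own.

```python
SAMPLE = {'string': ['a', 'b', 'c'], 'int': [1, 2, 3]}


def _select(columns):
    """Return the sample data restricted to ``columns`` (all columns if None)."""
    names = list(SAMPLE) if columns is None else columns
    return {name: SAMPLE[name] for name in names}


def read_engine_a(path, columns=None, nthreads=1):
    """Hypothetical engine A; ``nthreads`` is an illustrative option."""
    return _select(columns)


def read_engine_b(path, columns=None, categories=None):
    """Hypothetical engine B; ``categories`` is an illustrative option."""
    return _select(columns)


ENGINES = {'a': read_engine_a, 'b': read_engine_b}


def read_parquet(path, engine, columns=None, **kwargs):
    """Forward ``columns`` explicitly and every other keyword untouched."""
    impl = ENGINES[engine]
    # The engine validates kwargs: an unsupported name raises TypeError there.
    return impl(path, columns=columns, **kwargs)


# The shared option works with either engine.
only_string = read_parquet('example.parquet', engine='a', columns=['string'])

# An engine-specific option is accepted by the engine that defines it...
with_option = read_parquet('example.parquet', engine='a', nthreads=2)

# ...and rejected by the one that does not.
try:
    read_parquet('example.parquet', engine='b', nthreads=2)
    rejected = False
except TypeError:
    rejected = True
```

Keeping `columns` as a named parameter documents the option both backends share, while `**kwargs` avoids having to mirror each engine's full signature in the wrapper.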

jreback · 2017-11-08T12:05:02Z

ping on green

hoffmann · 2017-11-08T15:29:32Z

@jreback green.

If it's OK, I'd like to make the change to accept **kwargs in the read() function in a separate pull request, because it will require rewriting https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py#L191 to handle **kwargs for to_parquet and read_parquet at the same time.

jreback · 2017-11-08T20:10:29Z

If it's OK, I'd like to make the change to accept **kwargs in the read() function in a separate pull request, because it will require rewriting https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py#L191 to handle **kwargs for to_parquet and read_parquet at the same time.

yep ok by me

jreback · 2017-11-08T20:11:08Z

doc/source/io.rst

@@ -4538,6 +4538,16 @@ Read from a parquet file.

   result.dtypes

+Read only certain columns of a parquet file. 
+
+.. ipython:: python


in next PR, can you add a version added tag here (for 0.21.1)

jreback · 2017-11-08T20:11:35Z

thanks!

(cherry picked from commit 5128fe6)

hoffmann added 2 commits November 7, 2017 21:00

implement to read only columns from parquet file

8c247c2

fix flake8

d00d222

gfyoung added the Enhancement, IO CSV, and IO Parquet labels and removed the IO CSV label Nov 7, 2017

gfyoung reviewed Nov 7, 2017

reference issue in tests, clarify default in docstring

c1449f5

hoffmann added 2 commits November 7, 2017 21:23

fix pep8

22663e8

add whatsnew entry

f31e6a2

jreback requested changes Nov 7, 2017

hoffmann added 2 commits November 7, 2017 22:08

add feature to version v0.21.1

f91f5f8

add documentation how to read columns from parquet file

21c5f5e

jreback reviewed Nov 8, 2017

jreback requested changes Nov 8, 2017

hoffmann added 4 commits November 8, 2017 08:49

use keyword argument to pass columns

ef30f39

use check_round_tip in tests

54fc1c9

pep8

e5336b6

add newline before versionadded

d6baa9d

fix lint problem

7f6e7f6

jreback requested changes Nov 8, 2017

jreback added this to the 0.21.1 milestone Nov 8, 2017

jreback approved these changes Nov 8, 2017

fix lint error

4b22c88

jreback reviewed Nov 8, 2017

jreback merged commit 5128fe6 into pandas-dev:master Nov 8, 2017

watercrossing pushed a commit to watercrossing/pandas that referenced this pull request Nov 10, 2017

restrict columns to read for pandas.read_parquet (pandas-dev#18155)

164f032

criemen mentioned this pull request Nov 10, 2017

Pass kwargs from read_parquet() to the underlying engines. #18216

Merged

4 tasks

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

restrict columns to read for pandas.read_parquet (pandas-dev#18155)

5943291

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017

restrict columns to read for pandas.read_parquet (pandas-dev#18155)

f7d0b9f

(cherry picked from commit 5128fe6)

TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017

restrict columns to read for pandas.read_parquet (#18155)

50ff9e3

(cherry picked from commit 5128fe6)
