restrict columns to read for pandas.read_parquet #18155
pandas/tests/io/test_parquet.py
Outdated
@@ -282,6 +282,17 @@ def test_compression(self, engine, compression):
        df = pd.DataFrame({'A': [1, 2, 3]})
        self.check_round_trip(df, engine, compression=compression)

    def test_read_columns(self, engine, fp):
        df = pd.DataFrame({'string': list('abc'),
Reference issue number above.
pandas/io/parquet.py
Outdated
@@ -188,6 +188,8 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
    columns: list
Write out the default too, i.e. "list, default None".
Will need a |
Hello @hoffmann! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on November 08, 2017 at 13:04 Hours UTC
Some doc comments; otherwise LGTM.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -109,7 +109,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
-
- :func:`read_parquet` now allows to specify the columns to read from a parquet file (:issue:`18154`)
You can put this in 0.21.1.
@@ -188,6 +188,8 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
    columns: list, default=None
Add a versionadded tag.
done
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -109,7 +109,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
Can you add a small example to the docs in io.rst as well?
Done 21c5f5e
Codecov Report
@@ Coverage Diff @@
## master #18155 +/- ##
==========================================
+ Coverage 91.41% 91.41% +<.01%
==========================================
Files 163 163
Lines 50132 50132
==========================================
+ Hits 45827 45830 +3
+ Misses 4305 4302 -3
Codecov Report
@@ Coverage Diff @@
## master #18155 +/- ##
==========================================
- Coverage 91.41% 91.4% -0.01%
==========================================
Files 163 163
Lines 50132 50068 -64
==========================================
- Hits 45827 45767 -60
+ Misses 4305 4301 -4
pandas/io/parquet.py
Outdated
         path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns).to_pandas()
pass columns as a kwarg to read_table and to_pandas
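To illustrate why the keyword form is the safer choice, here is a minimal sketch. `fake_read_table` and its parameter order are hypothetical and are not pyarrow's real signature; the point is only that a positional argument depends on the callee's parameter order, while a keyword argument does not.

```python
def fake_read_table(source, nthreads=1, columns=None):
    """Hypothetical reader; the parameter order is an assumption for illustration."""
    table = {"string": ["a", "b", "c"], "int": [1, 2, 3]}
    if columns is None:
        return table
    return {name: table[name] for name in columns}


# Positional: the list silently binds to `nthreads`, so every column comes back.
positional = fake_read_table("example.parquet", ["string"])

# Keyword: the intent is explicit and does not depend on parameter order.
by_keyword = fake_read_table("example.parquet", columns=["string"])

print(sorted(positional))
print(sorted(by_keyword))
```

With the keyword form, a change in the engine's parameter order cannot silently turn the column selection into a no-op.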
pandas/io/parquet.py
Outdated
@@ -188,6 +188,9 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.
        .. versionadded 0.21.1
I think you need a blank line before the versionadded tag.
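As a sketch of what that might look like, the stub below carries a docstring with a blank line before the directive. It also uses the double-colon directive syntax that Sphinx expects (`.. versionadded:: 0.21.1`). The name `read_parquet_stub` is hypothetical and exists only to hold the docstring; this is not necessarily the text that was merged.

```python
def read_parquet_stub(path, engine="auto", columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : string
        File path
    columns : list, default None
        If not None, only these columns will be read from the file.

        .. versionadded:: 0.21.1
    """
    # Documentation-only stub: the body is intentionally not implemented.
    raise NotImplementedError("documentation-only stub")
```

Without the blank line, Sphinx treats the directive text as a continuation of the preceding paragraph rather than as a directive.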
pandas/io/parquet.py
Outdated
@@ -201,4 +204,4 @@ def read_parquet(path, engine='auto', **kwargs):
     """

     impl = get_engine(engine)
-    return impl.read(path)
+    return impl.read(path, columns)
same
pandas/tests/io/test_parquet.py
Outdated
        df = pd.DataFrame({'string': list('abc'),
                           'int': list(range(1, 4))})

        with tm.ensure_clean() as path:
You don’t need the fp argument here: engine cycles through both engines.
Use check_round_trip; pass in the expected result (and the columns kwarg).
See comments.
You have a linting issue.
         path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns=columns).to_pandas()
I’d like to pass through kwargs as well; these won’t be specifically named args, just passed through to the engine to validate,
for both fp and pyarrow.
The test could just be a simple one with row_groups.
OK, I think it is good to pass explicit options like columns, which are supported by both backends, and also to pass the kwargs through so that additional engine-specific kwargs can be provided.
I’ll have to look at the test case.
OK, that’s fine.
I really want row_group support :) (next PR!)
Also, if you want: #17102
Ping on green.
@jreback green. If it's OK, I'd like to make the change to accept **kwargs in the read() function in a separate pull request, because it will require rewriting https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py#L191 to handle **kwargs for to_parquet and read_parquet at the same time.
Yep, OK by me.
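The design discussed in this exchange (an explicit `columns` option shared by both backends, plus `**kwargs` forwarded to the engine for validation) can be sketched roughly as follows. `FakeBackend`, `EngineWrapper`, and `read_parquet_sketch` are hypothetical names; `FakeBackend` is a plain-Python stand-in and does not reproduce the pyarrow or fastparquet API.

```python
class FakeBackend:
    """Hypothetical engine API; stores two 'row groups' as column dicts."""

    _row_groups = [
        {"string": ["a", "b"], "int": [1, 2]},
        {"string": ["c"], "int": [3]},
    ]

    def read_table(self, path, columns=None, row_groups=None):
        # Select the requested row groups (all of them by default).
        if row_groups is None:
            groups = self._row_groups
        else:
            groups = [self._row_groups[i] for i in row_groups]
        # Select the requested columns (all of them by default).
        names = columns if columns is not None else list(self._row_groups[0])
        return {name: [v for g in groups for v in g[name]] for name in names}


class EngineWrapper:
    """Thin wrapper: `columns` is explicit, everything else is forwarded."""

    def __init__(self, api):
        self.api = api

    def read(self, path, columns=None, **kwargs):
        # The backend validates the extra keyword arguments itself.
        return self.api.read_table(path, columns=columns, **kwargs)


def read_parquet_sketch(path, engine, columns=None, **kwargs):
    return engine.read(path, columns=columns, **kwargs)


engine = EngineWrapper(FakeBackend())
only_int = read_parquet_sketch("ignored.parquet", engine, columns=["int"])
last_group = read_parquet_sketch(
    "ignored.parquet", engine, columns=["int"], row_groups=[1]
)
print(only_int)
print(last_group)
```

Because the wrapper does not name engine-specific options, an unsupported keyword surfaces as an error from the backend itself (here, a `TypeError` from the hypothetical `read_table`).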
@@ -4538,6 +4538,16 @@ Read from a parquet file.
result.dtypes
Read only certain columns of a parquet file.

.. ipython:: python
In the next PR, can you add a versionadded tag here (for 0.21.1)?
Thanks!
(cherry picked from commit 5128fe6)
git diff upstream/master -u -- "*.py" | flake8 --diff