restrict columns to read for pandas.read_parquet #18155

Merged
merged 13 commits on Nov 8, 2017
10 changes: 10 additions & 0 deletions doc/source/io.rst
@@ -4538,6 +4538,16 @@ Read from a parquet file.

    result.dtypes

Read only certain columns of a parquet file.

.. ipython:: python
Contributor: in next PR, can you add a version added tag here (for 0.21.1)

    result = pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
    result = pd.read_parquet('example_fp.parquet', engine='fastparquet', columns=['a', 'b'])

    result.dtypes
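
For reference, the tag being requested would presumably sit just above this example in the rST source, along these lines:

    Read only certain columns of a parquet file.

    .. versionadded:: 0.21.1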


.. ipython:: python
    :suppress:

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.1.txt
@@ -81,6 +81,7 @@ I/O
- Bug in :func:`read_csv` when reading a compressed UTF-16 encoded file (:issue:`18071`)
- Bug in :func:`read_csv` for handling null values in index columns when specifying ``na_filter=False`` (:issue:`5239`)
- Bug in :meth:`DataFrame.to_csv` when the table had ``MultiIndex`` columns, and a list of strings was passed in for ``header`` (:issue:`5539`)
- :func:`read_parquet` now allows specifying which columns to read from a parquet file (:issue:`18154`)

Plotting
^^^^^^^^
15 changes: 9 additions & 6 deletions pandas/io/parquet.py
@@ -76,9 +76,9 @@ def write(self, df, path, compression='snappy',
            table, path, compression=compression,
            coerce_timestamps=coerce_timestamps, **kwargs)

-    def read(self, path):
+    def read(self, path, columns=None):
        path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns).to_pandas()
Contributor: pass columns as a kwarg to read_table and to_pandas

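A sketch of the change being requested, with columns passed by keyword; the **kwargs forwarding is an assumption beyond this diff:

    def read(self, path, columns=None, **kwargs):
        # columns passed by keyword, per the review; forwarding **kwargs
        # to pyarrow's read_table is an assumption, not part of the diff
        path, _, _ = get_filepath_or_buffer(path)
        return self.api.parquet.read_table(path, columns=columns,
                                           **kwargs).to_pandas()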

class FastParquetImpl(object):
@@ -115,9 +115,9 @@ def write(self, df, path, compression='snappy', **kwargs):
        self.api.write(path, df,
                       compression=compression, **kwargs)

-    def read(self, path):
+    def read(self, path, columns=None):
        path, _, _ = get_filepath_or_buffer(path)
-        return self.api.ParquetFile(path).to_pandas()
+        return self.api.ParquetFile(path).to_pandas(columns)

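The same keyword style would presumably apply on the fastparquet side, since ParquetFile.to_pandas also accepts a columns keyword (a sketch, not part of the diff):

    def read(self, path, columns=None, **kwargs):
        path, _, _ = get_filepath_or_buffer(path)
        # columns by keyword; forwarding **kwargs to to_pandas is assumed
        return self.api.ParquetFile(path).to_pandas(columns=columns, **kwargs)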

def to_parquet(df, path, engine='auto', compression='snappy', **kwargs):
@@ -178,7 +178,7 @@ def to_parquet(df, path, engine='auto', compression='snappy', **kwargs):
    return impl.write(df, path, compression=compression)


-def read_parquet(path, engine='auto', **kwargs):
+def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.

@@ -188,6 +188,9 @@ def read_parquet(path, engine='auto', **kwargs):
    ----------
    path : string
        File path
+    columns : list, default=None
Contributor: add a version added tag

Contributor (author): done

+        If not None, only these columns will be read from the file.
+        .. versionadded:: 0.21.1
Contributor: i think u need a blank line before the version added tag

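Taken together with the earlier review note, the corrected parameter entry would presumably read:

    columns : list, default=None
        If not None, only these columns will be read from the file.

        .. versionadded:: 0.21.1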
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet reader library to use. If 'auto', then the option
        'io.parquet.engine' is used. If 'auto', then the first
@@ -201,4 +204,4 @@ def read_parquet(path, engine='auto', **kwargs):
"""

impl = get_engine(engine)
return impl.read(path)
return impl.read(path, columns)
Contributor: same
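
That is, presumably:

    return impl.read(path, columns=columns)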

12 changes: 12 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -282,6 +282,18 @@ def test_compression(self, engine, compression):
        df = pd.DataFrame({'A': [1, 2, 3]})
        self.check_round_trip(df, engine, compression=compression)

    def test_read_columns(self, engine, fp):
        # GH18154
        df = pd.DataFrame({'string': list('abc'),
Member: Reference issue number above.

                           'int': list(range(1, 4))})

        with tm.ensure_clean() as path:
Contributor: you don’t need the fp argument here: engine cycles thru both engines;
use check_round_trip; pass in the expected (and the columns kwarg)

            df.to_parquet(path, engine, compression=None)
            result = read_parquet(path, engine, columns=["string"])

            expected = pd.DataFrame({'string': list('abc')})
            tm.assert_frame_equal(result, expected)

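A sketch of the test rewritten along the reviewer's lines, assuming check_round_trip accepts an expected frame and forwards extra keyword arguments to read_parquet:

    def test_read_columns(self, engine):
        # GH 18154
        # relies on check_round_trip, which is assumed to take an expected
        # frame and pass the columns kwarg through to read_parquet
        df = pd.DataFrame({'string': list('abc'),
                           'int': list(range(1, 4))})

        expected = pd.DataFrame({'string': list('abc')})
        self.check_round_trip(df, engine, expected=expected,
                              columns=['string'])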

class TestParquetPyArrow(Base):
