Commit 7f4ef32

ENH: add to/from_parquet with pyarrow & fastparquet

1 parent 7930202 commit 7f4ef32
19 files changed, +629 -6 lines changed

ci/install_travis.sh (+1)

@@ -156,6 +156,7 @@ fi
  echo
  echo "[removing installed pandas]"
  conda remove pandas -y --force
+ pip uninstall -y pandas

  if [ "$BUILD_TEST" ]; then

ci/requirements-2.7.sh (+1 -1)

@@ -4,4 +4,4 @@ source activate pandas

  echo "install 27"

- conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
+ conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet

ci/requirements-3.5.sh (+2 -2)

@@ -4,8 +4,8 @@ source activate pandas

  echo "install 35"

- conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
-
  # pip install python-dateutil to get latest
  conda remove -n pandas python-dateutil --force
  pip install python-dateutil
+
+ conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

ci/requirements-3.5_OSX.sh (+1 -1)

@@ -4,4 +4,4 @@ source activate pandas

  echo "install 35_OSX"

- conda install -n pandas -c conda-forge feather-format==0.3.1
+ conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet

ci/requirements-3.6.pip (+1)

@@ -0,0 +1 @@
+ brotlipy

ci/requirements-3.6.run (+2)

@@ -16,6 +16,8 @@ sqlalchemy
  pymysql
  feather-format
  pyarrow=0.4.1
+ python-snappy
+ fastparquet
  # psycopg2 (not avail on defaults ATM)
  beautifulsoup4
  s3fs

ci/requirements-3.6_DOC.sh (+2 -1)

@@ -6,6 +6,7 @@ echo "[install DOC_BUILD deps]"

  pip install pandas-gbq

- conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 nbsphinx pandoc
+ conda install -n pandas -c conda-forge nbsphinx pandoc
+ conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet

  conda install -n pandas -c r r rpy2 --yes

ci/requirements-3.6_WIN.run (+2)

@@ -13,3 +13,5 @@ numexpr
  pytables
  matplotlib
  blosc
+ fastparquet
+ pyarrow

doc/source/install.rst (+1)

@@ -236,6 +236,7 @@ Optional Dependencies
  * `xarray <http://xarray.pydata.org>`__: pandas-like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
  * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
  * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+ * ``Apache Parquet Format``: either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6) is necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
  * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

  * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
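Since neither engine is a hard dependency, it can be useful to check what a given environment actually provides before relying on the new IO. A minimal sketch using only the standard library (the helper name is ours, not part of this commit):

    import importlib

    def available_parquet_engines():
        """Report which optional parquet engines can be imported."""
        found = []
        for name in ('pyarrow', 'fastparquet'):
            try:
                importlib.import_module(name)
                found.append(name)
            except ImportError:
                pass
        return found

    print(available_parquet_engines())   # e.g. ['pyarrow']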

doc/source/io.rst (+71)

@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
    binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
    binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
    binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
+   binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
    binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
    binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
    binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;

@@ -4550,6 +4551,76 @@ Read from a feather file.
    import os
    os.remove('example.feather')

+
+ .. _io.parquet:
+
+ Parquet
+ -------
+
+ .. versionadded:: 0.21.0
+
+ Parquet provides a sharded, binary, columnar serialization for data frames. It is designed to make reading and writing data
+ frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a
+ variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
+
+ Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
+ dtypes, including extension dtypes such as categorical and datetime with tz.
+
+ Several caveats:
+
+ - The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame``, and will raise an
+   error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
+ - Duplicate column names and non-string column names are not supported.
+ - Unsupported types include ``Period`` and actual python object types; these will raise a helpful error message
+   on an attempt at serialization.
+
+ See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
+
+ .. note::
+
+    These engines are very similar and should read/write nearly identical parquet format files.
+    They differ in their underlying dependencies (``fastparquet`` relies on ``numba``, while ``pyarrow`` uses a C library).
+    TODO: differing options to write non-standard columns & null treatment
+
+ .. ipython:: python
+
+    df = pd.DataFrame({'a': list('abc'),
+                       'b': list(range(1, 4)),
+                       'c': np.arange(3, 6).astype('u1'),
+                       'd': np.arange(4.0, 7.0, dtype='float64'),
+                       'e': [True, False, True],
+                       'f': pd.Categorical(list('abc')),
+                       'g': pd.date_range('20130101', periods=3),
+                       'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                       'i': pd.date_range('20130101', periods=3, freq='ns')})
+
+    df
+    df.dtypes
+
+ Write to a parquet file.
+
+ .. ipython:: python
+
+    df.to_parquet('example_pa.parquet', engine='pyarrow')
+    df.to_parquet('example_fp.parquet', engine='fastparquet')
+
+ Read from a parquet file.
+
+ .. ipython:: python
+
+    result = pd.read_parquet('example_pa.parquet')
+    result = pd.read_parquet('example_fp.parquet')
+
+    # we preserve dtypes
+    result.dtypes
+
+ .. ipython:: python
+    :suppress:
+
+    import os
+    os.remove('example_pa.parquet')
+    os.remove('example_fp.parquet')
+
  .. _io.sql:

  SQL Queries
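The index caveats above can be made concrete: a non-default index raises a ``ValueError``, and ``.reset_index()`` is the escape hatch. A minimal sketch against the API added in this commit (file names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

    try:
        df.to_parquet('will_fail.parquet', engine='pyarrow')
    except ValueError as err:
        # the string index cannot be serialized
        print(err)

    # moving the index into a regular column makes the frame writable
    df.reset_index().to_parquet('ok.parquet', engine='pyarrow')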

doc/source/options.rst (+2)

@@ -414,6 +414,8 @@ io.hdf.default_format None default format writing format,
                                          'table'
  io.hdf.dropna_table     True            drop ALL nan rows when appending
                                          to a table
+ io.parquet.engine       pyarrow         The engine to use as a default for
+                                         parquet reading and writing.
  mode.chained_assignment warn            Raise an exception, warn, or no
                                          action if trying to use chained
                                          assignment, The default is warn
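Because the engine is exposed as an ordinary option, the default can be switched globally and still overridden per call. A minimal sketch using the option added here:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]})

    pd.set_option('io.parquet.engine', 'fastparquet')
    df.to_parquet('example.parquet')                    # uses fastparquet
    df.to_parquet('example.parquet', engine='pyarrow')  # per-call override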

doc/source/whatsnew/v0.21.0.txt (+1)

@@ -76,6 +76,7 @@ Other Enhancements
  - :func:`DataFrame.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)
  - :func:`date_range` now accepts 'YS' in addition to 'AS' as an alias for start of year (:issue:`9313`)
  - :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
+ - Integration with Apache Parquet, including a new top-level ``pd.read_parquet()`` function and ``DataFrame.to_parquet()`` method, see :ref:`here <io.parquet>`.

  .. _whatsnew_0210.api_breaking:

pandas/core/config_init.py (+11)

@@ -462,3 +462,14 @@ def _register_xlsx(engine, other):
  except ImportError:
      # fallback
      _register_xlsx('openpyxl', 'xlsxwriter')
+
+ # Set up the io.parquet specific configuration.
+ parquet_engine_doc = """
+ : string
+     The default parquet reader/writer engine. Available options:
+     'pyarrow', 'fastparquet'; the default is 'pyarrow'.
+ """
+
+ with cf.config_prefix('io.parquet'):
+     cf.register_option('engine', 'pyarrow', parquet_engine_doc,
+                        validator=is_one_of_factory(['pyarrow', 'fastparquet']))
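With ``is_one_of_factory`` as the validator, a bad engine name is rejected when the option is set rather than at read/write time. A minimal sketch:

    import pandas as pd

    print(pd.get_option('io.parquet.engine'))   # -> 'pyarrow'

    try:
        pd.set_option('io.parquet.engine', 'not-an-engine')
    except ValueError as err:
        print(err)   # the validator rejects values outside the allowed set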

pandas/core/frame.py (+23)

@@ -1598,6 +1598,29 @@ def to_feather(self, fname):
      from pandas.io.feather_format import to_feather
      to_feather(self, fname)

+ def to_parquet(self, fname, engine=None, compression='snappy',
+                **kwargs):
+     """
+     Write the DataFrame to the binary parquet format.
+
+     .. versionadded:: 0.21.0
+
+     Parameters
+     ----------
+     fname : str
+         string file path
+     engine : str, optional
+         The parquet engine, one of {'pyarrow', 'fastparquet'};
+         if None, will use the option ``io.parquet.engine``
+     compression : str, optional, default 'snappy'
+         compression method, one of {'gzip', 'snappy', 'brotli'}
+     kwargs
+         Additional keyword arguments passed to the engine
+     """
+     from pandas.io.parquet import to_parquet
+     to_parquet(self, fname, engine,
+                compression=compression, **kwargs)
+
  @Substitution(header='Write out column names. If a list of string is given, \
  it is assumed to be aliases for the column names')
  @Appender(fmt.docstring_to_string, indents=1)
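A short usage sketch for the new method (assumes pyarrow is installed; file names are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x': np.arange(5), 'y': list('abcde')})

    # engine=None falls back to the 'io.parquet.engine' option
    # ('pyarrow' by default)
    df.to_parquet('data_snappy.parquet')
    df.to_parquet('data_gzip.parquet', compression='gzip')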

pandas/io/api.py (+1)

@@ -13,6 +13,7 @@
  from pandas.io.sql import read_sql, read_sql_table, read_sql_query
  from pandas.io.sas import read_sas
  from pandas.io.feather_format import read_feather
+ from pandas.io.parquet import read_parquet
  from pandas.io.stata import read_stata
  from pandas.io.pickle import read_pickle, to_pickle
  from pandas.io.packers import read_msgpack, to_msgpack

pandas/io/parquet.py (+179, new file)

@@ -0,0 +1,179 @@
+ """ parquet compat """
+
+ from warnings import catch_warnings
+ from distutils.version import LooseVersion
+ from pandas import DataFrame, RangeIndex, Int64Index, get_option
+ from pandas.compat import range
+ from pandas.io.common import get_filepath_or_buffer
+
+
+ def get_engine(engine):
+     """ return our implementation """
+
+     if engine is None:
+         engine = get_option('io.parquet.engine')
+
+     if engine not in ['pyarrow', 'fastparquet']:
+         raise ValueError("engine must be one of 'pyarrow', 'fastparquet'")
+
+     if engine == 'pyarrow':
+         return PyArrowImpl()
+     elif engine == 'fastparquet':
+         return FastParquetImpl()
+
+
+ class PyArrowImpl(object):
+
+     def __init__(self):
+         # since pandas is a dependency of pyarrow
+         # we need to import on first use
+
+         try:
+             import pyarrow
+             import pyarrow.parquet
+         except ImportError:
+             raise ImportError("pyarrow is required for parquet support\n\n"
+                               "you can install via conda\n"
+                               "conda install pyarrow -c conda-forge\n"
+                               "\nor via pip\n"
+                               "pip install pyarrow\n")
+
+         if LooseVersion(pyarrow.__version__) < '0.4.1':
+             raise ImportError("pyarrow >= 0.4.1 is required for parquet "
+                               "support\n\n"
+                               "you can install via conda\n"
+                               "conda install pyarrow -c conda-forge\n"
+                               "\nor via pip\n"
+                               "pip install pyarrow\n")
+
+         self.api = pyarrow
+
+     def write(self, df, path, compression='snappy', **kwargs):
+         path, _, _ = get_filepath_or_buffer(path)
+         table = self.api.Table.from_pandas(df, timestamps_to_ms=True)
+         self.api.parquet.write_table(
+             table, path, compression=compression, **kwargs)
+
+     def read(self, path):
+         path, _, _ = get_filepath_or_buffer(path)
+         return self.api.parquet.read_table(path).to_pandas()
+
+
+ class FastParquetImpl(object):
+
+     def __init__(self):
+         # since pandas is a dependency of fastparquet
+         # we need to import on first use
+
+         try:
+             import fastparquet
+         except ImportError:
+             raise ImportError("fastparquet is required for parquet support\n\n"
+                               "you can install via conda\n"
+                               "conda install fastparquet -c conda-forge\n"
+                               "\nor via pip\n"
+                               "pip install fastparquet")
+
+         if LooseVersion(fastparquet.__version__) < '0.1.0':
+             raise ImportError("fastparquet >= 0.1.0 is required for parquet "
+                               "support\n\n"
+                               "you can install via conda\n"
+                               "conda install fastparquet -c conda-forge\n"
+                               "\nor via pip\n"
+                               "pip install fastparquet")
+
+         self.api = fastparquet
+
+     def write(self, df, path, compression='snappy', **kwargs):
+         # suppress a harmless warning raised by thriftpy:
+         # thriftpy/protocol/compact.py:339:
+         # DeprecationWarning: tostring() is deprecated.
+         # Use tobytes() instead.
+         path, _, _ = get_filepath_or_buffer(path)
+         with catch_warnings(record=True):
+             self.api.write(path, df,
+                            compression=compression, **kwargs)
+
+     def read(self, path):
+         path, _, _ = get_filepath_or_buffer(path)
+         return self.api.ParquetFile(path).to_pandas()
+
+
+ def to_parquet(df, path, engine=None, compression='snappy', **kwargs):
+     """
+     Write a DataFrame to the parquet format.
+
+     Parameters
+     ----------
+     df : DataFrame
+     path : string
+         File path
+     engine : str, optional
+         The parquet engine, one of {'pyarrow', 'fastparquet'};
+         if None, will use the option ``io.parquet.engine``
+     compression : str, optional, default 'snappy'
+         compression method, one of {'gzip', 'snappy', 'brotli'}
+     kwargs
+         Additional keyword arguments passed to the engine
+     """
+
+     impl = get_engine(engine)
+
+     if not isinstance(df, DataFrame):
+         raise ValueError("to_parquet only supports IO with DataFrames")
+
+     valid_types = {'string', 'unicode'}
+
+     # validate index
+     # --------------
+
+     # validate that we have only a default index
+     # raise on anything else as we don't serialize the index
+
+     if not isinstance(df.index, Int64Index):
+         raise ValueError("parquet does not support serializing {} "
+                          "for the index; you can .reset_index() "
+                          "to make the index into column(s)".format(
+                              type(df.index)))
+
+     if not df.index.equals(RangeIndex.from_range(range(len(df)))):
+         raise ValueError("parquet does not support serializing a "
+                          "non-default index for the index; you "
+                          "can .reset_index() to make the index "
+                          "into column(s)")
+
+     if df.index.name is not None:
+         raise ValueError("parquet does not serialize index meta-data on a "
+                          "default index")
+
+     # validate columns
+     # ----------------
+
+     # must have valid column names (strings only)
+     if df.columns.inferred_type not in valid_types:
+         raise ValueError("parquet must have string column names")
+
+     return impl.write(df, path, compression=compression, **kwargs)
+
+
+ def read_parquet(path, engine=None, **kwargs):
+     """
+     Load a parquet object from the file path, returning a DataFrame.
+
+     .. versionadded:: 0.21.0
+
+     Parameters
+     ----------
+     path : string
+         File path
+     engine : str, optional
+         The parquet engine, one of {'pyarrow', 'fastparquet'};
+         if None, will use the option ``io.parquet.engine``
+     kwargs
+         Additional keyword arguments passed to the engine
+
+     Returns
+     -------
+     DataFrame
+     """
+
+     impl = get_engine(engine)
+     return impl.read(path)
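Putting the module together, a round trip through both engines looks like this (a minimal sketch; assumes both engines and snappy support are installed, and file names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    df.to_parquet('rt_pa.parquet', engine='pyarrow')
    df.to_parquet('rt_fp.parquet', engine='fastparquet')

    # both engines should return an identical frame
    pa_result = pd.read_parquet('rt_pa.parquet')
    fp_result = pd.read_parquet('rt_fp.parquet')
    assert pa_result.equals(df) and fp_result.equals(df)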
