
Commit e45dfe8, committed Jul 26, 2017 (1 parent: f9a552d)

ENH: add to/from_parquet with pyarrow & fastparquet

18 files changed: +627 / -5 lines

‎ci/install_travis.sh

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@ fi
 echo
 echo "[removing installed pandas]"
 conda remove pandas -y --force
+pip uninstall -y pandas

 if [ "$BUILD_TEST" ]; then

‎ci/requirements-2.7.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas

 echo "install 27"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet

‎ci/requirements-3.5.sh

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ source activate pandas

 echo "install 35"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
-
 # pip install python-dateutil to get latest
 conda remove -n pandas python-dateutil --force
 pip install python-dateutil
+
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

‎ci/requirements-3.5_OSX.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas

 echo "install 35_OSX"

-conda install -n pandas -c conda-forge feather-format==0.3.1
+conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet

‎ci/requirements-3.6.pip

Lines changed: 1 addition & 0 deletions
(new file)

brotlipy

‎ci/requirements-3.6.run

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ sqlalchemy
 pymysql
 feather-format
 pyarrow
+python-snappy
+fastparquet
 # psycopg2 (not avail on defaults ATM)
 beautifulsoup4
 s3fs

‎ci/requirements-3.6_WIN.run

Lines changed: 2 additions & 0 deletions
@@ -13,3 +13,5 @@ numexpr
 pytables
 matplotlib
 blosc
+fastparquet
+pyarrow

‎doc/source/install.rst

Lines changed: 1 addition & 0 deletions
@@ -236,6 +236,7 @@ Optional Dependencies
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+* ``Apache Parquet Format``: either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6), necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

 * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL

‎doc/source/io.rst

Lines changed: 71 additions & 0 deletions
@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
     binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
     binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
     binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
+    binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
     binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
     binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
     binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;

@@ -4550,6 +4551,76 @@ Read from a feather file.
    import os
    os.remove('example.feather')

+
+.. _io.parquet:
+
+Parquet
+-------
+
+.. versionadded:: 0.21.0
+
+Parquet provides a sharded, binary, columnar serialization for data frames. It is designed to make reading and
+writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a
+variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
+
+Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
+dtypes, including extension dtypes such as categorical and datetime with tz.
+
+Several caveats:
+
+- The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
+  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
+- Duplicate column names and non-string column names are not supported.
+- Unsupported types include ``Period`` and actual python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
+
+.. note::
+
+   These engines are very similar and should read/write nearly identical parquet format files.
+   The libraries differ in their underlying dependencies (``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library).
+   TODO: differing options to write non-standard columns & null treatment.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.Categorical(list('abc')),
+                      'g': pd.date_range('20130101', periods=3),
+                      'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'i': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a parquet file.
+
+.. ipython:: python
+
+   df.to_parquet('example_pa.parquet', engine='pyarrow')
+   df.to_parquet('example_fp.parquet', engine='fastparquet')
+
+Read from a parquet file.
+
+.. ipython:: python
+
+   result = pd.read_parquet('example_pa.parquet')
+   result = pd.read_parquet('example_fp.parquet')
+
+   # we preserve dtypes
+   result.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example_pa.parquet')
+   os.remove('example_fp.parquet')
+
 .. _io.sql:

 SQL Queries
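As a quick illustration of the index caveat documented above (not part of this commit; assumes pandas 0.21.0+ with pyarrow installed, and the file name is made up), a non-default index is rejected until it is moved into a column:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

    try:
        # the string index is not a default RangeIndex, so this raises
        df.to_parquet('indexed.parquet', engine='pyarrow', compression=None)
    except ValueError as err:
        print(err)

    # .reset_index() turns the index into an ordinary, string-named column,
    # which the format can store
    df.reset_index().to_parquet('indexed.parquet', engine='pyarrow',
                                compression=None)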

‎doc/source/options.rst

Lines changed: 2 additions & 0 deletions
@@ -414,6 +414,8 @@ io.hdf.default_format None default format writing format,
                                     'table'
 io.hdf.dropna_table     True    drop ALL nan rows when appending
                                 to a table
+io.parquet.engine       pyarrow The engine to use as a default for
+                                parquet reading and writing.
 mode.chained_assignment warn    Raise an exception, warn, or no
                                 action if trying to use chained
                                 assignment, The default is warn
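A short usage sketch for this new option (not part of the commit; assumes both engines are installed and the file names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]})

    # make fastparquet the session-wide default engine
    pd.set_option('io.parquet.engine', 'fastparquet')
    df.to_parquet('data_fp.parquet', compression=None)

    # or switch engines only within a block
    with pd.option_context('io.parquet.engine', 'pyarrow'):
        df.to_parquet('data_pa.parquet', compression=None)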

‎doc/source/whatsnew/v0.21.0.txt

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ Other Enhancements
 - :func:`DataFrame.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)
 - :func:`date_range` now accepts 'YS' in addition to 'AS' as an alias for start of year (:issue:`9313`)
 - :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
+- Integration with Apache Parquet, including a new top-level ``pd.read_parquet()`` function and ``DataFrame.to_parquet()`` method, see :ref:`here <io.parquet>`.

 .. _whatsnew_0210.api_breaking:
‎pandas/core/config_init.py

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -462,3 +462,14 @@ def _register_xlsx(engine, other):
462462
except ImportError:
463463
# fallback
464464
_register_xlsx('openpyxl', 'xlsxwriter')
465+
466+
# Set up the io.parquet specific configuration.
467+
parquet_engine_doc = """
468+
: string
469+
The default parquet reader/writer engine. Available options:
470+
'pyarrow', 'fastparquet', the default is 'pyarrow'
471+
"""
472+
473+
with cf.config_prefix('io.parquet'):
474+
cf.register_option('engine', 'pyarrow', parquet_engine_doc,
475+
validator=is_one_of_factory(['pyarrow', 'fastparquet']))
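As a rough sketch of the effect of the ``is_one_of_factory`` validator registered above (not part of the commit), an unknown engine name is rejected when the option is set, not when a file is written:

    import pandas as pd

    print(pd.get_option('io.parquet.engine'))  # 'pyarrow' by default

    try:
        pd.set_option('io.parquet.engine', 'not-an-engine')
    except ValueError as err:
        # only 'pyarrow' and 'fastparquet' pass the validator
        print(err)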

‎pandas/core/frame.py

Lines changed: 23 additions & 0 deletions
@@ -1598,6 +1598,29 @@ def to_feather(self, fname):
         from pandas.io.feather_format import to_feather
         to_feather(self, fname)

+    def to_parquet(self, fname, engine=None, compression='snappy',
+                   **kwargs):
+        """
+        Write out the binary parquet for DataFrames.
+
+        .. versionadded:: 0.21.0
+
+        Parameters
+        ----------
+        fname : str
+            string file path
+        engine : str, optional
+            The parquet engine, one of {'pyarrow', 'fastparquet'};
+            if None, will use the option: `io.parquet.engine`
+        compression : str, optional, default 'snappy'
+            compression method, includes {'gzip', 'snappy', 'brotli'}
+        kwargs are passed to the engine
+        """
+        from pandas.io.parquet import to_parquet
+        to_parquet(self, fname, engine,
+                   compression=compression, **kwargs)
+
     @Substitution(header='Write out column names. If a list of string is given, \
         it is assumed to be aliases for the column names')
     @Appender(fmt.docstring_to_string, indents=1)
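A hedged usage sketch for the new method (not part of the commit; assumes pyarrow >= 0.4.1 is installed, and uses 'gzip' so that neither python-snappy nor brotlipy is required):

    import pandas as pd

    df = pd.DataFrame({'ints': [1, 2, 3], 'strs': ['x', 'y', 'z']})

    # write with an explicit engine and a built-in compression codec
    df.to_parquet('frame.parquet', engine='pyarrow', compression='gzip')

    # read it back; dtypes should round-trip
    roundtrip = pd.read_parquet('frame.parquet', engine='pyarrow')
    print(roundtrip.dtypes)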

‎pandas/io/api.py

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
 from pandas.io.sql import read_sql, read_sql_table, read_sql_query
 from pandas.io.sas import read_sas
 from pandas.io.feather_format import read_feather
+from pandas.io.parquet import read_parquet
 from pandas.io.stata import read_stata
 from pandas.io.pickle import read_pickle, to_pickle
 from pandas.io.packers import read_msgpack, to_msgpack

‎pandas/io/parquet.py

Lines changed: 179 additions & 0 deletions
(new file)

""" parquet compat """

from warnings import catch_warnings
from distutils.version import LooseVersion
from pandas import DataFrame, RangeIndex, Int64Index, get_option
from pandas.compat import range
from pandas.io.common import get_filepath_or_buffer


def get_engine(engine):
    """ return our implementation """

    if engine is None:
        engine = get_option('io.parquet.engine')

    if engine not in ['pyarrow', 'fastparquet']:
        raise ValueError("engine must be one of 'pyarrow', 'fastparquet'")

    if engine == 'pyarrow':
        return PyArrowImpl()
    elif engine == 'fastparquet':
        return FastParquetImpl()


class PyArrowImpl(object):

    def __init__(self):
        # since pandas is a dependency of pyarrow
        # we need to import on first use

        try:
            import pyarrow
            import pyarrow.parquet
        except ImportError:
            raise ImportError("pyarrow is required for parquet support\n\n"
                              "you can install via conda\n"
                              "conda install pyarrow -c conda-forge\n"
                              "\nor via pip\n"
                              "pip install pyarrow\n")

        if LooseVersion(pyarrow.__version__) < '0.4.1':
            raise ImportError("pyarrow >= 0.4.1 is required for parquet "
                              "support\n\n"
                              "you can install via conda\n"
                              "conda install pyarrow -c conda-forge\n"
                              "\nor via pip\n"
                              "pip install pyarrow\n")

        self.api = pyarrow

    def write(self, df, path, compression='snappy', **kwargs):
        path, _, _ = get_filepath_or_buffer(path)
        table = self.api.Table.from_pandas(df, timestamps_to_ms=True)
        self.api.parquet.write_table(
            table, path, compression=compression, **kwargs)

    def read(self, path):
        path, _, _ = get_filepath_or_buffer(path)
        return self.api.parquet.read_table(path).to_pandas()


class FastParquetImpl(object):

    def __init__(self):
        # since pandas is a dependency of fastparquet
        # we need to import on first use

        try:
            import fastparquet
        except ImportError:
            raise ImportError("fastparquet is required for parquet support\n\n"
                              "you can install via conda\n"
                              "conda install fastparquet -c conda-forge\n"
                              "\nor via pip\n"
                              "pip install fastparquet")

        if LooseVersion(fastparquet.__version__) < '0.1.0':
            raise ImportError("fastparquet >= 0.1.0 is required for parquet "
                              "support\n\n"
                              "you can install via conda\n"
                              "conda install fastparquet -c conda-forge\n"
                              "\nor via pip\n"
                              "pip install fastparquet")

        self.api = fastparquet

    def write(self, df, path, compression='snappy', **kwargs):
        # thriftpy/protocol/compact.py:339:
        # DeprecationWarning: tostring() is deprecated.
        # Use tobytes() instead.
        path, _, _ = get_filepath_or_buffer(path)
        with catch_warnings(record=True):
            self.api.write(path, df,
                           compression=compression, **kwargs)

    def read(self, path):
        path, _, _ = get_filepath_or_buffer(path)
        return self.api.ParquetFile(path).to_pandas()


def to_parquet(df, path, engine=None, compression='snappy', **kwargs):
    """
    Write a DataFrame to the parquet format.

    Parameters
    ----------
    df : DataFrame
    path : string
        File path
    engine : str, optional
        The parquet engine, one of {'pyarrow', 'fastparquet'};
        if None, will use the option: `io.parquet.engine`
    compression : str, optional, default 'snappy'
        compression method, includes {'gzip', 'snappy', 'brotli'}
    kwargs are passed to the engine
    """

    impl = get_engine(engine)

    if not isinstance(df, DataFrame):
        raise ValueError("to_parquet only supports IO with DataFrames")

    valid_types = {'string', 'unicode'}

    # validate index
    # --------------

    # validate that we have only a default index
    # raise on anything else as we don't serialize the index

    if not isinstance(df.index, Int64Index):
        raise ValueError("parquet does not support serializing {} "
                         "for the index; you can .reset_index() "
                         "to make the index into column(s)".format(
                             type(df.index)))

    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
        raise ValueError("parquet does not support serializing a "
                         "non-default index for the index; you "
                         "can .reset_index() to make the index "
                         "into column(s)")

    if df.index.name is not None:
        raise ValueError("parquet does not serialize index meta-data on a "
                         "default index")

    # validate columns
    # ----------------

    # must have value column names (strings only)
    if df.columns.inferred_type not in valid_types:
        raise ValueError("parquet must have string column names")

    return impl.write(df, path, compression=compression)


def read_parquet(path, engine=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.

    .. versionadded:: 0.21.0

    Parameters
    ----------
    path : string
        File path
    engine : str, optional
        The parquet engine, one of {'pyarrow', 'fastparquet'};
        if None, will use the option: `io.parquet.engine`
    kwargs are passed to the engine

    Returns
    -------
    DataFrame

    """

    impl = get_engine(engine)
    return impl.read(path)
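A small sketch of the column validation above (not part of the commit; assumes pyarrow is installed and the file names are illustrative): ``df.columns.inferred_type`` must be 'string' or 'unicode', so non-string labels fail before either engine is called:

    import pandas as pd
    from pandas.io.parquet import to_parquet

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 1, 2])

    try:
        # integer column labels -> columns.inferred_type == 'integer'
        to_parquet(df, 'bad_columns.parquet', engine='pyarrow', compression=None)
    except ValueError as err:
        print(err)  # "parquet must have string column names"

    # renaming the labels to strings satisfies the check
    to_parquet(df.rename(columns=str), 'ok_columns.parquet',
               engine='pyarrow', compression=None)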

‎pandas/tests/api/test_api.py

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ class TestPDApi(Base):
                  'read_gbq', 'read_hdf', 'read_html', 'read_json',
                  'read_msgpack', 'read_pickle', 'read_sas', 'read_sql',
                  'read_sql_query', 'read_sql_table', 'read_stata',
-                 'read_table', 'read_feather']
+                 'read_table', 'read_feather', 'read_parquet']

     # top-level to_* funcs
     funcs_to = ['to_datetime', 'to_msgpack',

‎pandas/tests/io/test_parquet.py

Lines changed: 326 additions & 0 deletions
(new file)

""" test parquet compat """

import pytest
import datetime
from warnings import catch_warnings

import numpy as np
import pandas as pd
from pandas.compat import PY3, is_platform_windows
from pandas.io.parquet import to_parquet, read_parquet
from pandas.util import testing as tm

try:
    import pyarrow  # noqa
    _HAVE_PYARROW = True
except ImportError:
    _HAVE_PYARROW = False

try:
    import fastparquet  # noqa
    _HAVE_FASTPARQUET = True
except ImportError:
    _HAVE_FASTPARQUET = False


# setup engines & skips
@pytest.fixture(params=[
    pytest.mark.skipif(not _HAVE_FASTPARQUET,
                       reason='fastparquet is not installed')('fastparquet'),
    pytest.mark.skipif(not _HAVE_PYARROW,
                       reason='pyarrow is not installed')('pyarrow')])
def engine(request):
    return request.param


@pytest.fixture
def pa():
    if not _HAVE_PYARROW:
        pytest.skip("pyarrow is not installed")
    if is_platform_windows():
        pytest.skip("pyarrow-parquet not building on windows")
    return 'pyarrow'


@pytest.fixture
def fp():
    if not _HAVE_FASTPARQUET:
        pytest.skip("fastparquet is not installed")
    return 'fastparquet'


@pytest.fixture
def df_compat():
    return pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'})


def test_invalid_engine(df_compat):

    with pytest.raises(ValueError):
        df_compat.to_parquet('foo', 'bar')


def test_options_py(df_compat, pa):
    # use the set option

    df = df_compat
    with tm.ensure_clean() as path:

        with pd.option_context('io.parquet.engine', 'pyarrow'):
            df.to_parquet(path)

            result = read_parquet(path, compression=None)
            tm.assert_frame_equal(result, df)


def test_options_fp(df_compat, fp):
    # use the set option

    df = df_compat
    with tm.ensure_clean() as path:

        with pd.option_context('io.parquet.engine', 'fastparquet'):
            df.to_parquet(path, compression=None)

            result = read_parquet(path, compression=None)
            tm.assert_frame_equal(result, df)


@pytest.mark.xfail(reason="fp does not ignore pa index __index_level_0__")
def test_cross_engine_pa_fp(df_compat, pa, fp):
    # cross-compat with differing reading/writing engines

    df = df_compat
    with tm.ensure_clean() as path:
        df.to_parquet(path, engine=pa, compression=None)

        result = read_parquet(path, engine=fp, compression=None)
        tm.assert_frame_equal(result, df)


def test_cross_engine_fp_pa(df_compat, pa, fp):
    # cross-compat with differing reading/writing engines

    df = df_compat
    with tm.ensure_clean() as path:
        df.to_parquet(path, engine=fp, compression=None)

        result = read_parquet(path, engine=pa, compression=None)
        tm.assert_frame_equal(result, df)


class Base(object):

    def check_error_on_write(self, df, engine, exc):
        # check that we are raising the exception
        # on writing

        with pytest.raises(exc):
            with tm.ensure_clean() as path:
                to_parquet(df, path, engine, compression=None)

    def check_round_trip(self, df, engine, expected=None, **kwargs):

        with tm.ensure_clean() as path:
            df.to_parquet(path, engine, **kwargs)
            result = read_parquet(path, engine)

            if expected is None:
                expected = df
            tm.assert_frame_equal(result, expected)

            # repeat
            to_parquet(df, path, engine, **kwargs)
            result = pd.read_parquet(path, engine)

            if expected is None:
                expected = df
            tm.assert_frame_equal(result, expected)


class TestBasic(Base):

    def test_error(self, engine):

        for obj in [pd.Series([1, 2, 3]), 1, 'foo', pd.Timestamp('20130101'),
                    np.array([1, 2, 3])]:
            self.check_error_on_write(obj, engine, ValueError)

    def test_columns_dtypes(self, engine):

        df = pd.DataFrame({'string': list('abc'),
                           'int': list(range(1, 4))})

        # unicode
        df.columns = [u'foo', u'bar']
        self.check_round_trip(df, engine, compression=None)

    def test_columns_dtypes_invalid(self, engine):

        df = pd.DataFrame({'string': list('abc'),
                           'int': list(range(1, 4))})

        # numeric
        df.columns = [0, 1]
        self.check_error_on_write(df, engine, ValueError)

        if PY3:
            # bytes on PY3, on PY2 these are str
            df.columns = [b'foo', b'bar']
            self.check_error_on_write(df, engine, ValueError)

        # python object
        df.columns = [datetime.datetime(2011, 1, 1, 0, 0),
                      datetime.datetime(2011, 1, 1, 1, 1)]
        self.check_error_on_write(df, engine, ValueError)

    def test_write_with_index(self, engine):

        df = pd.DataFrame({'A': [1, 2, 3]})
        self.check_round_trip(df, engine, compression=None)

        # non-default index
        for index in [[2, 3, 4],
                      pd.date_range('20130101', periods=3),
                      list('abc'),
                      [1, 3, 4],
                      pd.MultiIndex.from_tuples([('a', 1), ('a', 2),
                                                 ('b', 1)]),
                      ]:

            df.index = index
            self.check_error_on_write(df, engine, ValueError)

        # index with meta-data
        df.index = [0, 1, 2]
        df.index.name = 'foo'
        self.check_error_on_write(df, engine, ValueError)

        # column multi-index
        df.index = [0, 1, 2]
        df.columns = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
        self.check_error_on_write(df, engine, ValueError)

    @pytest.mark.parametrize('compression', [None, 'gzip', 'snappy', 'brotli'])
    def test_compression(self, engine, compression):

        if compression == 'snappy':
            pytest.importorskip('snappy')

        elif compression == 'brotli':
            pytest.importorskip('brotli')

        df = pd.DataFrame({'A': [1, 2, 3]})
        self.check_round_trip(df, engine, compression=compression)


class TestParquetPyArrow(Base):

    def test_basic(self, pa):

        df = pd.DataFrame({'string': list('abc'),
                           'string_with_nan': ['a', np.nan, 'c'],
                           'string_with_none': ['a', None, 'c'],
                           'bytes': [b'foo', b'bar', b'baz'],
                           'unicode': [u'foo', u'bar', u'baz'],
                           'int': list(range(1, 4)),
                           'uint': np.arange(3, 6).astype('u1'),
                           'float': np.arange(4.0, 7.0, dtype='float64'),
                           'float_with_nan': [2., np.nan, 3.],
                           'bool': [True, False, True],
                           'bool_with_none': [True, None, True],
                           'datetime_ns': pd.date_range('20130101', periods=3),
                           'datetime_with_nat': [pd.Timestamp('20130101'),
                                                 pd.NaT,
                                                 pd.Timestamp('20130103')]
                           })

        self.check_round_trip(df, pa)

    def test_duplicate_columns(self, pa):

        # not currently able to handle duplicate columns
        df = pd.DataFrame(np.arange(12).reshape(4, 3),
                          columns=list('aaa')).copy()
        self.check_error_on_write(df, pa, ValueError)

    def test_unsupported(self, pa):

        # period
        df = pd.DataFrame({'a': pd.period_range('2013', freq='M', periods=3)})
        self.check_error_on_write(df, pa, ValueError)

        # categorical
        df = pd.DataFrame({'a': pd.Categorical(list('abc'))})
        self.check_error_on_write(df, pa, NotImplementedError)

        # timedelta
        df = pd.DataFrame({'a': pd.timedelta_range('1 day',
                                                   periods=3)})
        self.check_error_on_write(df, pa, NotImplementedError)

        # mixed python objects
        df = pd.DataFrame({'a': ['a', 1, 2.0]})
        self.check_error_on_write(df, pa, ValueError)


class TestParquetFastParquet(Base):

    def test_basic(self, fp):

        df = pd.DataFrame(
            {'string': list('abc'),
             'string_with_nan': ['a', np.nan, 'c'],
             'string_with_none': ['a', None, 'c'],
             'bytes': [b'foo', b'bar', b'baz'],
             'unicode': [u'foo', u'bar', u'baz'],
             'int': list(range(1, 4)),
             'uint': np.arange(3, 6).astype('u1'),
             'float': np.arange(4.0, 7.0, dtype='float64'),
             'float_with_nan': [2., np.nan, 3.],
             'bool': [True, False, True],
             'datetime': pd.date_range('20130101', periods=3),
             'datetime_with_nat': [pd.Timestamp('20130101'),
                                   pd.NaT,
                                   pd.Timestamp('20130103')],
             'timedelta': pd.timedelta_range('1 day', periods=3),
             })

        self.check_round_trip(df, fp, compression=None)

    @pytest.mark.skip(reason="not supported")
    def test_duplicate_columns(self, fp):

        # not currently able to handle duplicate columns
        df = pd.DataFrame(np.arange(12).reshape(4, 3),
                          columns=list('aaa')).copy()
        self.check_error_on_write(df, fp, ValueError)

    def test_bool_with_none(self, fp):
        df = pd.DataFrame({'a': [True, None, False]})
        expected = pd.DataFrame({'a': [1.0, np.nan, 0.0]}, dtype='float16')
        self.check_round_trip(df, fp, expected=expected, compression=None)

    def test_unsupported(self, fp):

        # period
        df = pd.DataFrame({'a': pd.period_range('2013', freq='M', periods=3)})
        self.check_error_on_write(df, fp, ValueError)

        # mixed
        df = pd.DataFrame({'a': ['a', 1, 2.0]})
        self.check_error_on_write(df, fp, ValueError)

    def test_categorical(self, fp):
        df = pd.DataFrame({'a': pd.Categorical(list('abc'))})
        self.check_round_trip(df, fp, compression=None)

    def test_datetime_tz(self, fp):
        # doesn't preserve tz
        df = pd.DataFrame({'a': pd.date_range('20130101', periods=3,
                                              tz='US/Eastern')})

        # warns on the coercion
        with catch_warnings(record=True):
            self.check_round_trip(df, fp, df.astype('datetime64[ns]'),
                                  compression=None)
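One note on the ``engine`` fixture at the top of this file: applying a mark by calling it on a parameter value (``pytest.mark.skipif(...)('fastparquet')``) was the idiom of the pytest versions current in 2017 and is deprecated in later releases; on newer pytest the equivalent parametrization would look roughly like this (sketch only, reusing the ``_HAVE_*`` flags defined above):

    import pytest

    @pytest.fixture(params=[
        pytest.param('fastparquet',
                     marks=pytest.mark.skipif(not _HAVE_FASTPARQUET,
                                              reason='fastparquet is not installed')),
        pytest.param('pyarrow',
                     marks=pytest.mark.skipif(not _HAVE_PYARROW,
                                              reason='pyarrow is not installed')),
    ])
    def engine(request):
        return request.param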

‎pandas/util/_print_versions.py

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ def show_versions(as_json=False):
         ("psycopg2", lambda mod: mod.__version__),
         ("jinja2", lambda mod: mod.__version__),
         ("s3fs", lambda mod: mod.__version__),
+        ("fastparquet", lambda mod: mod.__version__),
         ("pandas_gbq", lambda mod: mod.__version__),
         ("pandas_datareader", lambda mod: mod.__version__),
     ]
