Commit 8b31e54

ENH: feather support in the pandas IO api
closes #13092
1 parent e3d943d commit 8b31e54

16 files changed, with 332 additions and 2 deletions

appveyor.yml

+1
@@ -81,6 +81,7 @@ install:

   # add the pandas channel *before* defaults to have defaults take priority
   - cmd: conda config --add channels pandas
+  - cmd: conda config --add channels conda-forge
   - cmd: conda config --remove channels defaults
   - cmd: conda config --add channels defaults
   - cmd: conda install anaconda-client

ci/install_travis.sh

+4-1
@@ -74,8 +74,11 @@ else
     conda config --set always_yes true --set changeps1 false || exit 1
     conda update -q conda

-    # add the pandas channel *before* defaults to have defaults take priority
+    # add the pandas channel to take priority
+    # add the conda-forge channel *before* defaults
+    # to add extra packages
     echo "add channels"
+    conda config --add channels conda-forge || exit 1
     conda config --add channels pandas || exit 1
     conda config --remove channels defaults || exit 1
     conda config --add channels defaults || exit 1

ci/requirements-2.7-64.run

+1
@@ -9,6 +9,7 @@ openpyxl
 xlrd
 sqlalchemy
 lxml=3.2.1
+feather-format
 scipy
 xlsxwriter
 boto

ci/requirements-2.7.run

+1
@@ -9,6 +9,7 @@ openpyxl=1.6.2
 xlrd=0.9.2
 sqlalchemy=0.9.6
 lxml=3.2.1
+feather-format
 scipy
 xlsxwriter=0.4.6
 boto=2.36.0

ci/requirements-3.5.run

+1
@@ -9,6 +9,7 @@ scipy
 numexpr
 pytables
 html5lib
+feather-format
 lxml
 matplotlib
 jinja2

ci/requirements-3.5_OSX.run

+1
@@ -5,6 +5,7 @@ xlsxwriter
 xlrd
 xlwt
 numexpr
+feather-format
 pytables
 html5lib
 lxml

doc/source/api.rst

+8
@@ -82,6 +82,14 @@ HDFStore: PyTables (HDF5)
    HDFStore.get
    HDFStore.select

+Feather
+~~~~~~~
+
+.. autosummary::
+   :toctree: generated/
+
+   read_feather
+
 SAS
 ~~~

doc/source/install.rst

+1
@@ -247,6 +247,7 @@ Optional Dependencies
 * `SciPy <http://www.scipy.org>`__: miscellaneous statistical functions
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
+* `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

   - `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL

doc/source/io.rst

+59
@@ -34,6 +34,7 @@ object.
     * :ref:`read_csv<io.read_csv_table>`
     * :ref:`read_excel<io.excel_reader>`
     * :ref:`read_hdf<io.hdf5>`
+    * :ref:`read_feather<io.feather>`
     * :ref:`read_sql<io.sql>`
     * :ref:`read_json<io.json_reader>`
     * :ref:`read_msgpack<io.msgpack>` (experimental)
@@ -49,6 +50,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
     * :ref:`to_csv<io.store_in_csv>`
     * :ref:`to_excel<io.excel_writer>`
     * :ref:`to_hdf<io.hdf5>`
+    * :ref:`to_feather<io.feather>`
     * :ref:`to_sql<io.sql>`
     * :ref:`to_json<io.json_writer>`
     * :ref:`to_msgpack<io.msgpack>` (experimental)
@@ -4089,6 +4091,63 @@ object). This cannot be changed after table creation.
    os.remove('store.h5')


+.. _io.feather:
+
+Feather
+-------
+
+.. versionadded:: 0.19.1
+
+Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data
+frames efficient, and to make sharing data across data analysis languages easy.
+
+Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
+dtypes, including extension dtypes such as categorical and datetime with tz.
+
+Several caveats:
+
+- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible
+  with earlier versions.
+- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame`` and will raise an
+  error if a non-default one is provided. You can call ``.reset_index()`` to store the index as column(s).
+- Unsupported types include ``Period`` and actual python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+See the `Full Documentation <https://github.com/wesm/feather>`__.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.Categorical(list('abc')),
+                      'g': pd.date_range('20130101', periods=3),
+                      'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'i': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a feather file.
+
+.. ipython:: python
+
+   df.to_feather('example.fth')
+
+Read from a feather file.
+
+.. ipython:: python
+
+   pd.read_feather('example.fth')
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example.fth')
+
 .. _io.sql:

 SQL Queries
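
Note on the index caveat in the hunk above: the requirement of a default index can be tried without feather installed. The following is a minimal sketch, assuming a reasonably recent pandas. `has_default_index` is a hypothetical helper that mirrors the spirit of the check in `pandas/io/feather_format.py`; it is not part of this commit or of pandas.

```python
import pandas as pd


def has_default_index(df):
    """Hypothetical helper: True if df.index is exactly 0..len(df)-1."""
    return df.index.equals(pd.RangeIndex(len(df)))


# a frame with string labels as its index would be rejected by to_feather
df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
print(has_default_index(df))      # False

# reset_index() moves the labels into a column and restores a default index
flat = df.reset_index()
print(has_default_index(flat))    # True
print(list(flat.columns))         # ['index', 'a']
```

After `reset_index()` the old labels travel with the data as an ordinary column, which is why the documentation suggests it before writing.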

doc/source/whatsnew/v0.19.1.txt

+9
@@ -15,6 +15,15 @@ Highlights include:
     :backlinks: none


+.. _whatsnew_0191.new_features:
+
+New features
+~~~~~~~~~~~~
+
+- Integration with the ``feather-format`` library, including a new top-level ``pd.read_feather()`` function and a ``DataFrame.to_feather()`` method, see :ref:`here <io.feather>`.
+
+
+
 .. _whatsnew_0191.performance:

 Performance Improvements

pandas/api/tests/test_api.py

+1-1
@@ -93,7 +93,7 @@ class TestPDApi(Base, tm.TestCase):
                   'read_gbq', 'read_hdf', 'read_html', 'read_json',
                   'read_msgpack', 'read_pickle', 'read_sas', 'read_sql',
                   'read_sql_query', 'read_sql_table', 'read_stata',
-                  'read_table']
+                  'read_table', 'read_feather']

     # top-level to_* funcs
     funcs_to = ['to_datetime', 'to_msgpack',

pandas/core/frame.py

+15
@@ -1530,6 +1530,21 @@ def to_stata(self, fname, convert_dates=None, write_index=True,
                              variable_labels=variable_labels)
         writer.write_file()

+    def to_feather(self, fname):
+        """
+        write out the binary feather-format for DataFrames
+
+        .. versionadded:: 0.19.1
+
+        Parameters
+        ----------
+        fname : str
+            string file path
+
+        """
+        from pandas.io.feather_format import to_feather
+        to_feather(self, fname)
+
     @Appender(fmt.docstring_to_string, indents=1)
     def to_string(self, buf=None, columns=None, col_space=None, header=True,
                   index=True, na_rep='NaN', formatters=None, float_format=None,

pandas/io/api.py

+1
@@ -12,6 +12,7 @@
 from pandas.io.html import read_html
 from pandas.io.sql import read_sql, read_sql_table, read_sql_query
 from pandas.io.sas.sasreader import read_sas
+from pandas.io.feather_format import read_feather
 from pandas.io.stata import read_stata
 from pandas.io.pickle import read_pickle, to_pickle
 from pandas.io.packers import read_msgpack, to_msgpack

pandas/io/feather_format.py

+110
@@ -0,0 +1,110 @@
+""" feather-format compat """
+
+from pandas import DataFrame, RangeIndex, MultiIndex, Int64Index
+from pandas.types.common import is_object_dtype
+from pandas.compat import range
+from pandas.lib import infer_dtype
+
+
+def _try_import():
+    # since pandas is a dependency of feather
+    # we need to import on first use
+
+    try:
+        import feather
+    except ImportError:
+
+        # give a nice error message
+        raise ImportError("the feather-format library is not installed\n"
+                          "you can install via conda\n"
+                          "conda install feather-format -c conda-forge")
+    return feather
+
+
+def to_feather(df, path):
+    """
+    Write a DataFrame to the feather-format
+
+    Parameters
+    ----------
+    df : DataFrame
+    path : string
+        File path
+    """
+    if not isinstance(df, DataFrame):
+        raise ValueError("feather only supports IO with DataFrames")
+
+    feather = _try_import()
+    valid_types = {'string', 'unicode'}
+
+    # validate index
+    # --------------
+
+    # validate that we have only a default index
+    # raise on anything else as we don't serialize the index
+
+    if not isinstance(df.index, (RangeIndex, Int64Index)):
+        raise ValueError("feather does not support serializing {} "
+                         "for the index; you can .reset_index() "
+                         "to make the index into column(s)".format(
+                             type(df.index)))
+
+    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
+        raise ValueError("feather does not support serializing a "
+                         "non-default index for the index; you can "
+                         ".reset_index() to make the index into column(s)")
+
+    # validate columns
+    # ----------------
+
+    # must have unique column names
+    if not df.columns.is_unique:
+        raise ValueError("feather does not support duplicate columns")
+
+    # must be an Index (not a MultiIndex)
+    if isinstance(df.columns, MultiIndex):
+        raise ValueError("feather does not support serializing a "
+                         "MultiIndex for the columns")
+
+    # must have valid column names (strings only)
+    if df.columns.inferred_type not in valid_types:
+        raise ValueError("feather must have string column names")
+
+    # validate dtypes
+    # ---------------
+
+    # validate that we do not have any non-string object dtypes
+    # as these 'work', but will not properly de-serialize
+    objects = [c for c, dtype in df.dtypes.iteritems()
+               if is_object_dtype(dtype)]
+    dtypes = [infer_dtype(df[c]) for c in objects]
+    if len(set(dtypes) - valid_types):
+        invalid = DataFrame([[i, c, dtype] for i, (c, dtype) in
+                             enumerate(zip(objects, dtypes))])
+        invalid.columns = ['ncolumn', 'column', 'inferred_dtype']
+        invalid = invalid[~invalid.inferred_dtype.isin(list(valid_types))]
+
+        msg = ("The following columns are not supported to serialize "
+               "to the feather-format:\n\n"
+               "{}".format(invalid.to_string()))
+        raise ValueError(msg)
+
+    feather.write_dataframe(df, path)
+
+
+def read_feather(path):
+    """
+    Load a feather-format object from the file path
+
+    Parameters
+    ----------
+    path : string
+        File path
+
+    Returns
+    -------
+    DataFrame
+    """
+
+    feather = _try_import()
+    return feather.read_dataframe(path)
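
Note on `_try_import` above: it defers importing the optional dependency until first use (feather itself depends on pandas) and replaces the bare `ImportError` with an install hint. A minimal standalone sketch of that pattern, using only the standard library, might look like the following. `import_optional` and its `hint` parameter are hypothetical names chosen for illustration; they are not part of this commit or of pandas.

```python
import importlib


def import_optional(name, hint):
    """Import module `name` on first use, or raise ImportError carrying `hint`."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # re-raise with a friendlier, actionable message
        raise ImportError(
            "the {} library is not installed\n{}".format(name, hint))


# json ships with Python, so this succeeds and returns the module object
json_mod = import_optional("json", "json is part of the standard library")
print(json_mod.dumps({"a": 1}))

# a module that does not exist triggers the friendlier message
try:
    import_optional("no_such_module_xyz", "conda install no_such_module_xyz")
except ImportError as exc:
    print(exc)
```

Because the import happens inside the function, a missing optional package only fails when a caller actually uses the feature, not when the parent package is imported.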
