
ENH: feather support in the pandas IO api #14383

Closed
wants to merge 1 commit into from
1 change: 1 addition & 0 deletions appveyor.yml
@@ -80,6 +80,7 @@ install:
- cmd: conda config --set ssl_verify false

# add the pandas channel *before* defaults to have defaults take priority
- cmd: conda config --add channels conda-forge
- cmd: conda config --add channels pandas
- cmd: conda config --remove channels defaults
- cmd: conda config --add channels defaults
3 changes: 2 additions & 1 deletion ci/install_travis.sh
@@ -71,7 +71,8 @@ else
conda config --set always_yes true --set changeps1 false || exit 1
conda update -q conda

# add the pandas channel *before* defaults to have defaults take priority
# add the pandas channel to take priority
# to add extra packages
echo "add channels"
conda config --add channels pandas || exit 1
conda config --remove channels defaults || exit 1
2 changes: 1 addition & 1 deletion ci/requirements-2.7-64.run
@@ -3,7 +3,7 @@ pytz
numpy=1.10*
xlwt
numexpr
pytables
pytables==3.2.2
matplotlib
openpyxl
xlrd
7 changes: 7 additions & 0 deletions ci/requirements-2.7.sh
@@ -0,0 +1,7 @@
#!/bin/bash

source activate pandas

echo "install 27"

conda install -n pandas -c conda-forge feather-format
3 changes: 2 additions & 1 deletion ci/requirements-3.5-64.run
@@ -1,11 +1,12 @@
python-dateutil
pytz
numpy=1.10*
numpy
openpyxl
xlsxwriter
xlrd
xlwt
scipy
feather-format
numexpr
pytables
matplotlib
4 changes: 1 addition & 3 deletions ci/requirements-3.5.run
@@ -18,6 +18,4 @@ pymysql
psycopg2
xarray
s3fs

# incompat with conda ATM
# beautiful-soup
beautifulsoup4
7 changes: 7 additions & 0 deletions ci/requirements-3.5.sh
@@ -0,0 +1,7 @@
#!/bin/bash

source activate pandas

echo "install 35"

conda install -n pandas -c conda-forge feather-format
4 changes: 1 addition & 3 deletions ci/requirements-3.5_OSX.run
@@ -13,6 +13,4 @@ jinja2
bottleneck
xarray
s3fs

# incompat with conda ATM
# beautiful-soup
beautifulsoup4
7 changes: 7 additions & 0 deletions ci/requirements-3.5_OSX.sh
@@ -0,0 +1,7 @@
#!/bin/bash

source activate pandas

echo "install 35_OSX"

conda install -n pandas -c conda-forge feather-format
9 changes: 9 additions & 0 deletions doc/source/api.rst
@@ -83,6 +83,14 @@ HDFStore: PyTables (HDF5)
HDFStore.get
HDFStore.select

Feather
~~~~~~~

.. autosummary::
   :toctree: generated/

   read_feather
Member

also to_feather in the dataframe section?


SAS
~~~

@@ -1015,6 +1023,7 @@ Serialization / IO / Conversion
DataFrame.to_excel
DataFrame.to_json
DataFrame.to_html
DataFrame.to_feather
DataFrame.to_latex
DataFrame.to_stata
DataFrame.to_msgpack
1 change: 1 addition & 0 deletions doc/source/install.rst
@@ -247,6 +247,7 @@ Optional Dependencies
* `SciPy <http://www.scipy.org>`__: miscellaneous statistical functions
* `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
* `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
* `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
* `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

- `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
64 changes: 64 additions & 0 deletions doc/source/io.rst
@@ -34,6 +34,7 @@ object.
* :ref:`read_csv<io.read_csv_table>`
* :ref:`read_excel<io.excel_reader>`
* :ref:`read_hdf<io.hdf5>`
* :ref:`read_feather<io.feather>`
* :ref:`read_sql<io.sql>`
* :ref:`read_json<io.json_reader>`
* :ref:`read_msgpack<io.msgpack>` (experimental)
@@ -49,6 +50,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
* :ref:`to_csv<io.store_in_csv>`
* :ref:`to_excel<io.excel_writer>`
* :ref:`to_hdf<io.hdf5>`
* :ref:`to_feather<io.feather>`
* :ref:`to_sql<io.sql>`
* :ref:`to_json<io.json_writer>`
* :ref:`to_msgpack<io.msgpack>` (experimental)
@@ -4152,6 +4154,68 @@ object). This cannot be changed after table creation.
os.remove('store.h5')


.. _io.feather:

Feather
-------

.. versionadded:: 0.20.0

Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data
frames efficient, and to make sharing data across data analysis languages easy.

Feather is designed to faithfully serialize and de-serialize DataFrames, supporting most of the pandas
dtypes, including extension dtypes such as categorical and datetime with tz.

Several caveats apply.

- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible
  with earlier versions.
- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame`` and will raise an
  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
Member

Additional point: Non-string column names ?

Member

and duplicate column names

Contributor Author

yep, these are raised automatically by feather now (as of 3.1)

- Duplicate column names and non-string column names are not supported.
- Unsupported types include ``Period`` and actual Python object types. Attempting to serialize these raises an
  informative error message.
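
The column-name caveats can be checked before attempting a write. The following is a minimal sketch in plain Python, not part of pandas or feather: `check_feather_columns` is a hypothetical helper that accepts any iterable of column labels (for example `df.columns`) and raises `ValueError` for non-string or duplicate labels.

```python
from collections import Counter


def check_feather_columns(labels):
    """Raise ValueError if labels break the column-name caveats (hypothetical helper)."""
    labels = list(labels)

    # every label must be a string
    non_strings = [label for label in labels if not isinstance(label, str)]
    if non_strings:
        raise ValueError(
            "column names must be strings, got: {!r}".format(non_strings))

    # every label must appear exactly once
    duplicates = [label for label, n in Counter(labels).items() if n > 1]
    if duplicates:
        raise ValueError(
            "column names must be unique, duplicated: {!r}".format(duplicates))

    return labels
```

Calling the helper on the labels first gives one clear error up front instead of relying on the writer to reject them.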

See the `Full Documentation <https://github.com/wesm/feather>`__

.. ipython:: python

   df = pd.DataFrame({'a': list('abc'),
                      'b': list(range(1, 4)),
                      'c': np.arange(3, 6).astype('u1'),
                      'd': np.arange(4.0, 7.0, dtype='float64'),
                      'e': [True, False, True],
                      'f': pd.Categorical(list('abc')),
                      'g': pd.date_range('20130101', periods=3),
                      'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
                      'i': pd.date_range('20130101', periods=3, freq='ns')})

   df
   df.dtypes

Write to a feather file.

.. ipython:: python

   df.to_feather('example.fth')

Read from a feather file.

.. ipython:: python

   result = pd.read_feather('example.fth')
   result

   # we preserve dtypes
   result.dtypes

Member

Maybe also show the dtypes? (so you see it is preserved automatically)

.. ipython:: python
   :suppress:

   import os
   os.remove('example.fth')

.. _io.sql:

SQL Queries
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -22,6 +22,9 @@ Check the :ref:`API Changes <whatsnew_0200.api_breaking>` and :ref:`deprecations
New features
~~~~~~~~~~~~

- Integration with ``feather-format``, including a new top-level ``pd.read_feather()`` function and a ``DataFrame.to_feather()`` method; see :ref:`here <io.feather>`.



.. _whatsnew_0200.enhancements.dataio_dtype:

2 changes: 1 addition & 1 deletion pandas/api/tests/test_api.py
@@ -95,7 +95,7 @@ class TestPDApi(Base, tm.TestCase):
'read_gbq', 'read_hdf', 'read_html', 'read_json',
'read_msgpack', 'read_pickle', 'read_sas', 'read_sql',
'read_sql_query', 'read_sql_table', 'read_stata',
'read_table']
'read_table', 'read_feather']

# top-level to_* funcs
funcs_to = ['to_datetime', 'to_msgpack',
15 changes: 15 additions & 0 deletions pandas/core/frame.py
@@ -1477,6 +1477,21 @@ def to_stata(self, fname, convert_dates=None, write_index=True,
variable_labels=variable_labels)
writer.write_file()

    def to_feather(self, fname):
        """
        write out the binary feather-format for DataFrames

        .. versionadded:: 0.20.0

        Parameters
        ----------
        fname : str
            string file path

        """
        from pandas.io.feather_format import to_feather
        to_feather(self, fname)

@Appender(fmt.docstring_to_string, indents=1)
def to_string(self, buf=None, columns=None, col_space=None, header=True,
index=True, na_rep='NaN', formatters=None, float_format=None,
1 change: 1 addition & 0 deletions pandas/io/api.py
@@ -12,6 +12,7 @@
from pandas.io.html import read_html
from pandas.io.sql import read_sql, read_sql_table, read_sql_query
from pandas.io.sas.sasreader import read_sas
from pandas.io.feather_format import read_feather
from pandas.io.stata import read_stata
from pandas.io.pickle import read_pickle, to_pickle
from pandas.io.packers import read_msgpack, to_msgpack
101 changes: 101 additions & 0 deletions pandas/io/feather_format.py
@@ -0,0 +1,101 @@
""" feather-format compat """
Member

Very minor, but can we call this file just feather.py ?

Member

ah, missed that the package is imported like that, confused by the feather-format package name

Contributor Author

yeah that is an annoying 'feature' in python!


from distutils.version import LooseVersion
from pandas import DataFrame, RangeIndex, Int64Index
from pandas.compat import range


def _try_import():
    # since pandas is a dependency of feather
    # we need to import on first use

    try:
        import feather
    except ImportError:

        # give a nice error message
        raise ImportError("the feather-format library is not installed\n"
                          "you can install via conda\n"
                          "conda install feather-format -c conda-forge\n"
                          "or via pip\n"
                          "pip install feather-format\n")

    # a missing __version__ attribute is treated as an outdated install
    version = getattr(feather, '__version__', None)
    if version is None or LooseVersion(version) < LooseVersion('0.3.1'):
        raise ImportError("the feather-format library must be >= "
                          "version 0.3.1\n"
                          "you can install via conda\n"
                          "conda install feather-format -c conda-forge\n"
                          "or via pip\n"
                          "pip install feather-format\n")

    return feather


def to_feather(df, path):
    """
    Write a DataFrame to the feather-format

    Parameters
    ----------
    df : DataFrame
    path : string
        File path
    """
    if not isinstance(df, DataFrame):
        raise ValueError("feather only supports IO with DataFrames")

    feather = _try_import()
    valid_types = {'string', 'unicode'}

    # validate index
    # --------------

    # validate that we have only a default index
    # raise on anything else as we don't serialize the index

    if not isinstance(df.index, Int64Index):
        raise ValueError("feather does not serialize {} "
Member

either "does not support serializing" or "does not serialize"

                         "for the index; you can .reset_index() "
                         "to make the index into column(s)".format(
                             type(df.index)))

    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
        raise ValueError("feather does not serialize a non-default index "
Member

the same here

                         "for the index; you can .reset_index() "
                         "to make the index into column(s)")

    if df.index.name is not None:
        raise ValueError("feather does not serialize index meta-data on a "
                         "default index")

    # validate columns
    # ----------------

    # must have valid column names (strings only)
    if df.columns.inferred_type not in valid_types:
        raise ValueError("feather must have string column names")

    feather.write_dataframe(df, path)
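
Taken together, the index checks in `to_feather` require the default `0..n-1` integer index with no name. A rough plain-Python sketch of that rule follows; `is_default_index` is a hypothetical helper that works on a plain list of labels rather than a pandas `Index`, so it only approximates the pandas checks.

```python
def is_default_index(labels, name=None):
    """Return True if labels are exactly 0..n-1 and no name is set (hypothetical)."""
    labels = list(labels)

    # a named index carries metadata that would be lost
    if name is not None:
        return False

    # bool is a subclass of int, so exclude it explicitly
    if not all(isinstance(x, int) and not isinstance(x, bool) for x in labels):
        return False

    return labels == list(range(len(labels)))
```

When such a check fails, the error messages above suggest calling `.reset_index()` so that the index is stored as ordinary columns.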


def read_feather(path):
    """
    Load a feather-format object from the file path

    .. versionadded:: 0.20.0

    Parameters
    ----------
    path : string
        File path

    Returns
    -------
    DataFrame

    """

    feather = _try_import()
    return feather.read_dataframe(path)