ENH: update feather IO for pyarrow 0.17 / Feather V2 #33422

Merged
1 change: 1 addition & 0 deletions doc/source/conf.py
@@ -416,6 +416,7 @@
"python": ("https://docs.python.org/3/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/reference/", None),
"statsmodels": ("https://www.statsmodels.org/devel/", None),
+"pyarrow": ("https://arrow.apache.org/docs/", None),
}

# extlinks alias
6 changes: 2 additions & 4 deletions doc/source/user_guide/io.rst
@@ -4583,16 +4583,14 @@ frames efficient, and to make sharing data across data analysis languages easy.
Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
dtypes, including extension dtypes such as categorical and datetime with tz.

-Several caveats.
+Several caveats:

-* This is a newer library, and the format, though stable, is not guaranteed to be backward compatible
-to the earlier versions.
* The format will NOT write an ``Index``, or ``MultiIndex`` for the
``DataFrame`` and will raise an error if a non-default one is provided. You
can ``.reset_index()`` to store the index or ``.reset_index(drop=True)`` to
ignore it.
* Duplicate column names and non-string columns names are not supported
-* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
+* Non supported types actual Python object types. These will raise a helpful error message
Review comment (Contributor):

Missing a word here. Maybe

Suggested change:
-* Non supported types actual Python object types. These will raise a helpful error message
+* object-dtype columns are not supported. This will raise with a helpful error message

Review comment (Member):

for my edification, does this mean that PeriodIndex or Series[Period] is supported? If so, is that a change from the older version?

on an attempt at serialization.

See the `Full Documentation <https://github.com/wesm/feather>`__.
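
The index caveat above can be handled before writing. The following is a minimal sketch, assuming pandas is installed. `prepare_for_feather` is a hypothetical helper name, not part of pandas, and its default-index check only approximates what the writer enforces.

```python
import pandas as pd


def prepare_for_feather(df: pd.DataFrame, keep_index: bool = True) -> pd.DataFrame:
    """Return a frame with a default RangeIndex (hypothetical helper)."""
    has_default_index = (
        df.index.name is None and df.index.equals(pd.RangeIndex(len(df)))
    )
    if has_default_index:
        return df
    # keep_index=True stores the index as ordinary column(s);
    # keep_index=False discards it.
    return df.reset_index(drop=not keep_index)


# A frame with a named, non-default index, which the Feather writer rejects.
df = pd.DataFrame({"value": [1.0, 2.0]}, index=pd.Index(["x", "y"], name="key"))
ready = prepare_for_feather(df)
print(list(ready.columns))
```

The returned frame has a default index and could then be passed to `DataFrame.to_feather`.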
12 changes: 9 additions & 3 deletions pandas/core/frame.py
@@ -2058,18 +2058,24 @@ def to_stata(
writer.write_file()

@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
-def to_feather(self, path) -> None:
+def to_feather(self, path, **kwargs) -> None:
Review comment (Member):

any reason not to make these explicit?

Review comment (Member Author, @jorisvandenbossche, Apr 9, 2020):

> any reason not to make these explicit?

The available keywords might change with the pyarrow version, so we would need to update the signature each time other keywords get added. Passing through kwargs makes this more "future-robust".

But I could make the ones that exist now explicit. However, that also means we would need to check the pyarrow version to give a nice error message saying which keyword is not yet supported with older pyarrow versions (which is of course not that difficult).
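
For illustration, the version check mentioned in this comment might look roughly like the sketch below. Everything in it is an assumption: `MIN_PYARROW_FOR_KEYWORD`, `parse_version` and `check_write_keywords` are hypothetical names, and the 0.17 threshold is taken from the docstring in this diff rather than from the pyarrow changelog.

```python
# Hypothetical table: keyword -> first pyarrow release assumed to accept it.
MIN_PYARROW_FOR_KEYWORD = {
    "compression": "0.17",
    "compression_level": "0.17",
    "chunksize": "0.17",
    "version": "0.17",
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "dev"
        parts.append(int(piece))
    return tuple(parts)


def check_write_keywords(kwargs, installed_version):
    """Raise ValueError naming keywords the installed pyarrow is too old for."""
    installed = parse_version(installed_version)
    unsupported = sorted(
        name
        for name in kwargs
        if name in MIN_PYARROW_FOR_KEYWORD
        and installed < parse_version(MIN_PYARROW_FOR_KEYWORD[name])
    )
    if unsupported:
        raise ValueError(
            f"keyword(s) {unsupported} require a newer pyarrow "
            f"than the installed {installed_version}"
        )


check_write_keywords({"compression": "zstd"}, "0.17.0")  # passes silently
```

This naive comparison treats a development build such as `0.16.1.dev` as older than 0.17, so a real implementation would need to handle pre-release versions separately.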

"""
-Write out the binary feather-format for DataFrames.
+Write a DataFrame to the binary Feather format.

Parameters
----------
path : str
String file path.
+**kwargs :
+Additional keywords passed to :func:`pyarrow.feather.write_feather`.
+Starting with pyarrow 0.17, this includes the `compression`,
+`compression_level`, `chunksize` and `version` keywords.
+
+.. versionadded:: 1.1.0
"""
from pandas.io.feather_format import to_feather

-to_feather(self, path)
+to_feather(self, path, **kwargs)

@Appender(
"""
9 changes: 6 additions & 3 deletions pandas/io/feather_format.py
@@ -7,15 +7,18 @@
from pandas.io.common import stringify_path


-def to_feather(df: DataFrame, path):
+def to_feather(df: DataFrame, path, **kwargs):
"""
-Write a DataFrame to the feather-format
+Write a DataFrame to the binary Feather format.

Parameters
----------
df : DataFrame
path : string file path, or file-like object
+**kwargs :
+Additional keywords passed to `pyarrow.feather.write_feather`.

+.. versionadded:: 1.1.0
"""
import_optional_dependency("pyarrow")
from pyarrow import feather
@@ -58,7 +61,7 @@ def to_feather(df: DataFrame, path):
if df.columns.inferred_type not in valid_types:
raise ValueError("feather must have string column names")

-feather.write_feather(df, path)
+feather.write_feather(df, path, **kwargs)


def read_feather(path, columns=None, use_threads: bool = True):
17 changes: 12 additions & 5 deletions pandas/tests/io/test_feather.py
@@ -4,6 +4,8 @@
import numpy as np
import pytest

+import pandas.util._test_decorators as td
+
import pandas as pd
import pandas._testing as tm

@@ -27,15 +29,15 @@ def check_error_on_write(self, df, exc):
with tm.ensure_clean() as path:
to_feather(df, path)

-def check_round_trip(self, df, expected=None, **kwargs):
+def check_round_trip(self, df, expected=None, write_kwargs={}, **read_kwargs):

if expected is None:
expected = df

with tm.ensure_clean() as path:
-to_feather(df, path)
+to_feather(df, path, **write_kwargs)

-result = read_feather(path, **kwargs)
+result = read_feather(path, **read_kwargs)
tm.assert_frame_equal(result, expected)
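
`write_kwargs={}` in the new signature is a mutable default argument. It is harmless here because the helper never mutates it, but a `None` sentinel is a common alternative. The sketch below shows that variant of the same write/read keyword split. It uses `json` and `tempfile` from the standard library as stand-ins for the Feather writer and `tm.ensure_clean`, so it illustrates the pattern rather than the pandas test helper itself.

```python
import json
import os
import tempfile


def check_round_trip(obj, expected=None, write_kwargs=None, **read_kwargs):
    """Write obj, read it back, and compare against expected."""
    if expected is None:
        expected = obj
    if write_kwargs is None:
        # A fresh dict per call avoids sharing one default across calls.
        write_kwargs = {}

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.json")
        with open(path, "w") as fh:
            json.dump(obj, fh, **write_kwargs)  # writer-only keywords
        with open(path) as fh:
            result = json.load(fh, **read_kwargs)  # reader-only keywords

    assert result == expected
    return result


# Writer keyword only: indentation does not change the data read back.
check_round_trip({"a": [1, 2]}, write_kwargs={"indent": 2})
```

Keeping the writer keywords in a single named argument leaves `**read_kwargs` free for the reader, which mirrors the split made in the diff.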

def test_error(self):
@@ -102,8 +104,8 @@ def test_read_columns(self):

def test_unsupported_other(self):

-# period
-df = pd.DataFrame({"a": pd.period_range("2013", freq="M", periods=3)})
Review comment (Member Author):

Since Feather now maps exactly to the Arrow memory format, periods are now supported (because Period is supported in the pandas -> pyarrow.Table conversion).

Review comment (Member):

that answers my question above, never mind

+# mixed python objects
+df = pd.DataFrame({"a": ["a", 1, 2.0]})
# Some versions raise ValueError, others raise ArrowInvalid.
self.check_error_on_write(df, Exception)

@@ -148,3 +150,8 @@ def test_path_localpath(self):
df = tm.makeDataFrame().reset_index()
result = tm.round_trip_localpath(df.to_feather, pd.read_feather)
tm.assert_frame_equal(df, result)

+@td.skip_if_no("pyarrow", min_version="0.16.1.dev")
+def test_passthrough_keywords(self):
+df = tm.makeDataFrame().reset_index()
+self.check_round_trip(df, write_kwargs=dict(version=1))