Skip to content

ENH: feather support in the pandas IO api #14383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Oct 10, 2016

closes #13092

@jreback jreback added Enhancement IO Data IO issues that don't fit into a more specific label labels Oct 10, 2016
@jreback jreback added this to the 0.19.1 milestone Oct 10, 2016

# validate that we do not have any non-string object dtypes
# as these 'work', but will not properly de-serialize
valid_types = set(['string', 'unicode'])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this provides a check for this serialization issue: wesm/feather#240

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{'string', 'unicode'} works too FYI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed....old habits :>

@jreback
Copy link
Contributor Author

jreback commented Oct 10, 2016

cc @wesm

@codecov-io
Copy link

codecov-io commented Oct 10, 2016

Current coverage is 84.78% (diff: 82.35%)

Merging #14383 into master will increase coverage by 0.12%

@@             master     #14383   diff @@
==========================================
  Files           144        145     +1   
  Lines         51056      51090    +34   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43225      43316    +91   
+ Misses         7831       7774    -57   
  Partials          0          0          

Powered by Codecov. Last update 72786cc...aa3c4b4

@jreback jreback force-pushed the feather branch 3 times, most recently from 306987a to c0d9558 Compare October 10, 2016 15:24

# must be a Index
if isinstance(df.columns, MultiIndex):
raise ValueError("feather does not serializing a MultiIndex "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


# validate that we do not have any non-string object dtypes
# as these 'work', but will not properly de-serialize
objects = df.select_dtypes(include=['object'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, fixed another way

msg = ("The following columns are not supported to serialize "
"to the feather-format:\n\n"
"{}".format(invalid.to_string()))
raise ValueError(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this whole function should be pushed down into Feather itself, and the dtype inference can be handled there too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for the unit tests...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@jorisvandenbossche
Copy link
Member

Shall we target this for 0.20?

@jreback
Copy link
Contributor Author

jreback commented Oct 24, 2016

some of these checks can be obviated by: wesm/feather#244 & wesm/feather#243

@jreback jreback force-pushed the feather branch 2 times, most recently from 27eb06a to 8b31e54 Compare October 26, 2016 10:31
wesm pushed a commit to wesm/feather that referenced this pull request Oct 28, 2016
ERR: nicer error message on passing duplicate columns, xref #53
ERR: nice error message on serializing python objects, closes #240

should be after #239

xref pandas-dev/pandas#14383

Author: Jeff Reback <[email protected]>

Closes #244 from jreback/infer_dtype and squashes the following commits:

091ee60 [Jeff Reback] fix pandas < 0.19.0 compat
6ef3a47 [Jeff Reback] fixed up mixed with embedded nulls tests
560b9e6 [Jeff Reback] ERR: nicer error message on passing duplicate columns, xref #53
@jreback jreback force-pushed the feather branch 4 times, most recently from de574d5 to 48f64c1 Compare November 30, 2016 23:42
@jreback jreback force-pushed the feather branch 2 times, most recently from c1a4012 to 182aa28 Compare December 13, 2016 11:39
@jreback jreback force-pushed the feather branch 3 times, most recently from b3945c4 to 0123d1c Compare December 16, 2016 11:02
@jreback
Copy link
Contributor Author

jreback commented Dec 16, 2016

this is ready to go.

@jorisvandenbossche
@wesm

# validate that we have only a default index
# raise on anything else as we don't serialize the index

if not isinstance(df.index, (RangeIndex, Int64Index)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be simplified to isinstance(df.index, Int64Index)?

In [6]: idx
Out[6]: RangeIndex(start=0, stop=10, step=1)

In [7]: isinstance(idx, pd.Int64Index)
Out[7]: True

Or is that not by design?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you should also check whether df.index.name is not None, and raise if it's set (or maybe warn?). I'm thinking about this too a bit in the JSON Table Schema stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah I can change the instance.

hmm, yes I will test for meta-data as well.

.. ipython:: python

pd.read_feather('example.fth')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also show the dtypes? (so you see it is preserverd automatically)

.. autosummary::
:toctree: generated/

read_feather
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also to_feather in the dataframe section?

- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible
to the earlier versions.
- The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional point: Non-string column names ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and duplicate column names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, these are raised automaticaly by feather now (as of 3.1)

@@ -0,0 +1,93 @@
""" feather-format compat """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but can we call this file just feather.py ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, missed that the package is imported like that, confused by the feather-format package name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that is an annoying 'feature' in python!

# give a nice error message
raise ImportError("the feather-format library is not installed\n"
"you can install via conda\n"
"conda install feather-format -c conda-forge")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or pip?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, on linux I suppose this is ok (maybe mac), but certainly not on windows.

@wesm ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll work on Windows if you have the right compiler setup. Can recommend both here I suppose

# raise on anything else as we don't serialize the index

if not isinstance(df.index, (RangeIndex, Int64Index)):
raise ValueError("feather does not serializing {} "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either "does not support serializing" or "does not serialize"

type(df.index)))

if not df.index.equals(RangeIndex.from_range(range(len(df)))):
raise ValueError("feather does not serializing a non-default index "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same here

@jreback jreback force-pushed the feather branch 2 times, most recently from 0c952d8 to 11a0e30 Compare December 16, 2016 13:55
@jreback
Copy link
Contributor Author

jreback commented Dec 16, 2016

all comments addressed (except if we want to recommend pip installs)

@jreback
Copy link
Contributor Author

jreback commented Dec 26, 2016

any final comments. @wesm ?

going to merge in the next day or 2.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# give a nice error message
raise ImportError("the feather-format library is not installed\n"
"you can install via conda\n"
"conda install feather-format -c conda-forge")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll work on Windows if you have the right compiler setup. Can recommend both here I suppose

@jreback
Copy link
Contributor Author

jreback commented Dec 26, 2016

ok added pip install instructions as well.

@jreback jreback closed this in eecfa88 Dec 26, 2016
@jorisvandenbossche
Copy link
Member

@jreback we should also add feather-format to the list of the doc building requirements (ci/requirements-2.7_DOC_BUILD.run)

@jreback
Copy link
Contributor Author

jreback commented Dec 27, 2016

yep will do

@jreback
Copy link
Contributor Author

jreback commented Dec 28, 2016

@jorisvandenbossche 3e3434b

FYI, has to be in the .sh script because we don't have conda-forge in the default channel list on travis; we get some weird override behavior with some packages, I think because of some incorrect meta-data, but these are really old and not likely to be changed. The .sh solves this problem by simply allowing arbitrary stuff for any build which needs it.

@jorisvandenbossche
Copy link
Member

@jreback Seems to be working now! There was only still a small typo: 975a440

@jorisvandenbossche
Copy link
Member

@jreback @wesm A more stylistic question: Jeff now used .fth as the file extension, but I think I have mainly seen .feather (in blogposts about it). Best to be consistent here. @wesm do you prefer one of both?

@wesm
Copy link
Member

wesm commented Dec 29, 2016

We've been using .feather, maybe stick with that.

@jorisvandenbossche
Copy link
Member

OK, done: e27b296

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: pd.read_feather/to_feather
6 participants