-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: feather support in the pandas IO api #14383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
# validate that we do not have any non-string object dtypes | ||
# as these 'work', but will not properly de-serialize | ||
valid_types = set(['string', 'unicode']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this provides a check for this serialization issue: wesm/feather#240
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{'string', 'unicode'}
works too FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed....old habits :>
cc @wesm |
Current coverage is 84.78% (diff: 82.35%)@@ master #14383 diff @@
==========================================
Files 144 145 +1
Lines 51056 51090 +34
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43225 43316 +91
+ Misses 7831 7774 -57
Partials 0 0
|
306987a
to
c0d9558
Compare
|
||
# must be a Index | ||
if isinstance(df.columns, MultiIndex): | ||
raise ValueError("feather does not serializing a MultiIndex " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
# validate that we do not have any non-string object dtypes | ||
# as these 'work', but will not properly de-serialize | ||
objects = df.select_dtypes(include=['object']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, fixed another way
msg = ("The following columns are not supported to serialize " | ||
"to the feather-format:\n\n" | ||
"{}".format(invalid.to_string())) | ||
raise ValueError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this whole function should be pushed down into Feather itself, and the dtype inference can be handled there too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes for the unit tests...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
Shall we target this for 0.20? |
some of these checks can be obviated by: wesm/feather#244 & wesm/feather#243 |
27eb06a
to
8b31e54
Compare
ERR: nicer error message on passing duplicate columns, xref #53 ERR: nice error message on serializing python objects, closes #240 should be after #239 xref pandas-dev/pandas#14383 Author: Jeff Reback <[email protected]> Closes #244 from jreback/infer_dtype and squashes the following commits: 091ee60 [Jeff Reback] fix pandas < 0.19.0 compat 6ef3a47 [Jeff Reback] fixed up mixed with embedded nulls tests 560b9e6 [Jeff Reback] ERR: nicer error message on passing duplicate columns, xref #53
de574d5
to
48f64c1
Compare
c1a4012
to
182aa28
Compare
b3945c4
to
0123d1c
Compare
this is ready to go. |
# validate that we have only a default index | ||
# raise on anything else as we don't serialize the index | ||
|
||
if not isinstance(df.index, (RangeIndex, Int64Index)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be simplified to isinstance(df.index, Int64Index)
?
In [6]: idx
Out[6]: RangeIndex(start=0, stop=10, step=1)
In [7]: isinstance(idx, pd.Int64Index)
Out[7]: True
Or is that not by design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you should also check whether df.index.name
is not None, and raise if it's set (or maybe warn?). I'm thinking about this too a bit in the JSON Table Schema stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yeah I can change the instance.
hmm, yes I will test for meta-data as well.
.. ipython:: python | ||
|
||
pd.read_feather('example.fth') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also show the dtypes? (so you see it is preserverd automatically)
.. autosummary:: | ||
:toctree: generated/ | ||
|
||
read_feather |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also to_feather
in the dataframe section?
- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible | ||
to the earlier versions. | ||
- The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an | ||
error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional point: Non-string column names ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and duplicate column names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, these are raised automaticaly by feather now (as of 3.1)
@@ -0,0 +1,93 @@ | |||
""" feather-format compat """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor, but can we call this file just feather.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, missed that the package is imported like that, confused by the feather-format package name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that is an annoying 'feature' in python!
# give a nice error message | ||
raise ImportError("the feather-format library is not installed\n" | ||
"you can install via conda\n" | ||
"conda install feather-format -c conda-forge") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or pip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, on linux I suppose this is ok (maybe mac), but certainly not on windows.
@wesm ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll work on Windows if you have the right compiler setup. Can recommend both here I suppose
# raise on anything else as we don't serialize the index | ||
|
||
if not isinstance(df.index, (RangeIndex, Int64Index)): | ||
raise ValueError("feather does not serializing {} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either "does not support serializing" or "does not serialize"
type(df.index))) | ||
|
||
if not df.index.equals(RangeIndex.from_range(range(len(df)))): | ||
raise ValueError("feather does not serializing a non-default index " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same here
0c952d8
to
11a0e30
Compare
all comments addressed (except if we want to recommend pip installs) |
ae9e4c5
to
4bad608
Compare
any final comments. @wesm ? going to merge in the next day or 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
# give a nice error message | ||
raise ImportError("the feather-format library is not installed\n" | ||
"you can install via conda\n" | ||
"conda install feather-format -c conda-forge") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll work on Windows if you have the right compiler setup. Can recommend both here I suppose
ok added pip install instructions as well. |
@jreback we should also add feather-format to the list of the doc building requirements (ci/requirements-2.7_DOC_BUILD.run) |
yep will do |
FYI, has to be in the |
We've been using |
OK, done: e27b296 |
closes #13092