ENH: support non default indexes in writing to Parquet #18629
Conversation
We need to check how this interacts with the `columns` keyword.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master   #18629      +/-   ##
==========================================
- Coverage   91.59%   91.58%    -0.02%
==========================================
  Files         155      155
  Lines       51255    51260        +5
==========================================
- Hits        46949    46945        -4
- Misses       4306     4315        +9
```

Continue to review full report at Codecov.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master   #18629      +/-   ##
==========================================
- Coverage    91.6%   91.57%    -0.04%
==========================================
  Files         153      153
  Lines       51317    51363       +46
==========================================
+ Hits        47011    47034       +23
- Misses       4306     4329       +23
```

Continue to review full report at Codecov.
pandas/io/parquet.py (Outdated)

```python
                "to make the index into column(s)".format(type(df.index)))

        if not df.index.equals(RangeIndex.from_range(range(len(df)))):
```
Could create the `RangeIndex` via `RangeIndex(len(df))` here to make this a little cleaner.
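For illustration, a small sketch (not code from the PR) showing that the two spellings build the same default index:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]})

# RangeIndex(len(df)) builds the same default index as the longer
# RangeIndex.from_range(range(len(df))) spelling used in the diff above.
short = pd.RangeIndex(len(df))
long = pd.RangeIndex.from_range(range(len(df)))
print(short.equals(long) and df.index.equals(short))
```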
Done.
lgtm. some changes; pls. add a bug fix note in 0.22.0
pandas/io/parquet.py (Outdated)

```python
    if df.index.name is not None:
        raise ValueError("parquet does not serialize index meta-data on a "
                         "default index")
    # *unless* we're using pyarrow >= 0.7.1 which does support multi-indexes
```
so it might be better (though a bit more refactoring) to move the validation of index and validation of columns into methods on a base class for the impls, e.g.

```python
class BaseImpl(object):

    def validate_index(self, ...):
        ...

    def validate_columns(self, ...):
        ...


class PyArrowImpl(BaseImpl):

    def write(self, ...):
        if pyarrow < 0.7.1:
            self.validate_index()
        self.validate_columns()


class FastParquetImpl(BaseImpl):
    ...
```

as it makes things cleaner.
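A minimal runnable version of that sketch (class and method names assumed; the real pyarrow version check is stubbed to a boolean) might look like:

```python
# Hypothetical sketch of the suggested refactoring: the shared checks live
# on BaseImpl, and each engine impl calls them as needed.
class BaseImpl(object):
    def validate_index(self, df):
        # shared index validation would go here
        return True

    def validate_columns(self, df):
        # shared column-name validation would go here
        return True


class PyArrowImpl(BaseImpl):
    pyarrow_lt_071 = True  # stand-in for a real version check

    def write(self, df):
        if self.pyarrow_lt_071:
            self.validate_index(df)
        self.validate_columns(df)
        return "ok"
```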
Refactored in the latest commits...
pandas/tests/io/test_parquet.py (Outdated)

```python
        df.index = index
        self.check_round_trip(df, engine)

        # index with meta-data
```
could factor these 2 tests here from pyarrow impl (and also from fastparquet) into another test (that succeeds on all impls and versions)
Since the behaviour changes based on the library and the version of the library I'm not sure how else to do it while maintaining test coverage on all versions.
Feel free to push a fix if you'd like as I'm knocking off for the evening shortly...
minor comments. if you can refactor the tests a bit ok (if not no big deal). ping on green.
pandas/io/parquet.py (Outdated)

```python
        self._validate_index(df)

    def write(self, df, path, compression, **kwargs):
        raise NotImplementedError()
```
import and use `pandas.core.common.AbstractMethodError` here instead
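For reference, a small sketch of the suggested change; note that in recent pandas the exception is importable from `pandas.errors` (it lived in `pandas.core.common` at the time of this PR):

```python
from pandas.errors import AbstractMethodError

class BaseImpl(object):
    def write(self, df, path, compression, **kwargs):
        # AbstractMethodError names the offending class in its message,
        # unlike a bare NotImplementedError
        raise AbstractMethodError(self)

try:
    BaseImpl().write(None, None, None)
except AbstractMethodError as err:
    print(type(err).__name__)
```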
doc/source/whatsnew/v0.22.0.txt (Outdated)

```diff
@@ -77,6 +77,7 @@ Other Enhancements
 - :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
 - :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
 - Improved wording of ``ValueError`` raised in :func:`read_csv` when the ``usecols`` argument cannot match all columns. (:issue:`17301`)
+- Enabled the use of non-default indexes in ``to_parquet`` with pyarrow>=0.7.0 (:issue:`18581`)
```
use :func:`DataFrame.to_parquet` here, and use double-backticks around ``pyarrow>=0.7.0``
fastparquet also supports writing the index, so no need to still raise the errors in that case. Also, to repeat my comment from before: we need to decide how this interacts with the `columns` keyword before merging this.
It seems it might just be easier all-round to just bump the pyarrow/fastparquet deps to the min required to support non-default indices.
would be ok with bumping pyarrow dep
Hello @dhirschfeld! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 11, 2017 at 17:00 UTC
doc changes, ping on green.
ci/requirements-3.5.sh (Outdated)

```diff
@@ -8,4 +8,4 @@ echo "install 35"
 conda remove -n pandas python-dateutil --force
 pip install python-dateutil
```
can you check if we have pyarrow installed in ci/* anywhere else (and if so update the pinned versions)
I checked and changed any pinned versions to the new minimum. If it wasn't already pinned I left it as is, since a normal install will pull in later versions than the minimum.
doc/source/whatsnew/v0.22.0.txt (Outdated)

```diff
@@ -163,7 +163,7 @@ Other Enhancements
 - :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
 - Improved wording of ``ValueError`` raised in :func:`read_csv` when the ``usecols`` argument cannot match all columns. (:issue:`17301`)
 - :func:`DataFrame.corrwith` now silently drops non-numeric columns when passed a Series. Before, an exception was raised (:issue:`18570`).
+- Enabled the use of non-default indexes in :func:`DataFrame.to_parquet` where the underlying engine supports it (:issue:`18581`)
```
also there is another section on changed deps, pls add another entry which says pyarrow is now min 0.7.0 for parquet (also may need to update in io.rst)
@jreback do not merge on green; see my request for changes, on which I got no reply yet. BTW, I think we can do this for the 0.21.1 milestone. The other enhancements to …
@jorisvandenbossche - as of ... at least reading them anyway. Interestingly, it seems to write them fine, and the file written by fastparquet can then be read in perfectly by pyarrow. Not sure what to do about that, but since fastparquet throws a sensible error I'd be inclined to just leave it; a future update will likely add the missing functionality.

```python
from numpy.random import randn
import pandas as pd

dates = pd.date_range('01-Jan-2018', '01-Dec-2018', freq='MS')
df = pd.DataFrame(randn(2 * len(dates), 3), columns=list('ABC'))
index = pd.MultiIndex.from_product([['Level1', 'Level2'], dates])
df.index = index

df.to_parquet('tmp_pa.pq', engine='pyarrow')
df.to_parquet('tmp_fp.pq', engine='fastparquet', compression='UNCOMPRESSED')

df1 = pd.io.parquet.read_parquet('tmp_pa.pq', engine='pyarrow')
df1.equals(df)
```
@jorisvandenbossche - you're right about the `columns` keyword. The workaround, at the cost of efficiency, is simple enough:

```python
pd.io.parquet.read_parquet('tmp_pa.pq', engine='pyarrow')[['A', 'B']]
```

...but you have to know beforehand that the file was saved with a multi-index - i.e. it's a silent failure mode, which isn't ideal. If I name the index levels you can actually pull them out by name:

```python
df = pd.DataFrame(randn(2 * len(dates), 3), columns=list('ABC'))
index = pd.MultiIndex.from_product([['Level1', 'Level2'], dates],
                                   names=['level', 'date'])
df.index = index

pd.io.parquet.read_parquet('tmp_pa.pq', engine='pyarrow',
                           columns=['A', 'B', 'level', 'date']).head()
```

...so, ideally, if pandas can tell that the file has a multi-index, the index names could then always be prepended to the `columns` argument.
@jorisvandenbossche - were there other changes you wanted that I've missed?
It's a bit unfortunate it only comes up when reading it back in, so either we are ok with that, or we need to raise on writing MultiIndexes with fastparquet.
That will not always work, e.g. if they have no names; and pyarrow master has also changed this, so the index columns will never carry the level names but a generated name instead. The thing we can do is the following: read the `index_columns` entry from the pandas metadata stored in the file, and then add those to the `columns` keyword.
Heh, beat me to it:

```python
import pyarrow.parquet as pq

filemetadata = pq.read_metadata('tmp_fp.pq')
metadata = eval(filemetadata.metadata.get(b'pandas'), None, dict(null=None))
metadata['index_columns']
```

...eval is pretty nasty!
I also think that if there is a default index (range-like index without name, the case that was allowed now), we should not write it to the parquet file, because that would just be a waste of time and space.
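One way such a check could look (a sketch, not the code the PR ended up with):

```python
import pandas as pd

def is_default_index(df):
    # a "default" index is an unnamed RangeIndex equal to 0..len(df)-1
    return (isinstance(df.index, pd.RangeIndex)
            and df.index.name is None
            and df.index.equals(pd.RangeIndex(len(df))))

print(is_default_index(pd.DataFrame({"A": [1, 2, 3]})))
print(is_default_index(pd.DataFrame({"A": [1, 2, 3]}, index=[5, 6, 7])))
```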
Yep, the …
Since we are bumping the version I don't think this is a good idea; marked for 0.22. I am ok with this fix, @jorisvandenbossche
For pyarrow the `columns` fix is as simple as setting `use_pandas_metadata=True`:

```python
def read(self, path, columns=None, **kwargs):
    path, _, _ = get_filepath_or_buffer(path)
    parquet_file = self.api.parquet.ParquetFile(path)
    kwargs['use_pandas_metadata'] = True
    return parquet_file.read(columns=columns, **kwargs).to_pandas()
```

For fastparquet I would use:

```python
import json
from itertools import chain

def read(self, path, columns=None, **kwargs):
    path, _, _ = get_filepath_or_buffer(path)
    parquet_file = self.api.ParquetFile(path)
    metadata = json.loads(parquet_file.key_value_metadata['pandas'])
    if columns is not None:
        columns = set(chain(columns, metadata['index_columns']))
    return parquet_file.to_pandas(columns=columns, **kwargs)
```

...but since it doesn't support multi-indexes anyway it's a moot point.
pandas/tests/io/test_parquet.py (Outdated)

```diff
@@ -6,6 +6,7 @@
 from warnings import catch_warnings

+import numpy as np
 from numpy.random import randn
```
we don't do the import like this anywhere else, appreciate changing it
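The idiomatic spelling in the pandas codebase keeps the `np` namespace, e.g.:

```python
import numpy as np

# preferred over `from numpy.random import randn`: the np. prefix keeps
# the origin of the call obvious
arr = np.random.randn(4, 3)
print(arr.shape)
```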
pandas/tests/io/test_parquet.py
Outdated
df.columns = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)]), | ||
self.check_error_on_write(df, engine, ValueError) | ||
|
||
def test_multiindex_with_columns(self, engine): |
you can just pass `pa` instead of `engine` to skip fp
@dhirschfeld Thanks for the update. I think there are still some left-overs to clean up from undoing the removal of support for older versions, but I can also put some time in it this evening if needed.
pandas/io/parquet.py
Outdated
self.validate_dataframe(df) | ||
if self._pyarrow_lt_070: | ||
self._validate_write_lt_070( | ||
df, path, compression, coerce_timestamps, **kwargs |
Is there a reason you pass `path`, `compression`, `coerce_timestamps` here?
Just a left over from a refactoring. Should be removed as they're no longer needed...
pandas/io/parquet.py
Outdated
if self._pyarrow_lt_060: | ||
table = self.api.Table.from_pandas(df, timestamps_to_ms=True) | ||
self.api.parquet.write_table( | ||
table, path, compression=compression, **kwargs) |
Do we need to keep this for older pyarrow versions?
pandas/io/parquet.py
Outdated
if not isinstance(df.index, Int64Index): | ||
msg = ( | ||
"parquet does not support serializing {} for the index;" | ||
"you can .reset_index() to make the index into column(s)" |
maybe we should add here that they can also install latest pyarrow or fastparquet version (same for the others)
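The suggested message could read something like this (wording assumed, not the final text of the PR):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0]}, index=pd.Index(["a"]))

# hypothetical wording that also points users at newer engine versions
msg = ("parquet does not support serializing {} for the index; "
       "you can .reset_index() to make the index into column(s), or "
       "install the latest pyarrow or fastparquet, which support "
       "non-default indexes".format(type(df.index)))
print("reset_index" in msg)
```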
pandas/io/parquet.py
Outdated
class FastParquetImpl(object): | ||
parquet_file = self.api.parquet.ParquetFile(path) | ||
if self._pyarrow_lt_070: | ||
parquet_file.path = path |
I would just pass `path` to the `_read_lt_070` function?
@dhirschfeld sorry for taking over your PR, but I want to get this merged for 0.21.1, and didn't know if you would have time today/tomorrow. But feel free to review my changes
@jorisvandenbossche - that's fine, I'd only be able to look into it tonight so am happy for you to take the lead! I'm hoping to make use of it so am keen to get it in too...
LGTM!
And all CI is passing! @jreback more comments?
tiny style comments, otherwise lgtm.
pandas/io/parquet.py
Outdated
def validate_dataframe(df): | ||
if not isinstance(df, DataFrame): | ||
raise ValueError("to_parquet only supports IO with DataFrames") | ||
# must have value column names (strings only) |
can you add a blank line before comments, easier to read
@dhirschfeld Thanks a lot for starting the PR!
fastparquet automatically names an index 'index' if it doesn't already have a name (cherry picked from commit 8d7e876)
Thanks @jorisvandenbossche for getting this over the line! I'm excited to make use of this new functionality! :)
Version 0.22.0

```
* tag 'v0.22.0': (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
```
```sh
git diff upstream/master -u -- "*.py" | flake8 --diff
```