-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Support MultiIndex columns in parquet (#34777) #36305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1. Update check to handle MultiIndex columns for parquet format
1. add whatsnew entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a test?
1. Update check to handle MultiIndex columns for parquet format 2. Edit whatsnew entry. 3. Add test for writing MultiIndex columns with string column names
1. Include issue number as a comment on added test
pandas/io/parquet.py
Outdated
if not all( | ||
x.inferred_type in {"string", "empty"} for x in df.columns.levels | ||
): | ||
raise ValueError("parquet must have string column names") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can say something about 'for all values in each level of the MultiIndex'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jreback for the suggestion on the exception statement - adding that into my next commit!
pandas/tests/io/test_parquet.py
Outdated
mi_columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]) | ||
df = pd.DataFrame(np.random.randn(4, 3), columns=mi_columns) | ||
self.check_error_on_write(df, engine, ValueError) | ||
|
||
def test_write_column_multiindex_string(self, pa): | ||
# GH #34777 | ||
# Not supported in fastparquet as of 0.1.3 or older pyarrow version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we > than the min pyarrow version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the min versions listed in pandas dependencies, the min pyarrow version is 0.15 while we are currently at 0.16 - at least for the dev environment that I'm working on.
pandas/tests/io/test_parquet.py
Outdated
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], | ||
["one", "two", "one", "two", "one", "two", "one", "two"], | ||
] | ||
df = pd.DataFrame(np.random.randn(8, 8), columns=arrays) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add names to the MultiIndex levels. do these round trip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After adding names to the MultiIndex levels, looks like they do round trip on pytest.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -297,6 +297,7 @@ I/O | |||
- :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`) | |||
- :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`) | |||
- :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`) | |||
- :meth:`to_parquet` did not support :class:`MultiIndex` for columns in parquet format (:issue:`34777`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would move this to other enhancements; say this is enabled with pyarrow=.....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the whatsnew entry to "other enhancements" - thanks!
1. Add tests for writing Indexes and MultiIndexes for columns 2. Edit message for check to handle MultiIndex columns for parquet 3. Edit whatsnew entry to move entry to other enhancements
1. Fix PEP8 issue for error message in check for MultiIndex columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine, can you add a whatsnew note; enhancements in 1.2
does this have a min pyarrow version?
add whatsnew entry: enhancements in 1.2
@jreback Since the enhancements work on pyarrow 0.15.0, we could leave the min pyarrow version as >= 0.15.0 for this pull request. I've added a whatsnew note on enhancements in 1.2 under "Other enhancements" - feel free to let me know if a more detailed whatsnew note is needed. :) P.S. Tests are failing after I did a rebase to update my branch. |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
05a2f62
to
039094c
Compare
Test failures look unrelated, looks related to #37818
|
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI is back green, one minor comment on whatsnew
@hweecat small comments, can you merge master and ping on green. |
thanks @hweecat and @charlesdong1991 |
all credits to @hweecat very nice job! |
Any reason why this explicitly raises errors for non-string multi-indicies? |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Support MultiIndex for columns in parquet format by updating value column names check to handle MultiIndexes.