ENH: Support MultiIndex columns in parquet (#34777) #36305
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,7 +7,7 @@ | |
from pandas.compat._optional import import_optional_dependency | ||
from pandas.errors import AbstractMethodError | ||
|
||
from pandas import DataFrame, get_option | ||
from pandas import DataFrame, MultiIndex, get_option | ||
|
||
from pandas.io.common import get_filepath_or_buffer, is_fsspec_url, stringify_path | ||
|
||
|
@@ -53,9 +53,15 @@ def validate_dataframe(df: DataFrame): | |
if not isinstance(df, DataFrame): | ||
raise ValueError("to_parquet only supports IO with DataFrames") | ||
|
||
# must have value column names (strings only) | ||
if df.columns.inferred_type not in {"string", "empty"}: | ||
raise ValueError("parquet must have string column names") | ||
# must have value column names for all index levels (strings only) | ||
if isinstance(df.columns, MultiIndex): | ||
if not all( | ||
|
||
x.inferred_type in {"string", "empty"} for x in df.columns.levels | ||
): | ||
raise ValueError("parquet must have string column names") | ||
Reviewer comment: Can say something about 'for all values in each level of the MultiIndex'.
Author reply: Thanks @jreback for the suggestion on the exception statement - adding that into my next commit!
||
else: | ||
if df.columns.inferred_type not in {"string", "empty"}: | ||
raise ValueError("parquet must have string column names") | ||
|
||
# index level names must be strings | ||
valid_names = all( | ||
|
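The check in this hunk can be tried outside of pandas internals. The following is a minimal sketch, not the code merged in the PR: the helper name `columns_are_strings` and the sample frames are hypothetical, and it only mirrors the idea of testing `inferred_type` on each level of a `MultiIndex`.

```python
import pandas as pd

# Inferred types accepted by the check in the diff.
_STRING_LIKE = {"string", "empty"}


def columns_are_strings(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if every column label is a string.

    For MultiIndex columns, every level must hold only strings.
    """
    cols = df.columns
    if isinstance(cols, pd.MultiIndex):
        return all(level.inferred_type in _STRING_LIKE for level in cols.levels)
    return cols.inferred_type in _STRING_LIKE


# Second level holds integers, so this frame would be rejected.
mixed = pd.DataFrame(
    [[1.0, 2.0, 3.0]],
    columns=pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),
)

# Both levels hold strings, so this frame would be accepted.
strings = pd.DataFrame(
    [[1.0, 2.0]],
    columns=pd.MultiIndex.from_tuples([("bar", "one"), ("bar", "two")]),
)

# Flat string columns follow the pre-existing branch.
flat = pd.DataFrame({"x": [1], "y": [2]})

print(columns_are_strings(mixed), columns_are_strings(strings), columns_are_strings(flat))
```

Checking `levels` rather than the full tuples keeps the test cheap: each level is an ordinary `Index`, so `inferred_type` applies to it directly.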
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -410,11 +410,24 @@ def test_write_multiindex(self, pa): | |
check_round_trip(df, engine) | ||
|
||
def test_write_column_multiindex(self, engine): | ||
# column multi-index | ||
# Not able to write column multi-indexes with non-string column names. | ||
mi_columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]) | ||
df = pd.DataFrame(np.random.randn(4, 3), columns=mi_columns) | ||
self.check_error_on_write(df, engine, ValueError) | ||
|
||
def test_write_column_multiindex_string(self, pa): | ||
|
||
# Not supported in fastparquet as of 0.1.3 or older pyarrow version | ||
Reviewer comment: Are we above the min pyarrow version?
Author reply: Based on the min versions listed in the pandas dependencies, the min pyarrow version is 0.15, while we are currently at 0.16, at least in the dev environment that I'm working on.
||
engine = pa | ||
|
||
# Write column multi-indexes with string column names | ||
arrays = [ | ||
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], | ||
["one", "two", "one", "two", "one", "two", "one", "two"], | ||
] | ||
df = pd.DataFrame(np.random.randn(8, 8), columns=arrays) | ||
Reviewer comment: Can you add names to the MultiIndex levels? Do these round trip?
Author reply: After adding names to the MultiIndex levels, it looks like they do round trip in pytest.
||
|
||
check_round_trip(df, engine) | ||
|
||
def test_multiindex_with_columns(self, pa): | ||
engine = pa | ||
dates = pd.date_range("01-Jan-2018", "01-Dec-2018", freq="MS") | ||
|
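For reference, the two column layouts exercised by the tests above can be inspected without writing any parquet file. This is a small sketch under stated assumptions: the level names `"first"` and `"second"` are hypothetical (the thread only says that names were added), and no parquet engine is involved.

```python
import pandas as pd

# Columns from test_write_column_multiindex: second level is integers.
int_columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
int_level_types = [level.inferred_type for level in int_columns.levels]

# Columns from test_write_column_multiindex_string, with level names added
# as the reviewer requested. The names themselves are an assumption.
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
str_columns = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
str_level_types = [level.inferred_type for level in str_columns.levels]

print(int_level_types)          # one string level, one integer level
print(str_level_types)          # both levels are strings
print(list(str_columns.names))  # level names carried by the MultiIndex
```

The first layout is the one the test expects to fail with `ValueError`; the second is the one expected to round trip.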
Reviewer comment: I would move this to other enhancements; say this is enabled with pyarrow=.....
Author reply: Moved the whatsnew entry to "other enhancements" - thanks!