BUG: Remove unnecessary validation to non-string columns/index in df.to_parquet #52036

Merged
merged 4 commits into from
Mar 17, 2023
6 changes: 5 additions & 1 deletion pandas/io/parquet.py
@@ -163,7 +163,11 @@ def validate_dataframe(df: DataFrame) -> None:
each level of the MultiIndex
"""
)
-        elif df.columns.inferred_type not in {"string", "empty"}:
+        elif not df.columns.empty and df.columns.inferred_type not in {
+            "string",
+            "empty",
+        }:
+            # GH 52034: RangeIndex.inferred_dtype is always "integer" if empty
Member

This might be too broad? An empty Index with int dtype should probably still raise?

E.g. Index([1], dtype="int64") should behave the same as Index([], dtype="int64")?
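
The concern above can be made concrete with a small model of the check before and after this change. This is only a sketch: the two helper functions are hypothetical stand-ins for the `elif` condition, and the `(is_empty, inferred_type)` pairs are hard-coded assumptions about what pandas reports for each kind of columns index, not values computed by pandas.

```python
# Sketch of the column-name check before and after this change.
# Assumption: the (is_empty, inferred_type) pairs below mirror what pandas
# reports for each columns Index; they are hard-coded, not computed.

ALLOWED = {"string", "empty"}


def old_check_raises(is_empty: bool, inferred_type: str) -> bool:
    """Removed condition: raise unless the inferred type is allowed."""
    return inferred_type not in ALLOWED


def new_check_raises(is_empty: bool, inferred_type: str) -> bool:
    """Added condition: empty columns are never rejected."""
    return (not is_empty) and inferred_type not in ALLOWED


cases = {
    "Index(['a', 'b'])": (False, "string"),
    "Index([1], dtype='int64')": (False, "integer"),
    "Index([], dtype='int64')": (True, "integer"),
    "RangeIndex(0)": (True, "integer"),
    "Index([], dtype=object)": (True, "empty"),
}

# Map each label to (old check raises, new check raises).
results = {
    label: (old_check_raises(*args), new_check_raises(*args))
    for label, args in cases.items()
}

for label, (old, new) in results.items():
    print(f"{label:28} old raises: {old!s:5}  new raises: {new}")
```

Under these assumptions the new condition stops raising for the empty `RangeIndex` (the case in GH 52034), but it also stops raising for an empty `int64` index while still raising for the non-empty one, which is the asymmetry the comment points out.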

Member Author

Hmm, actually, do you know why we have this limitation on string column names? pyarrow doesn't seem to have this limitation.

In [25]: tb = pa.Table.from_pandas(pd.DataFrame({1: [2]}))

In [26]: tb
Out[26]:
pyarrow.Table
1: int64
----
1: [[2]]

In [27]: pq.write_table(tb, "abc")

In [28]: pq.read_table("abc")
Out[28]:
pyarrow.Table
1: int64
----
1: [[2]]

Member

This could be an old artefact (same as for read_orc, which we removed a couple of days ago), so I'd be OK with getting rid of this if it's not necessary.

raise ValueError("parquet must have string column names")

# index level names must be strings
14 changes: 14 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -1041,6 +1041,11 @@ def test_read_dtype_backend_pyarrow_config_index(self, pa):
expected=expected,
)

+    def test_empty_columns(self, pa):
+        # GH 52034
+        df = pd.DataFrame(index=pd.Index(["a", "b", "c"], name="custom name"))
+        check_round_trip(df, pa)


class TestParquetFastParquet(Base):
def test_basic(self, fp, df_full):
@@ -1281,3 +1286,12 @@ def test_invalid_dtype_backend(self, engine):
df.to_parquet(path)
with pytest.raises(ValueError, match=msg):
read_parquet(path, dtype_backend="numpy")

+    def test_empty_columns(self, fp):
+        # GH 52034
+        df = pd.DataFrame(index=pd.Index(["a", "b", "c"], name="custom name"))
+        expected = pd.DataFrame(
+            columns=pd.Index([], dtype=object),
+            index=pd.Index(["a", "b", "c"], name="custom name"),
+        )
+        check_round_trip(df, fp, expected=expected)