Commit 6056b38

galuhsahid authored and jorisvandenbossche committed
DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support (#28018)
1 parent ac39473 commit 6056b38

3 files changed, +26 -6 lines changed

doc/source/user_guide/io.rst

+5 -2
@@ -4710,7 +4710,8 @@ Several caveats.
   indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
   force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
 * Index level names, if specified, must be strings.
-* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
+* In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
+* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
 * Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
   on an attempt at serialization.
 

@@ -4734,7 +4735,9 @@ See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ an
                       'd': np.arange(4.0, 7.0, dtype='float64'),
                       'e': [True, False, True],
                       'f': pd.date_range('20130101', periods=3),
-                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
+                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'h': pd.Categorical(list('abc')),
+                      'i': pd.Categorical(list('abc'), ordered=True)})
 
    df
    df.dtypes

doc/source/whatsnew/v1.0.0.rst

+1
@@ -176,6 +176,7 @@ Categorical
 - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
 - Bug in :meth:`Categorical.astype` where ``NaN`` values were handled incorrectly when casting to int (:issue:`28406`)
 - :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` now work on unordered categoricals also (:issue:`21667`)
+- Added test to assert roundtripping to parquet with :func:`DataFrame.to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)
 -
 
 

pandas/tests/io/test_parquet.py

+20 -4
@@ -167,6 +167,7 @@ def compare(repeat):
             df.to_parquet(path, **write_kwargs)
             with catch_warnings(record=True):
                 actual = read_parquet(path, **read_kwargs)
+
             tm.assert_frame_equal(expected, actual, check_names=check_names)
 
     if path is None:
@@ -461,11 +462,26 @@ def test_unsupported(self, pa):
     def test_categorical(self, pa):
 
         # supported in >= 0.7.0
-        df = pd.DataFrame({"a": pd.Categorical(list("abc"))})
+        df = pd.DataFrame()
+        df["a"] = pd.Categorical(list("abcdef"))
 
-        # de-serialized as object
-        expected = df.assign(a=df.a.astype(object))
-        check_round_trip(df, pa, expected=expected)
+        # test for null, out-of-order values, and unobserved category
+        df["b"] = pd.Categorical(
+            ["bar", "foo", "foo", "bar", None, "bar"],
+            dtype=pd.CategoricalDtype(["foo", "bar", "baz"]),
+        )
+
+        # test for ordered flag
+        df["c"] = pd.Categorical(
+            ["a", "b", "c", "a", "c", "b"], categories=["b", "c", "d"], ordered=True
+        )
+
+        if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
+            check_round_trip(df, pa)
+        else:
+            # de-serialized as object for pyarrow < 0.15
+            expected = df.astype(object)
+            check_round_trip(df, pa, expected=expected)
 
     def test_s3_roundtrip(self, df_compat, s3_resource, pa):
         # GH #19134

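The new test branches on whether the installed pyarrow is at least 0.15.0, using `LooseVersion` from `distutils`, a module that was removed from the standard library in Python 3.12. The same gate could be sketched with the standard library alone, as below. `version_tuple` and `expects_categorical_roundtrip` are hypothetical names that exist neither in pandas nor in pyarrow, and the parsing is deliberately simplified: it ignores pre-release suffixes such as `.dev0` or `rc1`.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as ints.

    Hypothetical helper: parsing stops at the first component that does
    not start with a digit, so "0.15.0.dev12" yields (0, 15, 0).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def expects_categorical_roundtrip(pyarrow_version):
    """Mirror the test's gate: categoricals round-trip on pyarrow >= 0.15."""
    # Comparing against (0, 15) lets both "0.15" and "0.15.0" pass.
    return version_tuple(pyarrow_version) >= (0, 15)


for candidate in ["0.14.1", "0.15", "0.15.0", "1.0.0"]:
    print(candidate, expects_categorical_roundtrip(candidate))
```

In a test, the result would select between `check_round_trip(df, pa)` and the `expected = df.astype(object)` fallback shown in the diff.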