DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018

Merged: 17 commits, Oct 4, 2019
2 changes: 1 addition & 1 deletion doc/source/user_guide/io.rst
@@ -4700,7 +4700,7 @@ Several caveats.
   indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
   force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
 * Index level names, if specified, must be strings.
-* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
+* Categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as ``object`` dtype.
Contributor:
Is this true with both engines?

Contributor Author:
  • I've just tried categorical dtypes using integers, and with pyarrow they get de-serialized into float. Not sure if that's expected; if someone can confirm, that'd be great.
>>> import pandas as pd

>>> df = pd.DataFrame()
>>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
>>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True)

>>> df["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]

>>> df["b"]
0    NaN
1      2
2      3
3    NaN
Name: b, dtype: category
Categories (3, int64): [2 < 3 < 4]

>>> df.to_parquet("test_pyarrow.parquet", engine="pyarrow")
>>> actual_pyarrow = df.read_parquet("test_pyarrow.parquet", engine="fastparquet")

>>> actual_pyarrow["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]

>>> actual_pyarrow["b"]
0    NaN
1    2.0
2    3.0
3    NaN
Name: b, dtype: float64
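My guess is that the float here is not parquet-specific: the values falling outside the declared categories become NaN, and an integer column containing NaN is upcast to float64 on de-serialization.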
  • With fastparquet, both string and non-string types get de-serialized back as category; however, it does not preserve the ordering for either, while pyarrow (for strings) does preserve it. Should we make this clear in the docs?
>>> import pandas as pd

>>> df = pd.DataFrame()
>>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
>>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True)

>>> df["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]

>>> df["b"]
0    NaN
1      2
2      3
3    NaN
Name: b, dtype: category
Categories (3, int64): [2 < 3 < 4]

>>> df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
>>> actual_fastparquet = df.read_parquet("test_fastparquet.parquet", engine="fastparquet")

>>> actual_fastparquet["a"]
0    NaN
1      b
2      c
3    NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]

>>> actual_fastparquet["b"]
0    NaN
1      2
2      3
3    NaN
Name: b, dtype: category
Categories (3, int64): [2, 3, 4]

 * Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
   on an attempt at serialization.
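As an aside on the ``index`` caveat above, here is a minimal sketch of forcing the index on or off with the ``index`` argument (the file names are purely illustrative):

>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
>>> df.to_parquet("no_index.parquet", index=False)   # omit the index column
>>> df.to_parquet("with_index.parquet", index=True)  # always write the index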

2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
@@ -88,7 +88,7 @@ Categorical
^^^^^^^^^^^

 - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
--
+- Added test to assert roundtripping to parquet with :func:`to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`)
Contributor:
DataFrame.to_parquet

-


11 changes: 8 additions & 3 deletions pandas/tests/io/test_parquet.py
@@ -1,5 +1,6 @@
""" test parquet compat """
import datetime
from distutils.version import LooseVersion
import os
from warnings import catch_warnings

@@ -166,6 +167,7 @@ def compare(repeat):
             df.to_parquet(path, **write_kwargs)
             with catch_warnings(record=True):
                 actual = read_parquet(path, **read_kwargs)
+
                 tm.assert_frame_equal(expected, actual, check_names=check_names)

if path is None:
@@ -453,9 +455,12 @@ def test_categorical(self, pa):
         # supported in >= 0.7.0
         df = pd.DataFrame({"a": pd.Categorical(list("abc"))})

-        # de-serialized as object
-        expected = df.assign(a=df.a.astype(object))
-        check_round_trip(df, pa, expected=expected)
+        if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
Contributor:
@jorisvandenbossche this isn't released yet, right? Should we wait to merge until 0.15.0 is released?

Member:
Yes, this isn't released yet. I can confirm that it runs locally for me on Arrow master (if I change the version check to > LooseVersion("0.14.1.dev")), so I am OK to merge this, but I'm also fine waiting a bit more (0.15.0 should normally happen somewhere around the end of next week).

Contributor Author:
Since Arrow 0.15.0 has been released, can we merge this now?

+            check_round_trip(df, pa)
+        else:
+            # de-serialized as object for pyarrow < 0.15
+            expected = df.assign(a=df.a.astype(object))
+            check_round_trip(df, pa, expected=expected)

     def test_s3_roundtrip(self, df_compat, s3_resource, pa):
         # GH #19134
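For context, the version gate added above could also be written as a pytest skip; here is a minimal, self-contained sketch (the test name and the use of the tmp_path fixture are illustrative, not part of this PR):

from distutils.version import LooseVersion

import pandas as pd
import pyarrow
import pytest


@pytest.mark.skipif(
    LooseVersion(pyarrow.__version__) < LooseVersion("0.15.0"),
    reason="pyarrow >= 0.15.0 preserves Categorical dtypes on round trip",
)
def test_categorical_round_trip(tmp_path):
    # Write with pyarrow and read back; the Categorical dtype should survive.
    df = pd.DataFrame({"a": pd.Categorical(list("abc"))})
    path = str(tmp_path / "cat.parquet")
    df.to_parquet(path, engine="pyarrow")
    result = pd.read_parquet(path, engine="pyarrow")
    pd.testing.assert_frame_equal(result, df)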