-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support #28018
Conversation
…gorical support Fixes pandas-dev#27955
Several questions/thoughts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR!
Looks very good. One suggestion I have is to add a categorical column in the first example in the docs (the df
with columns of all different types that gets created a bit below the place you edited in io.rst).
I think your analysis is correct that it would be duplicative.
Yes, I think that is nice to test. In addition, you could also test if the |
Thanks for the review! I've added another test for null, out-of-order values & unobserved category and updated the doc as well. However when I tried to add another test to check if order is preserved, it seems like it's not preserved. I'm not really sure if this is expected or if I'm missing something:
If it's expected then maybe I can add another test that expects it to still return a categorical data type without order, if needed. |
Indeed, I see the same trying your code. That is certainly not expected I think. Can you open an issue for this? (Arrow uses a JIRA tracker: https://issues.apache.org/jira/projects/ARROW/issues; I can also open an issue if you want) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional changes look good to me, thanks!
Hmm, the failure on Azure is actually related:
It seems that the |
Ah, no, I already see ;). In the |
I've opened the issue here: https://issues.apache.org/jira/browse/ARROW-6302 |
Thanks! |
6267fe0
to
aebfae7
Compare
* master: (40 commits) DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988) Remove Encoding of values in char** For Labels (pandas-dev#27618) TYPING: more type hints for io.formats.printing (pandas-dev#27765) TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051) DOC: Start 0.25.2 (pandas-dev#28111) DOC: Fix docstrings lack of punctuation (pandas-dev#28031) DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082) DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844) BUG: timedelta merge asof with tolerance (pandas-dev#27650) BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814) Preserve index when setting new column on empty dataframe. (pandas-dev#26471) BUG: Fixed groupby quantile for listlike q (pandas-dev#27827) BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074) TST: non-strict xfail for period test (pandas-dev#28072) DOC: Update whatsnew (pandas-dev#28073) CI: disable codecov (pandas-dev#28065) CI: Set SHA for codecov upload (pandas-dev#28067) BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059) CI: Add pip dependence explicitly (pandas-dev#28008) DOC: Change document code prun in a row (pandas-dev#28029) ...
@jorisvandenbossche I've added the test for the ordered flag. Is there any other test that you think we should have for now? |
doc/source/user_guide/io.rst
Outdated
@@ -4702,7 +4702,7 @@ Several caveats. | |||
indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can | |||
force including or omitting indexes with the ``index`` argument, regardless of the underlying engine. | |||
* Index level names, if specified, must be strings. | |||
* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype. | |||
* Categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as ``object`` dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true with both engines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I've just tried categorical dtypes using integer and in pyarrow it gets de-serialized into float. Not sure if it's expected, if someone can confirm it'd be great.
>>> import pandas as pd
>>> from pandas.io.parquet import read_parquet, to_parquet
>>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
>>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True)
>>> df["a"]
0 NaN
1 b
2 c
3 NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]
>>> df["b"]
0 NaN
1 2
2 3
3 NaN
Name: b, dtype: category
Categories (3, int64): [2 < 3 < 4]
>>> df.to_parquet("test_pyarrow.parquet", engine="pyarrow")
>>> actual_pyarrow = df.read_parquet("test_pyarrow.parquet", engine="fastparquet")
>>> actual_pyarrow["a"]
0 NaN
1 b
2 c
3 NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]
>>> actual_pyarrow["b"]
0 NaN
1 2.0
2 3.0
3 NaN
Name: b, dtype: float64
- With fastparquet, both string and non-string types get de-serialized back as category, however it does not preserve order for both, while in pyarrow (for string) we do preserve order. Should we make this clear in the docs?
>>> import pandas as pd
>>> from pandas.io.parquet import read_parquet, to_parquet
>>> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
>>> df["b"] = pd.Categorical([1, 2, 3, 1], categories=[2, 3, 4], ordered=True)
>>> df["a"]
0 NaN
1 b
2 c
3 NaN
Name: a, dtype: category
Categories (3, object): [b < c < d]
>>> df["b"]
0 NaN
1 2
2 3
3 NaN
Name: b, dtype: category
Categories (3, int64): [2 < 3 < 4]
>>> df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
>>> actual_fastparquet = df.read_parquet("test_fastparquet.parquet", engine="fastparquet")
>>> actual_fastparquet["a"]
0 NaN
1 b
2 c
3 NaN
Name: a, dtype: category
Categories (3, object): [b, c, d]
>>> actual_fastparquet["b"]
0 NaN
1 2
2 3
3 NaN
Name: b, dtype: category
Categories (3, int64): [2, 3, 4]
pandas/tests/io/test_parquet.py
Outdated
df["a"] = pd.Categorical(list("abcdef")) | ||
|
||
# test for null, out-of-order values, and unobserved category | ||
dtype = pd.CategoricalDtype(["foo", "bar", "baz"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you write this as
b = pd.Categorical(['bar', 'foo', 'foo', 'bar', None, 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar', 'baz']))
rather than using from_codes
?
pandas/tests/io/test_parquet.py
Outdated
check_round_trip(df, pa) | ||
else: | ||
# de-serialized as object for pyarrow < 0.15 | ||
expected = df.assign( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expected = df.astype(object)
?
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -88,7 +88,7 @@ Categorical | |||
^^^^^^^^^^^ | |||
|
|||
- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`) | |||
- | |||
- Added test to assert roundtripping to parquet with :func:`to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame.to_parquet
For Arrow, that is expected, see https://issues.apache.org/jira/browse/ARROW-6140 (the fact that it is float and not int is because there are missing values)
That could be mentioned yes. |
I've updated the doc to explain more about the differences regarding how the categorical support works between pyarrow and fastparquet |
# de-serialized as object | ||
expected = df.assign(a=df.a.astype(object)) | ||
check_round_trip(df, pa, expected=expected) | ||
if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche this isn't released yet, right? Should we wait to merge until 0.15.0 is released?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this isn't released yet. I can assure that it runs locally for me on Arrow master (if I change the version check to > LooseVersion("0.14.1.dev")
), so I am OK to merge this, but also fine to wait a bit more (0.15.0 will normally happen somewhere end of next week)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Arrow 0.15.0 has been released, can we merge this now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls merge master
@@ -174,6 +174,7 @@ Categorical | |||
- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`) | |||
- Bug in :meth:`Categorical.astype` where ``NaN`` values were handled incorrectly when casting to int (:issue:`28406`) | |||
- :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` now work on unordered categoricals also (:issue:`21667`) | |||
- Added test to assert roundtripping to parquet with :func:`DataFrame.to_parquet` or :func:`read_parquet` will preserve Categorical dtypes for string types (:issue:`27955`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this note is not really needed, but ok
doc/source/user_guide/io.rst
Outdated
@@ -4734,7 +4735,8 @@ See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ an | |||
'd': np.arange(4.0, 7.0, dtype='float64'), | |||
'e': [True, False, True], | |||
'f': pd.date_range('20130101', periods=3), | |||
'g': pd.date_range('20130101', periods=3, tz='US/Eastern')}) | |||
'g': pd.date_range('20130101', periods=3, tz='US/Eastern'), | |||
'h': pd.Categorical(list('abc'))}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add an ordered categorical as well?
* master: (22 commits) DOC: fix PR09,PR08 errors for pandas.Timestamp (pandas-dev#28739) WEB: Add diversity note to team.md (pandas-dev#28630) DOC: Minor fixes in pandas/testing.py docstring. (pandas-dev#28752) TST: port maybe_promote tests from pandas-dev#23982 (pandas-dev#28764) Bugfix/groupby datetime issue (pandas-dev#28569) reenable codecov (pandas-dev#28750) CLN: Centralised _check_percentile (pandas-dev#27584) DEPR: Deprecate Index.set_value (pandas-dev#28621) CLN: Fix typo in contributing.rst (pandas-dev#28761) Fixed docstring errors in pandas.period range and pandas.PeriodIndex (pandas-dev#28756) BUG: Fix TypeError raised in libreduction (pandas-dev#28643) DOC: Pandas.Series.drop docstring PR02 (pandas-dev#27976) (pandas-dev#28742) DOC: Fixed doctring errors PR08, PR09 in pandas.io (pandas-dev#28748) TST: Fix broken test cases where Timedelta/Timestamp raise (pandas-dev#28729) REF: Consolidate alignment calls in DataFrame ops (pandas-dev#28638) BUG: Fix dep generation (pandas-dev#28734) Added doctstring to fixture (pandas-dev#28727) DOC: Fixed PR06 docstrings errors in pandas.timedelta_range (pandas-dev#28719) replaced safe_import with a corresponding test decorator (pandas-dev#28731) BUG: Fix RangeIndex.get_indexer for decreasing RangeIndex (pandas-dev#28680) ...
5ebb491
to
c02059d
Compare
@galuhsahid Thanks a lot for this PR! (and for the fixes on the pyarrow side as well! :-)) |
…g Categorical support (pandas-dev#28018)
…g Categorical support (pandas-dev#28018)
…g Categorical support (pandas-dev#28018)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff