Skip to content

BUG: Categorical data fails to load from hdf when all columns are NaN #18652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Dec 10, 2017
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ I/O
- Bug in :meth:`DataFrame.to_msgpack` when serializing data of the numpy.bool_ datatype (:issue:`18390`)
- Bug in :func:`read_json` not decoding when reading line deliminted JSON from S3 (:issue:`17200`)
- Bug in :func:`pandas.io.json.json_normalize` to avoid modification of ``meta`` (:issue:`18610`)
- Bug when storing NaN-only categorical columns in hdf5 store (:issue:`18413`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug when reading ......in a :class:`HDFStore`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done.


Plotting
^^^^^^^^
Expand Down
15 changes: 11 additions & 4 deletions pandas/io/pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -2137,10 +2137,17 @@ def convert(self, values, nan_rep, encoding):
# if we have stored a NaN in the categories
# then strip it; in theory we could have BOTH
# -1s in the codes and nulls :<
mask = isna(categories)
if mask.any():
categories = categories[~mask]
codes[codes != -1] -= mask.astype(int).cumsum().values
if categories is None:
# Handle case of NaN-only categorical columns in which case
# the categories are an empty array; when this is stored,
# pytables cannot write a zero-len array, so on readback
# the categories would be None and `read_hdf()` would fail.
categories = Index([], dtype=np.float64)
else:
mask = isna(categories)
if mask.any():
categories = categories[~mask]
codes[codes != -1] -= mask.astype(int).cumsum().values

self.data = Categorical.from_codes(codes,
categories=categories,
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/io/test_pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -4928,6 +4928,23 @@ def test_categorical_conversion(self):
result = read_hdf(path, 'df', where='obsids=B')
tm.assert_frame_equal(result, expected)

def test_categorical_nan_only_columns(self):
# GH18413
# Check that read_hdf with categorical columns with NaN-only values can
# be read back.
df = pd.DataFrame({
'a': ['a', 'b', 'c', np.nan],
'b': [np.nan, np.nan, np.nan, np.nan],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add another column with an array like pd.Series([None]* 3, dtype=object). this might fail your test because the original array was an all-null object type (and no float). but let's see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, seems to work fine.

'c': [1, 2, 3, 4]
})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')
expected = df
with ensure_clean_path(self.path) as path:
df.to_hdf(path, 'df', format='table', data_columns=True)
result = read_hdf(path, 'df')
tm.assert_frame_equal(result, expected)

def test_duplicate_column_name(self):
df = DataFrame(columns=["a", "a"], data=[[0, 0]])

Expand Down