Commit 0ebee19

COMPAT/API: DataFrame.categorize missing values
Closes dask#1565. For compatibility with pandas-dev/pandas#10929, where it was decided that `pd.Categorical(['a', np.nan], categories=['a', np.nan])` should raise a `FutureWarning`. Now we just drop missing values before computing the distincts for the categories.
1 parent 32ad1a0 commit 0ebee19
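
To illustrate the behaviour described in the commit message, here is a minimal pandas-only sketch. It is not part of the commit, and the variable names are chosen for illustration. It shows that deduplicating alone keeps the missing value among the distinct values, while dropping missing values first leaves only real labels, with the missing entry then encoded by the code `-1` rather than by a category.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan])

# Before the change: the distinct values still contain the missing value.
with_nan = s.drop_duplicates()

# After the change: drop missing values first, then deduplicate.
without_nan = s.dropna().drop_duplicates()

# Build the categorical from categories that exclude the missing value.
# The missing entry is kept in the data but is not a category.
cat = pd.Categorical(s, categories=list(without_nan))

print(list(cat.categories))
print(cat.codes.tolist())
```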

3 files changed, with 13 additions and 1 deletion.

dask/dataframe/categorical.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ def categorize(df, columns=None, **kwargs):
     if not isinstance(columns, (list, tuple)):
         columns = [columns]

-    distincts = [df[col].drop_duplicates() for col in columns]
+    distincts = [df[col].dropna().drop_duplicates() for col in columns]
     values = compute(*distincts, **kwargs)

     func = partial(_categorize_block, categories=dict(zip(columns, values)))
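
The hunk above follows a two-pass pattern: first compute the distinct non-missing values per column, then apply those categories to each block. The sketch below mimics that pattern with plain pandas, using a list of DataFrames to stand in for partitions. The function names `distinct_non_missing` and `categorize_partitions` are hypothetical, and the body of dask's `_categorize_block` is not shown in this diff, so this is an assumption about the general approach rather than dask's implementation.

```python
import pandas as pd


def distinct_non_missing(partitions, column):
    """Pass 1: collect distinct values across partitions, ignoring missing ones."""
    parts = [p[column].dropna().drop_duplicates() for p in partitions]
    return pd.concat(parts).drop_duplicates().tolist()


def categorize_partitions(partitions, columns):
    """Pass 2: convert each partition using the shared categories."""
    categories = {col: distinct_non_missing(partitions, col) for col in columns}
    out = []
    for p in partitions:
        p = p.copy()  # leave the caller's frames untouched
        for col, cats in categories.items():
            p[col] = pd.Categorical(p[col], categories=cats)
        out.append(p)
    return out


# Two stand-in "partitions"; the second contains a missing value.
partitions = [
    pd.DataFrame({"A": ["a", "b"]}),
    pd.DataFrame({"A": ["a", float("nan")]}),
]
result = categorize_partitions(partitions, ["A"])
```

Every partition ends up with the same categories, which is what lets the blocks be combined later without re-encoding.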

dask/dataframe/tests/test_categorical.py

Lines changed: 10 additions & 0 deletions
@@ -1,3 +1,5 @@
+import warnings
+
 import pandas as pd
 import pandas.util.testing as tm
 import pytest
@@ -78,3 +80,11 @@ def test_categories():

     df3 = dd.categorical._categorize(categories, df2)
     tm.assert_frame_equal(df, df3)
+
+
+def test_categorize_nan():
+    df = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', float('nan')]}),
+                        npartitions=2)
+    with warnings.catch_warnings(record=True) as record:
+        df.categorize().compute()
+    assert len(record) == 0
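
The new test relies on `warnings.catch_warnings(record=True)` to assert that no warning is emitted. As a possible variation (not what the commit does), the sketch below wraps the same idea in a hypothetical helper, `collect_warnings`, and adds `warnings.simplefilter("always")` so that no active filter can hide a warning from the recorder. The `noisy` and `quiet` functions are made up for the demonstration.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return the warnings recorded while it ran."""
    with warnings.catch_warnings(record=True) as record:
        # Record every warning, regardless of the filters in effect.
        warnings.simplefilter("always")
        func(*args, **kwargs)
    return record


def noisy():
    # Stand-in for code that triggers a FutureWarning.
    warnings.warn("categories contain NaN", FutureWarning)


def quiet():
    # Stand-in for code that emits no warnings.
    return None


noisy_record = collect_warnings(noisy)
quiet_record = collect_warnings(quiet)
```

Here `noisy_record` holds one `FutureWarning` entry and `quiet_record` is empty, which is the condition the committed test checks for.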

docs/source/changelog.rst

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ DataFrame
 - Return a series when functions given to ``dataframe.map_partitions`` return
   scalars (:pr:`1514`)
 - Fix type size inference for series (:pr:`1513`)
+- ``dataframe.DataFrame.categorize`` no longer includes missing values
+  in the ``categories``. This is for compatibility with a `pandas change<https://github.com/pydata/pandas/pull/10929>` (:pr:`1565`)
 - Fix head parser error in ``dataframe.read_csv`` when some lines have quotes
   (:pr:`1495`)
 - Add ``dataframe.reduction`` and ``series.reduction`` methods to apply generic
