`fillna` using columns of dtype `category` also fills non-NaN values #26215

lcvriend · 2019-04-26T15:37:25Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

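dct = {
    'A': ['a', 'b', 'c', 'b'],
    'B': ['d', 'e', np.nan, np.nan],
}
df = pd.DataFrame.from_dict(dct).astype('category')
df['C'] = df['B']
# Add A's categories to C so that values taken from A are valid fill values.
# (Assignment is used instead of inplace=True, which later pandas versions removed.)
df['C'] = df['C'].cat.add_categories(df['A'].cat.categories)
df['C'] = df['C'].fillna(df['A'])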

Output

	A	B	C
0	a	d	a
1	b	e	b
2	c	NaN	c
3	b	NaN	b

Problem description

I have two columns, A and B, of dtype category. Column B contains NaN values.
Applying fillna to a copy of B (column C) using A, after adding the categories of A to the categories of C, overwrites ALL values of C with the values of A. In other words, fillna also fills the non-NaN values.

Expected Output

Non-NaN values should not be overwritten:

	A	B	C
0	a	d	d
1	b	e	e
2	c	NaN	c
3	b	NaN	b
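
Until the behaviour is fixed, one possible workaround is to do the fill on object dtype and convert back afterwards. This is a minimal sketch and not something proposed in this thread; it assumes that `fillna` on object dtype only touches missing entries, and that letting `astype('category')` rebuild the categories from the data is acceptable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "b", "c", "b"],
    "B": ["d", "e", np.nan, np.nan],
}).astype("category")

# Workaround (an assumption, not from the thread): fill on object dtype,
# where only the missing entries of B are replaced, then convert back.
filled = df["B"].astype(object).fillna(df["A"].astype(object))
df["C"] = filled.astype("category")

print(df)
```

Note that the categories of C are inferred from the filled values here, so an explicit `add_categories` step is not needed, but any unused categories from A or B are dropped.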

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2019-04-26T16:14:51Z

Makes sense - investigation and PRs would certainly be welcome

arpit1997 · 2019-04-26T21:02:55Z

@WillAyd Can I try this?

gamis · 2019-08-09T14:22:56Z

Confirmed, this is happening for me too.

TomAugspurger · 2019-08-12T19:08:39Z

Slightly simpler example

In [36]: s = pd.Series(pd.Categorical(['d', 'e', None, None], categories=['d', 'e', 'a', 'b', 'c']))

In [37]: s.fillna(pd.Series(['a', 'b', 'c', 'd']))
Out[37]:
0    a
1    b
2    c
3    d
dtype: category
Categories (5, object): [d, e, a, b, c]

@gamis I don't think anyone has opened a PR for this, if you're interested in working on it. Even diagnosing the issue would be valuable.

MarcoGorelli · 2019-08-15T13:09:23Z

Even simpler :)

>>> import pandas as pd
>>> pd.Categorical(['d', 'e', None, None], categories=['d', 'e', 'a', 'b', 'c']).fillna(pd.Series(['a', 'b', 'c', 'd']))
[a, b, c, d]
Categories (5, object): [d, e, a, b, c]
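
The property these examples violate can be written down as a small check: after `fillna`, every value that was non-null before should be unchanged. The sketch below is only an illustration; `only_nulls_filled` is a hypothetical helper name, not part of pandas, and the input reuses the Series example from above.

```python
import pandas as pd


def only_nulls_filled(before: pd.Series, after: pd.Series) -> bool:
    """Return True if every non-null value of `before` is unchanged in `after`."""
    keep = before.notna()
    # Compare on object dtype so differing categories do not matter.
    same = before[keep].astype(object) == after[keep].astype(object)
    return bool(same.all())


s = pd.Series(
    pd.Categorical(["d", "e", None, None], categories=["d", "e", "a", "b", "c"])
)
filled = s.fillna(pd.Series(["a", "b", "c", "d"]))

# False on versions affected by this issue; expected to be True once
# fillna only replaces null values.
print(only_nulls_filled(s, filled))
```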

WillAyd added the Bug, Categorical, and Missing-data labels Apr 26, 2019

TomAugspurger added this to the Contributions Welcome milestone Aug 12, 2019

MarcoGorelli mentioned this issue Aug 15, 2019 in pull request #27932, "BUG: Ensure that fill_na in Categorical only replaces null values" (merged)

TomAugspurger closed this as completed in #27932 Aug 16, 2019
