REGR: Replacing a category with itself replaces it with np.nan #33288

jtilly · 2020-04-04T18:28:55Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
pd.Series(["a", "b"]).astype("category").replace("a", "a")
# 0    NaN
# 1      b
# dtype: category
# Categories (1, object): [b]

Operating on the categorical array directly, i.e. pd.Categorical(["a", "b"]).replace("a", "a") yields the same result.

Problem description

Replacing a category with itself replaces it with np.nan. This problem was introduced with 1.0.0.

Expected Output

I would have expected the behavior from 0.25.3:

pd.Series(["a", "b"]).astype("category").replace("a", "a")
# 0    a
# 1    b
# dtype: category
# Categories (2, object): [a, b]

Note that if we work with lists, we get

pd.Series(["a", "b"]).astype("category").replace(["a"], ["a"])
# 0    a
# 1    b
# dtype: object

which is also not what I would expect, because we're now losing the dtype. This behavior has been described elsewhere (e.g. #31734 (comment)) and it's consistent with 0.25.3.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.4.0-176-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.3.post20200325
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.13.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None


dsaxton · 2020-04-04T20:02:37Z

Thanks. It looks like we're just removing the original value from the categories, which makes sense except in the precise case where the original and replacement are equal.
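To make that diagnosis concrete, here is a minimal sketch of the failure mode using a toy categorical (a list of categories plus integer codes). It is not the pandas implementation: `remove_category`, `replace_naive`, `replace_guarded`, `decode`, and the `MISSING` sentinel are all hypothetical names made up for illustration, and the guard shown is only one plausible way to handle the equal case.

```python
MISSING = -1  # toy sentinel: code for a missing value (hypothetical model, not pandas internals)


def remove_category(categories, codes, idx):
    """Drop categories[idx]; any value still pointing at it becomes missing."""
    kept = categories[:idx] + categories[idx + 1:]
    recoded = [MISSING if c == idx else (c - 1 if c > idx else c) for c in codes]
    return kept, recoded


def replace_naive(categories, codes, to_replace, value):
    """Always removes the original category after recoding, even if value == to_replace."""
    categories, codes = list(categories), list(codes)
    if to_replace not in categories:
        return categories, codes
    old = categories.index(to_replace)
    if value in categories:
        new = categories.index(value)
        codes = [new if c == old else c for c in codes]
        # When old == new, the codes were not moved anywhere, so this wipes them out.
        return remove_category(categories, codes, old)
    categories[old] = value  # value is new: rename the category in place
    return categories, codes


def replace_guarded(categories, codes, to_replace, value):
    """Same logic, but replacing a value with itself is treated as a no-op."""
    if to_replace == value:
        return list(categories), list(codes)
    return replace_naive(categories, codes, to_replace, value)


def decode(categories, codes):
    """Turn codes back into values, using None for missing."""
    return [None if c == MISSING else categories[c] for c in codes]


print(decode(*replace_naive(["a", "b"], [0, 1], "a", "a")))    # [None, 'b']
print(decode(*replace_guarded(["a", "b"], [0, 1], "a", "a")))  # ['a', 'b']
```

In the naive version, the category "a" is removed even though every "a" value still points at it, which mirrors the NaN and the single remaining category in the report.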

simonjayhawkins · 2020-04-05T11:00:35Z

> Replacing a category with itself replaces it with np.nan. This problem was introduced with 1.0.0.

regression in #27026 (i.e. 1.0.0)

fb08cee is the first bad commit
commit fb08cee
Author: Justin Zheng
Date: Sat Nov 16 13:54:01 2019 -0800

BUG-26988 implement replace for categorical blocks (#27026)

jtilly added the Bug and Needs Triage labels Apr 4, 2020

dsaxton added the Categorical label and removed the Needs Triage label Apr 4, 2020

dsaxton mentioned this issue Apr 4, 2020

REGR: Fix bug when replacing categorical value with self #33292 (merged)

simonjayhawkins changed the title ~~BUG: Replacing a category with itself replaces it with np.nan~~ REGR: Replacing a category with itself replaces it with np.nan Apr 5, 2020

simonjayhawkins added the Regression label and removed the Bug label Apr 5, 2020

jreback added this to the 1.1 milestone Apr 5, 2020

jreback closed this as completed in #33292 Apr 6, 2020

simonjayhawkins mentioned this issue May 5, 2020

Backport PR #33292 on branch 1.0.x (REGR: Fix bug when replacing cate… #34004 (merged)

simonjayhawkins modified the milestones: 1.1, 1.0.4 May 26, 2020
