BUG: Series.fillna(..., inplace=True) causes subsequent df.sort_values() to crash for categorical dtype #35952

DanielFEvans · 2020-08-28T11:50:26Z

[ x ] I have checked that this issue has not already been reported.
[ x ] I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

# Fillna without inplace - everything works OK.
df = pd.DataFrame({"a": [1, 5, 7, 3, 7], "b": ["a", "b", "c", "d", np.nan]})
df["b"] = df["b"].astype("category")
df["b"] = df["b"].fillna("a")
df = df.sort_values(by="a")
print(df)

# As above, but the fillna() is done inplace
df = pd.DataFrame({"a": [1, 5, 7, 3, 7], "b": ["a", "b", "c", "d", np.nan]})
df["b"] = df["b"].astype("category")
df["b"].fillna("a", inplace=True)
# df looks normal here - the np.nan has been replaced
print(df)

df = df.sort_values(by="a")
# Crash

Traceback:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    df = df.sort_values(by="a")
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/frame.py", line 5301, in sort_values
    indexer, axis=self._get_block_manager_axis(axis), verify=False
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/managers.py", line 1415, in take
    new_axis=new_labels, indexer=indexer, axis=axis, allow_dups=True
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/managers.py", line 1259, in reindex_indexer
    for blk in self.blocks
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/managers.py", line 1259, in <listcomp>
    for blk in self.blocks
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/blocks.py", line 1720, in take_nd
    new_values = self.values.take(indexer, fill_value=fill_value, allow_fill=True)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/series.py", line 829, in take
    nv.validate_take(tuple(), kwargs)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/compat/numpy/function.py", line 68, in __call__
    validate_kwargs(fname, kwargs, self.defaults)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/util/_validators.py", line 148, in validate_kwargs
    _check_for_invalid_keys(fname, kwargs, compat_args)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/util/_validators.py", line 122, in _check_for_invalid_keys
    raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'")
TypeError: take() got an unexpected keyword argument 'allow_fill'

Problem description

If a dataframe contains a categorical column, df["col"].fillna(..., inplace=True) is called on that column, and the dataframe is subsequently sorted by a different column, Pandas raises a TypeError from its internals (see the traceback above).

This error only occurs in Pandas 1.1.0/1.1.1 (latest at time of writing); Pandas 1.0.5 and below behave correctly.

Briefly poking around the internals of Pandas, the difference seems to be that self.values.take in take_nd() is operating on a pandas.core.series.Series object in the latest Pandas (i.e. self.values is a Series), but was formerly working with a pandas.core.arrays.categorical.Categorical object in 1.0.5 and below. As the traceback states, Series does not support allow_fill in its take() function, while most other implementations of take() in Pandas do allow it.
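The difference between the two take() implementations can be checked directly, without going through sort_values(). The snippet below is a minimal sketch rather than a reproduction of the bug itself: the variable names are illustrative, and it assumes a pandas version in which Series.take() validates its extra keyword arguments the way the traceback shows.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"])
ser = pd.Series(cat)

# Categorical.take() accepts allow_fill; with allow_fill=True,
# an index of -1 marks a position to be filled with a missing value.
taken = cat.take([0, -1], allow_fill=True)
print(taken)

# Series.take() does not accept allow_fill and rejects it as an
# unexpected keyword argument.
try:
    ser.take([0, 1], allow_fill=True)
    series_error = None
except TypeError as exc:
    series_error = str(exc)
print(series_error)
```

If the block manager ends up holding a Series where it expects a Categorical, the internal take_nd() call hits the second case, which matches the TypeError in the traceback.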

Expected Output

The two versions of the code both run, producing the sorted, NA-filled output dataframe.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : d9fff27
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.13.1.el7.x86_64
Version : #1 SMP Tue Jun 23 10:32:27 CDT 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : 5.4.3
hypothesis : 5.16.0
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.0

DanielFEvans · 2020-08-28T12:36:58Z

Looking through the backlog again (now that I understand the bug better), I think this is the same underlying issue as #35731. If so, it was fixed yesterday on master.

Edit: Yes, there's a reproduction there with the same fillna example.

DanielFEvans · 2020-08-28T12:40:13Z

Duplicate of #35731

DanielFEvans added the Bug and Needs Triage labels Aug 28, 2020

DanielFEvans closed this as completed Aug 28, 2020
