
BUG: Series.fillna(..., inplace=True) causes subsequent df.sort_values() to crash for categorical dtype #35952


Closed
1 task
DanielFEvans opened this issue Aug 28, 2020 · 2 comments
Labels: Bug; Needs Triage (issue that has not been reviewed by a pandas team member)

DanielFEvans commented Aug 28, 2020

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

# Fillna without inplace - everything works OK.
df = pd.DataFrame({"a": [1, 5, 7, 3, 7], "b": ["a", "b", "c", "d", np.nan]})
df["b"] = df["b"].astype("category")
df["b"] = df["b"].fillna("a")
df = df.sort_values(by="a")
print(df)

# As above, but the fillna() is done inplace
df = pd.DataFrame({"a": [1, 5, 7, 3, 7], "b": ["a", "b", "c", "d", np.nan]})
df["b"] = df["b"].astype("category")
df["b"].fillna("a", inplace=True)
# df looks normal here - the np.nan has been replaced
print(df)

df = df.sort_values(by="a")
# Crashes here with a TypeError raised from take()

Traceback:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    df = df.sort_values(by="a")
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/frame.py", line 5301, in sort_values
    indexer, axis=self._get_block_manager_axis(axis), verify=False
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/managers.py", line 1415, in take
    new_axis=new_labels, indexer=indexer, axis=axis, allow_dups=True
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/managers.py", line 1259, in reindex_indexer
    for blk in self.blocks
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/managers.py", line 1259, in <listcomp>
    for blk in self.blocks
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/internals/blocks.py", line 1720, in take_nd
    new_values = self.values.take(indexer, fill_value=fill_value, allow_fill=True)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/core/series.py", line 829, in take
    nv.validate_take(tuple(), kwargs)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/compat/numpy/function.py", line 68, in __call__
    validate_kwargs(fname, kwargs, self.defaults)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/util/_validators.py", line 148, in validate_kwargs
    _check_for_invalid_keys(fname, kwargs, compat_args)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pandas/util/_validators.py", line 122, in _check_for_invalid_keys
    raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'")
TypeError: take() got an unexpected keyword argument 'allow_fill'

Problem description

If a DataFrame contains a categorical column and df["col"].fillna(..., inplace=True) is called on that column, subsequently sorting the DataFrame by a different column crashes with an internal TypeError.

This error only occurs in pandas 1.1.0 and 1.1.1 (the latest at the time of writing); pandas 1.0.5 and below behave correctly.

From a brief look at the pandas internals, the difference seems to be that self.values.take in take_nd() operates on a pandas.core.series.Series object in the latest pandas (i.e. self.values is a Series), whereas in 1.0.5 and below it operated on a pandas.core.arrays.categorical.Categorical object. As the traceback states, Series.take() does not accept allow_fill, while most other implementations of take() in pandas do.
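
The mismatch between the two take() implementations can be seen in isolation. This is only a minimal sketch, assuming a pandas version in which Series.take() validates its keyword arguments the way the traceback shows. The variable names are illustrative and not taken from the pandas source.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"])
ser = pd.Series(cat)

# Categorical.take() accepts allow_fill; an index of -1 then means "missing".
taken = cat.take([0, -1], allow_fill=True)

# Series.take() rejects the same keyword, matching the traceback above.
try:
    ser.take([0, 1], allow_fill=True)
    series_error = None
except TypeError as exc:
    series_error = str(exc)

print(taken)
print(series_error)
```

If a block ends up holding a Series where a Categorical is expected, the allow_fill call in take_nd() fails in the same way.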

Expected Output

Both versions of the code should run, producing the sorted, NA-filled output DataFrame:

   a  b
0  1  a
3  3  d
1  5  b
2  7  c
4  7  a
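
The expected output can be checked with a short script. This sketch uses the non-inplace form from the first half of the code sample, which also serves as a workaround on affected versions. Passing kind="stable" is an addition not present in the original sample, made so that the order of the two rows with a == 7 is deterministic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 7, 3, 7], "b": ["a", "b", "c", "d", np.nan]})
df["b"] = df["b"].astype("category")

# Assign the result back instead of using inplace=True.
df["b"] = df["b"].fillna("a")

# kind="stable" is an assumption added here to keep tied rows in input order.
result = df.sort_values(by="a", kind="stable")
print(result)
```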

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.13.1.el7.x86_64
Version : #1 SMP Tue Jun 23 10:32:27 CDT 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : 5.4.3
hypothesis : 5.16.0
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : 3.5.2
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.0

DanielFEvans added the Bug and Needs Triage labels on Aug 28, 2020
DanielFEvans commented Aug 28, 2020

On staring at the backlog again (now knowing better what the bug is), I think this is the same underlying issue as #35731. In that case, it was fixed yesterday in Master.

Edit: Yes, there's a reproduction there with the same fillna example.

Duplicate of #35731
