Skip to content

BUG: Accept dict or Series in fillna for categorical Series #18293

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Nov 22, 2017

Conversation

reidy-p
Copy link
Contributor

@reidy-p reidy-p commented Nov 14, 2017

@pep8speaks
Copy link

pep8speaks commented Nov 14, 2017

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 20, 2017 at 20:06 Hours UTC

# GH 17033
# Test fillna for a Categorical series
data = ['a', np.nan, 'b', np.nan, np.nan]
s = pd.Series(pd.Categorical(data, categories=['a', 'b']))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series and Categorical have already been imported so you can remove the pd. throughout the tests you added. Instances of this in existing tests should already be covered by #18277.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll change that.

@codecov
Copy link

codecov bot commented Nov 15, 2017

Codecov Report

Merging #18293 into master will decrease coverage by 0.01%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18293      +/-   ##
==========================================
- Coverage   91.36%   91.34%   -0.02%     
==========================================
  Files         164      164              
  Lines       49721    49729       +8     
==========================================
- Hits        45429    45427       -2     
- Misses       4292     4302      +10
Flag Coverage Δ
#multiple 89.14% <94.11%> (-0.01%) ⬇️
#single 39.6% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.73% <100%> (ø) ⬆️
pandas/core/categorical.py 95.66% <93.75%> (-0.09%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d04daf...c484f49. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments. whatsnew for 0.22 (you can put in other enhancements)

@@ -1660,16 +1660,26 @@ def fillna(self, value=None, method=None, limit=None):

else:

if not isna(value) and value not in self.categories:
raise ValueError("fill value must be in categories")
if isinstance(value, ABCSeries):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be a dict as well (which you can simply convert to a series)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A dict has already been converted to a Series in the fillna function in pandas/core/generic.py.

So, as far as I can tell, by the time we reach the fillna function in pandas/core/categorical.py the value is either:

  1. A Series (if the user passed a Series or dict)
  2. A scalar (if the user passed a scalar)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you indicate that in a comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here? otherwise lgtm.

if not isna(value) and value not in self.categories:
raise ValueError("fill value must be in categories")

mask = values == -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to share code with the array case (basically convert the scalar to an array)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about that but couldn't get it to work properly. The problem I was having was that if the user passes a scalar all the NaNs will be filled with that single scalar. But if the user passes a Series (or a dict which is then converted to a Series in the function) the NaNs will be filled with different values according to the index values:

In [1]: data = ['a', np.nan, 'b', np.nan, np.nan]
In [2]: s = pd.Series(pd.Categorical(data, categories=['a', 'b']))

In [3]: s.fillna('a')
Out[3]: 
0    a
1    a
2    b
3    a
4    a
dtype: category
Categories (2, object): [a, b]

In [4]: s.fillna(pd.Series(['a', 'b', 'b'], index=[1, 3, 4])
Out[4]:
0    a
1    a
2    b
3    b
4    b
dtype: category
Categories (2, object): [a, b]

And I was finding it difficult to deal with both cases with the same code. But I can keep thinking about it.

def test_fillna_series_categorical_errormsg(self):
data = ['a', np.nan, 'b', np.nan, np.nan]
s = pd.Series(pd.Categorical(data, categories=['a', 'b']))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests can all go in pandas/tests/series/test_missing.py. also pls move any related fillna tests that are directly on Series/DataFrame as well (to tests/dataframe/test_missing.py), if they are only on Categorical itself then leave here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the new tests relating to Series to pandas/tests/series/test_missing.py and moved the existing test_fillna from pandas/tests/test_categorical.py to pandas/tests/frame/test_missing.py

@jreback jreback added Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Nov 15, 2017
@@ -24,7 +24,7 @@ Other Enhancements

- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`)
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
-
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` (:issue:`17033`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a categorical dtype

mask = values == -1
if mask.any():
values = values.copy()
if isna(value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simplify this to

values[mask] = self.categories.get_indexer([value])[0]

Copy link
Contributor Author

@reidy-p reidy-p Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_indexer seems to cause problems with PeriodIndex. When I replace the code with your suggestion:

In [1]: idx = pd.PeriodIndex(['2011-01', '2011-01', pd.NaT], freq='M')
In [2]: df = pd.DataFrame({'a': pd.Categorical(idx)})

In [3]: df.fillna(value=pd.NaT))
Out[3]:  
pandas._libs.period.IncompatibleFrequency: Input has different freq=None from PeriodIndex(freq=M)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this my be a known bug

@@ -270,6 +270,80 @@ def test_fillna(self):
pd.Timestamp('2012-11-11 00:00:00+01:00')]})
assert_frame_equal(df.fillna(method='bfill'), exp)

def test_na_actions(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment


def f():
df.fillna(value={"cats": 4, "vals": "c"})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is way too long. can you parametrize it? would make it simpler. could also break it into a couple of tests, just use descriptive names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, realized you are just moving this test! ok see what you can do here.


# If value is not a dict or Series it should be a scalar
else:
if not isna(value) and value not in self.categories:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to test if its a is_scalar before you use isna, because if its not then this will raise a different error (add a test for that as well). (you can make this an elif which is_scalar is True and raise in an else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's a good idea.

@jreback
Copy link
Contributor

jreback commented Nov 19, 2017

pls rebase as well (generally anytime you are pushing you should)

else:
values[mask] = self.categories.get_loc(value)
else:
raise TypeError('"value" parameter must be a scalar, dict '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have testing to hit this?

Copy link
Contributor Author

@reidy-p reidy-p Nov 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that I can't think of an example where we would actually hit this specific line. If the user passed a list, tuple, or DataFrame as a value in fillna this will be caught by fillna in generic.py rather than in categorical.py with an identical error message to the one here. So in some ways this exception is kind of redundant because all of the work should be done in generic.py. But it might be useful to keep it anyway.

I have included tests in test_fillna_categorical_raise() in tests/series/test_missing.py to check for the TypeError when the user passes a list, tuple, or DataFrame fill value. But the tests aren't parametrized.

raise ValueError("invalid fill value with a %s" %
type(value))
raise TypeError('"value" parameter must be a scalar, dict '
'or Series, but you passed a '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally we would have a parametriezed test that hits this (with multiple invalid things that should raise)

Copy link
Contributor Author

@reidy-p reidy-p Nov 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as I said above, I have included tests in test_fillna_categorical_raise() in tests/series/test_missing.py to check for the TypeError when the user passes a list, tuple, or DataFrame fill value. But the tests aren't parametrized.

@@ -26,6 +26,7 @@ Other Enhancements
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`)
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`)
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`)
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` (:issue:`17033`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add another note in api breaking changes. note that the previous exception was a ValueError, now its a TypeError (good change). use this PR number for that note.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor change, rebase


.. _whatsnew_0220.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-
- :func:`Series.fillna` now raises a ``TypeError`` instead of a ``ValueError`` when passed a list, tuple or DataFrame as a ``value`` (`PR18293 <https://github.com/pandas-dev/pandas/pull/18293>`__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use the PR number and use issue (it works0

@jreback jreback added this to the 0.22.0 milestone Nov 22, 2017
@jreback jreback merged commit 103ea6f into pandas-dev:master Nov 22, 2017
@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

thanks @reidy-p

@reidy-p reidy-p deleted the fillna_cat_series branch November 27, 2017 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.fillna() crashes on Categorical series if value is a series
4 participants