-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Accept dict or Series in fillna for categorical Series #18293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @reidy-p! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on November 20, 2017 at 20:06 Hours UTC |
pandas/tests/test_categorical.py
Outdated
# GH 17033 | ||
# Test fillna for a Categorical series | ||
data = ['a', np.nan, 'b', np.nan, np.nan] | ||
s = pd.Series(pd.Categorical(data, categories=['a', 'b'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series
and Categorical
have already been imported so you can remove the pd.
throughout the tests you added. Instances of this in existing tests should already be covered by #18277.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll change that.
Codecov Report
@@ Coverage Diff @@
## master #18293 +/- ##
==========================================
- Coverage 91.36% 91.34% -0.02%
==========================================
Files 164 164
Lines 49721 49729 +8
==========================================
- Hits 45429 45427 -2
- Misses 4292 4302 +10
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments. whatsnew for 0.22 (you can put in other enhancements)
pandas/core/categorical.py
Outdated
@@ -1660,16 +1660,26 @@ def fillna(self, value=None, method=None, limit=None): | |||
|
|||
else: | |||
|
|||
if not isna(value) and value not in self.categories: | |||
raise ValueError("fill value must be in categories") | |||
if isinstance(value, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be a dict as well (which you can simply convert to a series)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dict has already been converted to a Series in the fillna
function in pandas/core/generic.py
.
So, as far as I can tell, by the time we reach the fillna
function in pandas/core/categorical.py
the value
is either:
- A Series (if the user passed a Series or dict)
- A scalar (if the user passed a scalar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, can you indicate that in a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here? otherwise lgtm.
pandas/core/categorical.py
Outdated
if not isna(value) and value not in self.categories: | ||
raise ValueError("fill value must be in categories") | ||
|
||
mask = values == -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might be able to share code with the array case (basically convert the scalar to an array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking about that but couldn't get it to work properly. The problem I was having was that if the user passes a scalar all the NaNs will be filled with that single scalar. But if the user passes a Series (or a dict which is then converted to a Series in the function) the NaNs will be filled with different values according to the index values:
In [1]: data = ['a', np.nan, 'b', np.nan, np.nan]
In [2]: s = pd.Series(pd.Categorical(data, categories=['a', 'b']))
In [3]: s.fillna('a')
Out[3]:
0 a
1 a
2 b
3 a
4 a
dtype: category
Categories (2, object): [a, b]
In [4]: s.fillna(pd.Series(['a', 'b', 'b'], index=[1, 3, 4])
Out[4]:
0 a
1 a
2 b
3 b
4 b
dtype: category
Categories (2, object): [a, b]
And I was finding it difficult to deal with both cases with the same code. But I can keep thinking about it.
pandas/tests/test_categorical.py
Outdated
def test_fillna_series_categorical_errormsg(self): | ||
data = ['a', np.nan, 'b', np.nan, np.nan] | ||
s = pd.Series(pd.Categorical(data, categories=['a', 'b'])) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tests can all go in pandas/tests/series/test_missing.py
. also pls move any related fillna tests that are directly on Series/DataFrame as well (to tests/dataframe/test_missing.py), if they are only on Categorical itself then leave here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the new tests relating to Series to pandas/tests/series/test_missing.py
and moved the existing test_fillna
from pandas/tests/test_categorical.py
to pandas/tests/frame/test_missing.py
083efef
to
b1ac2d4
Compare
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -24,7 +24,7 @@ Other Enhancements | |||
|
|||
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`) | |||
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | |||
- | |||
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` (:issue:`17033`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for a categorical dtype
mask = values == -1 | ||
if mask.any(): | ||
values = values.copy() | ||
if isna(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simplify this to
values[mask] = self.categories.get_indexer([value])[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_indexer
seems to cause problems with PeriodIndex
. When I replace the code with your suggestion:
In [1]: idx = pd.PeriodIndex(['2011-01', '2011-01', pd.NaT], freq='M')
In [2]: df = pd.DataFrame({'a': pd.Categorical(idx)})
In [3]: df.fillna(value=pd.NaT))
Out[3]:
pandas._libs.period.IncompatibleFrequency: Input has different freq=None from PeriodIndex(freq=M)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this my be a known bug
pandas/tests/frame/test_missing.py
Outdated
@@ -270,6 +270,80 @@ def test_fillna(self): | |||
pd.Timestamp('2012-11-11 00:00:00+01:00')]}) | |||
assert_frame_equal(df.fillna(method='bfill'), exp) | |||
|
|||
def test_na_actions(self): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment
pandas/tests/frame/test_missing.py
Outdated
|
||
def f(): | ||
df.fillna(value={"cats": 4, "vals": "c"}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is way too long. can you parametrize it? would make it simpler. could also break it into a couple of tests, just use descriptive names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, realized you are just moving this test! ok see what you can do here.
pandas/core/categorical.py
Outdated
|
||
# If value is not a dict or Series it should be a scalar | ||
else: | ||
if not isna(value) and value not in self.categories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to test if its a is_scalar
before you use isna
, because if its not then this will raise a different error (add a test for that as well). (you can make this an elif which is_scalar
is True and raise in an else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's a good idea.
pls rebase as well (generally anytime you are pushing you should) |
359e36a
to
0e2e8bc
Compare
else: | ||
values[mask] = self.categories.get_loc(value) | ||
else: | ||
raise TypeError('"value" parameter must be a scalar, dict ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have testing to hit this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that I can't think of an example where we would actually hit this specific line. If the user passed a list, tuple, or DataFrame as a value in fillna
this will be caught by fillna
in generic.py
rather than in categorical.py
with an identical error message to the one here. So in some ways this exception is kind of redundant because all of the work should be done in generic.py
. But it might be useful to keep it anyway.
I have included tests in test_fillna_categorical_raise()
in tests/series/test_missing.py
to check for the TypeError
when the user passes a list, tuple, or DataFrame fill value
. But the tests aren't parametrized.
raise ValueError("invalid fill value with a %s" % | ||
type(value)) | ||
raise TypeError('"value" parameter must be a scalar, dict ' | ||
'or Series, but you passed a ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally we would have a parametriezed test that hits this (with multiple invalid things that should raise)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as I said above, I have included tests in test_fillna_categorical_raise()
in tests/series/test_missing.py
to check for the TypeError
when the user passes a list, tuple, or DataFrame fill value
. But the tests aren't parametrized.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -26,6 +26,7 @@ Other Enhancements | |||
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_index()`` to determine whether the index will be rendered in ouptut (:issue:`14194`) | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`) | |||
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` (:issue:`17033`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add another note in api breaking changes. note that the previous exception was a ValueError
, now its a TypeError
(good change). use this PR number for that note.
eb98c94
to
4d2997b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor change, rebase
doc/source/whatsnew/v0.22.0.txt
Outdated
|
||
.. _whatsnew_0220.api_breaking: | ||
|
||
Backwards incompatible API changes | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
- | ||
- :func:`Series.fillna` now raises a ``TypeError`` instead of a ``ValueError`` when passed a list, tuple or DataFrame as a ``value`` (`PR18293 <https://github.com/pandas-dev/pandas/pull/18293>`__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use the PR number and use issue (it works0
9638c88
to
c484f49
Compare
thanks @reidy-p |
git diff upstream/master -u -- "*.py" | flake8 --diff