Skip to content

BUG: list-like to_replace on Categorical.replace is ignored or crash #31734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 17, 2020

Conversation

JustinZhengBC
Copy link
Contributor

@JustinZhengBC JustinZhengBC commented Feb 6, 2020

Covers the case where to_replace is a list-like and value is a string. Other cases, like "to_replace is dict and value is None", or "to_replace and value are both lists" are handled earlier in generic.py

@@ -0,0 +1,44 @@
import pandas as pd
import pandas.testing as tm
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JustinZhengBC

This should be

import pandas._testing as tm

@dsaxton
Copy link
Member

dsaxton commented Feb 6, 2020

How does this behave when to_replace is already a dict? On master the replacement seems to work but the data type changes:

In [5]: s                                                                                                                                                             
Out[5]: 
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

In [6]: s.replace({0: 1})                                                                                                                                             
Out[6]: 
0    1
1    1
2    2
dtype: int64

@JustinZhengBC
Copy link
Contributor Author

@dsaxton this is because in NDFrame (generics.py) it chooses to not dispatch the dict case to the self._data.replace, but self._data.replace_list which takes a more complex path to iterate over its blocks and would need more drastic refactoring (it pre-computes masks to avoid repeated replacement if the same value appears more than once in to_replace or value). I think that is outside the scope of the current ticket

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 8, 2020

Thanks @JustinZhengBC !

Overall looks good

Comment on lines 11 to 14
(1, 2, [2, 2, 3], True),
(1, 4, [4, 2, 3], True),
(4, 1, [1, 2, 3], True),
(5, 6, [1, 2, 3], True),
Copy link
Member

@MarcoGorelli MarcoGorelli Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the case when to_replace and value are different dtypes need testing? e.g. with to_replace=1, value="4"

EDIT

actually, maybe that's outside the scope of the issue this PR is addressing

Copy link
Contributor Author

@JustinZhengBC JustinZhengBC Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's doable, with limited dtype-checking as a full check would involve the category index which causes a crash on comparison on mixed dtypes

([1, 4], [5, 2], [5, 2, 3], False),
],
)
def test_replace(to_replace, value, expected, check_types):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the issue number here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -1104,6 +1105,8 @@ def assert_series_equal(
Compare datetime-like which is comparable ignoring dtype.
check_categorical : bool, default True
Whether to compare internal Categorical exactly.
check_category_order : bool, default True
Whether to compare category order internal Categoricals
Copy link
Member

@MarcoGorelli MarcoGorelli Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a missing "of" here? Should it be

Whether to compare category order of internal Categoricals

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 2415 to 2417
def replace_list(self, to_replace, value, inplace: bool = False):
return self.replace(to_replace, value, inplace)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this and the change below it were experiments in addressing the dict case and list-list case for categoricals, which turned out to be more complex than expected and are not intended to be in this pr

if is_list_like(to_replace):
if is_list_like(value):
if len(to_replace) != len(value):
raise ValueError("to_replace and value must be same length!")
Copy link
Member

@MarcoGorelli MarcoGorelli Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this ValueError be a bit more precise? As in, include something about the lengths of to_replace and value which were received

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this and the change above it were experiments in addressing the dict case and list-list case for categoricals, which turned out to be more complex than expected and are not intended to be in this pr

@pep8speaks
Copy link

pep8speaks commented Feb 8, 2020

Hello @JustinZhengBC! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-17 00:22:52 UTC

([1, 2, "3"], "5", ["5", "5", 3], True, False),
],
)
# GH 31720
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue number usually goes inside the test:

def test_me(*args):
    # GH 1234

Probably not worth changing on its own, but if you get any other comments requesting changes, perhaps this is worth doing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -99,7 +99,7 @@ Bug fixes
Categorical
^^^^^^^^^^^

-
- Bug in :class:`Categorical` that would ignore or crash when calling :meth:`Series.replace` with a list-like ``to_replace`` (:issue:`31720`)
Copy link
Member

@MarcoGorelli MarcoGorelli Feb 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is tagged with v1.0.2 milestone, I think this should go in that whatsnew file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. move the note to 1.0.2

@jreback jreback added this to the 1.0.2 milestone Feb 9, 2020
@jreback jreback added Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version labels Feb 9, 2020
@JustinZhengBC
Copy link
Contributor Author

@jreback I've moved the whatsnew and addressed the other feedback

@jreback
Copy link
Contributor

jreback commented Feb 12, 2020

also pls merge master

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor comments, ping on green.

@JustinZhengBC
Copy link
Contributor Author

@jreback green

@jreback jreback merged commit 8444453 into pandas-dev:master Feb 17, 2020
@jreback
Copy link
Contributor

jreback commented Feb 17, 2020

thanks @JustinZhengBC very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series.replace fails on categoricals with list
5 participants