-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: maximum recursion error when replacing empty lists #22083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
621485b
to
c97b2ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the example as a test
pandas/core/internals/blocks.py
Outdated
@@ -1000,8 +999,8 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0, | |||
if not (mask.shape[-1] == len(new) or | |||
mask[mask].shape[-1] == len(new) or | |||
len(new) == 1): | |||
raise ValueError("cannot assign mismatch " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to change? (or is this what is actually causing the issue); this is not exposed to the user, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this ValueError caused Block.replace and ObjectBlock.replace to form infinite loop and ultimately raise RecursionError. It's because Block.replace catches exception include ValueError and if it fails, delegate replace to ObjectBlock. But ObjectBlock.replace also uses super(ObjectBlock, self), hence the loop. The exception handling includes TypeError and ValueError and they are both depended on (as appear to me) by other casting cases.
Open to suggestion, not sure a better way to handle, at the moment this change is needed to address the issue in question.
This can be exposed to user when:
import pandas as pd
s = pd.Series(range(10))
mask = s > 5
s[mask] = [5, 4, 3, 2, 1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know why we are catching ValueError at a higher level, maybe can get away with only catching TypeError and let ValueError buble up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point and it appears below test (only this test) fails when we only catch TypeError:
_________________ TestDataFrameReplace.test_replace_datetimetz _________________
# coerce to object
result = df.copy()
result.iloc[1, 0] = np.nan
result = result.replace(
{'A': pd.NaT}, Timestamp('20130104', tz='US/Pacific'))pandas/tests/frame/test_replace.py:1060:
elif isinstance(other, (np.datetime64, datetime, date)): other = tslibs.Timestamp(other) tz = getattr(other, 'tz', None) # test we can have an equal time zone if tz is None or str(tz) != str(self.values.tz): raise ValueError("incompatible or non tz-aware value")
E ValueError: incompatible or non tz-aware value
pandas/core/internals/blocks.py:2898: ValueError
It fails because of above ValueError which is passed on to ObjectBlock.replace as part of error handling in Block.replace. so it appears it is expected to be handled after the block is casted as object and try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback made some changes now we don't touch above error but instead make sure we don't enter infinite loop.
60de53c
to
4747c98
Compare
9506cc4
to
001c3a0
Compare
Codecov Report
@@ Coverage Diff @@
## master #22083 +/- ##
==========================================
+ Coverage 92.06% 92.06% +<.01%
==========================================
Files 169 169
Lines 50694 50698 +4
==========================================
+ Hits 46671 46675 +4
Misses 4023 4023
Continue to review full report at Codecov.
|
added test cases too from two issues linked. 🚀 @jreback gentle nudge. |
pandas/core/internals/blocks.py
Outdated
# GH 19266 and GH 21977 | ||
# ValueError triggers try except block in Block.replace | ||
# causing RecursionError | ||
raise Exception("cannot assign mismatch length " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a bit more informative from a user POV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. this change here is exposed to user so when one tries to assign an array of n-dimension to mask of m-dimension, when m != n, this error throws.
>>> import pandas as pd
>>> s = pd.Series(range(10))
>>> mask = s > 5
>>> s[mask].shape
(4,)
>>> s[mask] = [5, 4, 3, 2, 1]
File "/Users/mingli/.virtualenvs/cust-value/lib/python3.5/site-packages/pandas/core/internals.py", line 1006, in putmask
raise ValueError("cannot assign mismatch "
ValueError: cannot assign mismatch length to masked array
This is also used internally by replace
and ValueError it raises is causing infinite loop as per linked issues. Whilst changing that to not ValueError will solve it, it also makes it less clear and potentially breaks users' codes if they expect to catch ValueError running above code snippet.
Do you think it's ok to use Exception here as it would solve the issues linked but users might have to change their code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is reverted.
53bfe6d
to
9e27a38
Compare
fa1a1e5
to
167a3b7
Compare
Hello @minggli! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 07, 2018 at 13:23 Hours UTC |
3db333f
to
c8f4f72
Compare
…cursion * upstream/master: Added 0.23.4 whatsnew [ci skip] (pandas-dev#22214)
pandas/tests/frame/test_replace.py
Outdated
@@ -603,6 +603,20 @@ def test_replace_list(self): | |||
|
|||
assert_frame_equal(res, expec) | |||
|
|||
def test_replace_with_empty_list(self): | |||
# GH 21977 | |||
s = pd.Series([['a', 'b'], np.nan, [1]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to include an empty list in this test case (exists in the series
module one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -667,7 +667,8 @@ Reshaping | |||
- Bug in :meth:`DataFrame.replace` raises RecursionError when converting OutOfBounds ``datetime64[ns, tz]`` (:issue:`20380`) | |||
- :func:`pandas.core.groupby.GroupBy.rank` now raises a ``ValueError`` when an invalid value is passed for argument ``na_option`` (:issue:`22124`) | |||
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`) | |||
- | |||
- Bug in :meth:`DataFrame.replace` raises RecursionError when replace empty lists (:issue:`22083`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor nits but put RecursionError
in double backticks and change replace
to replacing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, ping on green.
pandas/core/internals/blocks.py
Outdated
convert=convert) | ||
# GH 22083, TypeError or ValueError occurred within error handling | ||
# causes infinite loop. Cast and retry only if not objectblock. | ||
if self.dtype == 'object': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use is_object_dtype
pandas/core/internals/blocks.py
Outdated
# causes infinite loop. Cast and retry only if not objectblock. | ||
if self.dtype == 'object': | ||
raise | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need this else here
thanks @minggli |
git diff upstream/master -u -- "*.py" | flake8 --diff
Internal mechanism in replace method is intertwined and can cause RecursionError (not very helpful). Discussed in #21977, there doesn't appear to be first class support for ExtensionArray yet and this issue is right now treated as error handling, having investigated treating it as new use case but not easily co-exist without large change.
The RecursionError is caused by
Block.replace
andObjectBlock.replace
calling each other under certain circumstances such as in the linked issues.New approach is now to prevent re-casting of blocks to object blocks if it is already an ObjectBlock and instead raise the error caught as it is, because if it's an ObjectBlock already then ValueError and TypeError raised won't go away when re-casted to object and re-try, resulting in infinite loop and RecursionError.