-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix group index calculation to prevent hitting maximum recursion depth (#21524) #21541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3.5 failure in Travis looks like caused by a network glitch :( need to rerun it
|
pandas/core/sorting.py
Outdated
@@ -52,7 +52,17 @@ def _int64_cut_off(shape): | |||
return i | |||
return len(shape) | |||
|
|||
def loop(labels, shape): | |||
def maybe_lift(lab, size): # pormote nan values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sp here
can you add a 1-line comment on what this is doing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as with loop - I would try, though I'm not the author of original code - I've just changed it from recursion to loop, so I can't be sure I understand 100% all the nuances here...
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -60,6 +60,7 @@ Bug Fixes | |||
|
|||
- Bug in :meth:`Index.get_indexer_non_unique` with categorical key (:issue:`21448`) | |||
- Bug in comparison operations for :class:`MultiIndex` where error was raised on equality / inequality comparison involving a MultiIndex with ``nlevels == 1`` (:issue:`21149`) | |||
- Bug in calculation of group index causing "maximum recursion depth exceeded" errors during ``DataFrame.duplicated`` calls (:issue:`21524`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just say:
bug in :func:`DataFrame.duplicated`
with a large number of columns causing a 'maximum .....'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll update this
|
||
labels = list(labels) | ||
shape = list(shape) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment on the purpose of the loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try, though I'm not the author of original code - I've just changed it from recursion to loop, so I can't be sure I understand 100% all the nuances here...
labels, shape = map(list, zip(*map(maybe_lift, labels, shape))) | ||
|
||
return loop(list(labels), list(shape)) | ||
return out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does out need a definition outside of the loop? e.g. is it always defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is always defined here - out is assigned before the exit from the loop can happen.
And if something (though I don't know what in this case) throw an Exception - we will bypass return alltogether
can you add your example as a test |
Hello @Templarrr! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 20, 2018 at 10:25 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #21541 +/- ##
==========================================
+ Coverage 91.91% 91.92% +<.01%
==========================================
Files 153 153
Lines 49546 49564 +18
==========================================
+ Hits 45542 45560 +18
Misses 4004 4004
Continue to review full report at Codecov.
|
@jreback I've addressed all your comments the best I can. |
Also, just to doublecheck that I've updated the correct changelog - if this is merged- it will be in 0.23.2, right? |
Again :(
|
i restarted that job. ping on green. |
pandas/tests/frame/test_analytics.py
Outdated
@@ -1527,6 +1527,22 @@ def test_duplicated_with_misspelled_column_name(self, subset): | |||
with pytest.raises(KeyError): | |||
df.drop_duplicates(subset) | |||
|
|||
def test_duplicated_do_not_fail_on_wide_dataframes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long does this test take to run? May be worth a slow tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that original test case needed to hit recursion limit... it can't be super fast.
It takes a second or two on my laptop.
I'll add the slow tag
pandas/core/sorting.py
Outdated
def maybe_lift(lab, size): | ||
# promote nan values (assigned to -1 here) | ||
# so that all values are non-negative | ||
return (lab + 1, size + 1) if (lab == -1).any() else (lab, size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this not obfuscate NA and 0 values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this works with labels, not the original values. labels got here from
df.duplicate -> core.algorithms.factorize -> _factorize_array call and that method replace NA with -1 (na_sentinel : int, default -1
) and assign all other values >=0
Also, I did not change this in any way, it's an existing code that already in master :)
@WillAyd I've addressed your review comments |
pandas/tests/frame/test_analytics.py
Outdated
def test_duplicated_do_not_fail_on_wide_dataframes(self): | ||
# Given the wide dataframe with a lot of columns | ||
# with different (important!) values | ||
data = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a dict comprehenseion
pandas/tests/frame/test_analytics.py
Outdated
for i in range(100): | ||
data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 30000) | ||
df = pd.DataFrame(data).T | ||
# When we request to calculate duplicates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need this comment here.
call this result=
@jreback your comments are addressed as well |
thanks @Templarrr @WillAyd lgtm. pls approve & merge when satisifed |
Thanks @Templarrr ! |
Yay, I'm pandas contributor :-D |
This just replaces tail recursion call with a simple loop. It should have no effect whatsoever on a performance but prevent hitting recursion limits on some input data ( See example in my issue here: #21524 )
git diff upstream/master -u -- "*.py" | flake8 --diff