BUG: Fix group index calculation to prevent hitting maximum recursion depth (#21524) #21541

Templarrr · 2018-06-19T09:12:43Z

This just replaces the tail-recursive call with a simple loop. It should have no effect whatsoever on performance, but it prevents hitting the recursion limit on some input data (see the example in my issue: #21524).

closes "maximum recursion depth exceeded" when calculating duplicates in big DataFrame (regression comparing to the old version) #21524
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
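
As a rough illustration of the kind of change described above (this is not the actual `pandas/core/sorting.py` code), here is a minimal sketch of a tail-recursive helper next to its loop equivalent. The names `fold_recursive` and `fold_loop` and the "combine a few items, then continue with the rest" scheme are hypothetical; they only mimic the general shape of processing a long list of levels a chunk at a time.

```python
import sys


def fold_recursive(values, chunk=2):
    # Combine the first `chunk` items into one number.
    head = sum(values[:chunk])
    if chunk >= len(values):
        return head
    # Tail call: one extra stack frame per step, so a long input
    # can exceed the interpreter's recursion limit.
    return fold_recursive([head] + values[chunk:], chunk)


def fold_loop(values, chunk=2):
    values = list(values)
    while True:
        # `head` is always bound before the `break` below can run.
        head = sum(values[:chunk])
        if chunk >= len(values):
            break
        values = [head] + values[chunk:]
    return head


# An input long enough to need more steps than the recursion limit allows.
wide = list(range(sys.getrecursionlimit() * 3))
print(fold_loop(wide) == sum(wide))
try:
    fold_recursive(wide)
except RecursionError as exc:
    print("recursive version failed:", exc)
```

Both functions perform the same steps in the same order; the loop simply reuses one stack frame instead of adding a new one per step.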

Templarrr · 2018-06-19T10:59:23Z

The 3.5 failure in Travis looks like it was caused by a network glitch :( It needs to be rerun.

3.5 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-3.5.tar.bz2
$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

jreback · 2018-06-19T11:19:54Z

pandas/core/sorting.py

@@ -52,7 +52,17 @@ def _int64_cut_off(shape):
                return i
        return len(shape)

-    def loop(labels, shape):
+    def maybe_lift(lab, size):  # pormote nan values


sp here

can you add a 1-line comment on what this is doing

Same as with the loop: I'll try, though I'm not the author of the original code. I've just changed it from recursion to a loop, so I can't be sure I understand 100% of the nuances here...

jreback · 2018-06-19T11:20:31Z

doc/source/whatsnew/v0.23.2.txt

@@ -60,6 +60,7 @@ Bug Fixes

 - Bug in :meth:`Index.get_indexer_non_unique` with categorical key (:issue:`21448`)
 - Bug in comparison operations for :class:`MultiIndex` where error was raised on equality / inequality comparison involving a MultiIndex with ``nlevels == 1`` (:issue:`21149`)
+- Bug in calculation of group index causing "maximum recursion depth exceeded" errors during ``DataFrame.duplicated`` calls  (:issue:`21524`).


just say:

bug in :func:`DataFrame.duplicated` with a large number of columns causing a 'maximum .....'

Thanks, I'll update this

jreback · 2018-06-19T11:21:40Z

pandas/core/sorting.py

+
+    labels = list(labels)
+    shape = list(shape)
+


can you comment on the purpose of the loop

I'll try, though I'm not the author of the original code. I've just changed it from recursion to a loop, so I can't be sure I understand 100% of the nuances here...

jreback · 2018-06-19T11:22:14Z

pandas/core/sorting.py

-        labels, shape = map(list, zip(*map(maybe_lift, labels, shape)))
-
-    return loop(list(labels), list(shape))
+    return out


does out need a definition outside of the loop? e.g. is it always defined

It is always defined here: out is assigned before the exit from the loop can happen.
And if something (though I don't know what in this case) throws an exception, we will bypass the return altogether.

jreback · 2018-06-19T11:23:11Z

can you add your example as a test

pep8speaks · 2018-06-19T12:05:37Z

Hello @Templarrr! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 20, 2018 at 10:25 Hours UTC

codecov · 2018-06-19T12:05:45Z

Codecov Report

Merging #21541 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21541      +/-   ##
==========================================
+ Coverage   91.91%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49564      +18     
==========================================
+ Hits        45542    45560      +18     
  Misses       4004     4004

Flag	Coverage Δ
#multiple	`90.32% <100%> (ø)`	⬆️
#single	`41.8% <0%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/sorting.py	`98.2% <100%> (ø)`	⬆️
pandas/core/arrays/categorical.py	`95.63% <0%> (-0.06%)`	⬇️
pandas/core/generic.py	`96.12% <0%> (-0.01%)`	⬇️
pandas/core/indexes/base.py	`96.62% <0%> (ø)`	⬆️
pandas/core/series.py	`94.19% <0%> (+0.01%)`	⬆️
pandas/core/indexes/category.py	`97.28% <0%> (+0.18%)`	⬆️
pandas/core/dtypes/cast.py	`88.49% <0%> (+0.26%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2625759...eebb8cf. Read the comment docs.

Templarrr · 2018-06-19T12:11:07Z

@jreback I've addressed all your comments as best I can.
Changelog updated; the snippet used in the issue has been added as a test case (with minor changes to accommodate different py3 recursion settings).
Comments on the loop and the internal method have been added, describing what is going on there as best I understand it.

Templarrr · 2018-06-19T12:12:32Z

Also, just to double-check that I've updated the correct changelog: if this is merged, it will be in 0.23.2, right?

Templarrr · 2018-06-19T13:07:48Z

Again :(

3.5 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-3.5.tar.bz2
$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

jreback · 2018-06-19T20:31:33Z

i restarted that job. ping on green.

WillAyd · 2018-06-20T01:02:32Z

pandas/tests/frame/test_analytics.py

@@ -1527,6 +1527,22 @@ def test_duplicated_with_misspelled_column_name(self, subset):
        with pytest.raises(KeyError):
            df.drop_duplicates(subset)

+    def test_duplicated_do_not_fail_on_wide_dataframes(self):


How long does this test take to run? May be worth a slow tag

Given that the original test case needed to hit the recursion limit... it can't be super fast.
It takes a second or two on my laptop.
I'll add the slow tag.

WillAyd · 2018-06-20T01:05:26Z

pandas/core/sorting.py

+    def maybe_lift(lab, size):
+        # promote nan values (assigned to -1 here)
+        # so that all values are non-negative
+        return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)


Does this not obfuscate NA and 0 values?

This works with labels, not the original values. The labels get here via df.duplicated -> core.algorithms.factorize -> _factorize_array, and that method replaces NA with -1 (na_sentinel : int, default -1) and assigns all other values labels >= 0.
Also, I did not change this in any way; it's existing code that is already in master :)
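
A minimal sketch of the point above, using plain Python lists rather than the NumPy arrays the real code works on. `toy_factorize` is a hypothetical stand-in for the factorization step (it is not the pandas implementation); `maybe_lift` follows the one-liner quoted in the diff, adapted to lists.

```python
import math


def toy_factorize(values, na_sentinel=-1):
    """Hypothetical stand-in: distinct values get codes 0, 1, 2, ...;
    missing values (None or NaN) get the sentinel."""
    codes, seen = [], {}
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            codes.append(na_sentinel)
        else:
            codes.append(seen.setdefault(v, len(seen)))
    return codes, list(seen)


def maybe_lift(lab, size):
    # Shift every label up by one if a -1 (missing) label is present,
    # so that all labels are non-negative.
    if any(x == -1 for x in lab):
        return [x + 1 for x in lab], size + 1
    return lab, size


values = [0.0, float("nan"), 5.0, 0.0]
labels, uniques = toy_factorize(values)
lifted, size = maybe_lift(labels, len(uniques))
print(labels, lifted, size)
```

In this sketch the original value `0.0` and the missing value never share a label: before lifting they are `0` and `-1`, after lifting they are `1` and `0`.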

Templarrr · 2018-06-20T09:55:23Z

@WillAyd I've addressed your review comments

jreback · 2018-06-20T10:06:41Z

pandas/tests/frame/test_analytics.py

+    def test_duplicated_do_not_fail_on_wide_dataframes(self):
+        # Given the wide dataframe with a lot of columns
+        # with different (important!) values
+        data = {}


make this a dict comprehension

jreback · 2018-06-20T10:06:55Z

pandas/tests/frame/test_analytics.py

+        for i in range(100):
+            data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 30000)
+        df = pd.DataFrame(data).T
+        # When we request to calculate duplicates


don't need this comment here.
call this result=
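
The dict-comprehension suggestion from this review can be sketched as follows. To keep the example runnable without NumPy, it uses the standard-library `random` module and small sizes in place of `np.random.randint(0, 1000, 30000)`; the helper names and the `seed` parameter are hypothetical and are not part of the PR.

```python
import random


def build_with_loop(n_cols, n_rows, seed=0):
    # Loop form, as in the diff above: start empty, fill key by key.
    rng = random.Random(seed)
    data = {}
    for i in range(n_cols):
        data['col_{0:02d}'.format(i)] = [rng.randrange(1000)
                                         for _ in range(n_rows)]
    return data


def build_with_comprehension(n_cols, n_rows, seed=0):
    # Same mapping, built in a single expression.
    rng = random.Random(seed)
    return {'col_{0:02d}'.format(i): [rng.randrange(1000)
                                      for _ in range(n_rows)]
            for i in range(n_cols)}


print(build_with_loop(3, 4) == build_with_comprehension(3, 4))
```

With the same seed both helpers draw the same random numbers in the same order, so the two dictionaries compare equal.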

Templarrr · 2018-06-20T10:16:59Z

@jreback your comments are addressed as well

jreback · 2018-06-20T10:25:46Z

thanks @Templarrr

@WillAyd lgtm. pls approve & merge when satisfied

WillAyd · 2018-06-21T02:54:37Z

Thanks @Templarrr !

Templarrr · 2018-06-21T08:40:13Z

Yay, I'm a pandas contributor :-D
Thanks for the reviews and comments!

BUG: Fix group index calculation to prevent hitting maximum recursion…

ee30e60

… depth (#21524)

jreback requested changes Jun 19, 2018

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 19, 2018

Michael Odintsov added 2 commits June 19, 2018 14:53

Adding more comments and updated changelog

91a284e

Added testcase

fac97d5

linter

7e9f315

jreback added the Bug label Jun 19, 2018

jreback added this to the 0.23.2 milestone Jun 19, 2018

WillAyd requested changes Jun 20, 2018

address PR review comments (better comment and mark test as slow)

727e654

jreback requested changes Jun 20, 2018

address more PR review comments (list comprehension and result name)

befa65d

jreback added 2 commits June 20, 2018 06:23

Merge branch 'master' into PR_TOOL_MERGE_PR_21541

c02b188

doc

eebb8cf

jreback approved these changes Jun 20, 2018

WillAyd approved these changes Jun 21, 2018

WillAyd merged commit f91a704 into pandas-dev:master Jun 21, 2018

jorisvandenbossche added Needs Backport and removed Needs Backport labels Jun 29, 2018

jorisvandenbossche pushed a commit that referenced this pull request Jun 29, 2018

BUG: Fix group index calculation to prevent hitting maximum recursion…

ff7d84a

… depth (#21541) (cherry picked from commit f91a704)

jorisvandenbossche pushed a commit that referenced this pull request Jul 2, 2018

BUG: Fix group index calculation to prevent hitting maximum recursion…

172c515

… depth (#21541) (cherry picked from commit f91a704)

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: Fix group index calculation to prevent hitting maximum recursion…

5d3a7c9

… depth (pandas-dev#21541)
