"maximum recursion depth exceeded" when calculating duplicates in big DataFrame (regression compared to the old version) #21524


Closed
Templarrr opened this issue Jun 18, 2018 · 9 comments · Fixed by #21541
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@Templarrr

Code Sample, a copy-pastable example if possible

I'm currently in the middle of upgrading an old system from pandas 0.12 to 0.23.0. One part of the system detects duplicate columns in medium-sized DataFrames (~100 columns, a few thousand rows). We detect them with `dupes = df.T.duplicated()`; this worked before, but after the upgrade it started failing. The simplest snippet to reproduce locally:

import numpy as np
import pandas as pd

data = {}
for i in range(70):
    data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 20000)
df = pd.DataFrame(data)
dupes = df.T.duplicated()
print(dupes)

Problem description

Contrary to the usual issue-template advice, this problem isn't resolved by upgrading to the newest pandas; on the contrary, it is caused by the upgrade :) The old implementation I've copied below from 0.12 works on the snippet above:

def old_duplicated(self, cols=None, take_last=False):
    """
    Return boolean Series denoting duplicate rows, optionally only
    considering certain columns

    Parameters
    ----------
    cols : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns
    take_last : boolean, default False
        Take the last observed row in a row. Defaults to the first row

    Returns
    -------
    duplicated : Series
    """

    # kludge for #1833
    def _m8_to_i8(x):
        if issubclass(x.dtype.type, np.datetime64):
            return x.view(np.int64)
        return x

    if cols is None:
        values = list(_m8_to_i8(self.values.T))
    else:
        if np.iterable(cols) and not isinstance(cols, basestring):
            if isinstance(cols, tuple):
                if cols in self.columns:
                    values = [self[cols]]
                else:
                    values = [_m8_to_i8(self[x].values) for x in cols]
            else:
                values = [_m8_to_i8(self[x].values) for x in cols]
        else:
            values = [self[cols]]

    keys = lib.fast_zip_fillna(values)
    duplicated = lib.duplicated(keys, take_last=take_last)
    return pd.Series(duplicated, index=self.index)

but the new implementation fails with:

Traceback (most recent call last):
  File "/home/modintsov/workspace/DataRobot/playground.py", line 56, in <module>
    dupes = df.T.duplicated()
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4384, in duplicated
    ids = get_group_index(labels, shape, sort=False, xnull=False)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 95, in get_group_index
    return loop(list(labels), list(shape))
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 86, in loop
    return loop(labels, shape)

... many-many lines of the same...

  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 60, in loop
    stride = np.prod(shape[1:nlev], dtype='i8')
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2566, in prod
    out=out, **kwargs)
RuntimeError: maximum recursion depth exceeded

Which is obviously a regression.



Expected Output

I expect no exception and a bool Series to be returned. On old pandas, the example above outputs:

col_00    False
col_01    False
col_02    False
col_03    False
col_04    False
col_05    False
col_06    False
col_07    False
col_08    False
col_09    False
col_10    False
col_11    False
col_12    False
col_13    False
col_14    False
...
col_55    False
col_56    False
col_57    False
col_58    False
col_59    False
col_60    False
col_61    False
col_62    False
col_63    False
col_64    False
col_65    False
col_66    False
col_67    False
col_68    False
col_69    False
Length: 70, dtype: bool

Output of pd.show_versions()

pd.show_versions()
INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.21
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.5.5
patsy: 0.2.1
dateutil: 2.7.3
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.3.2.dr2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Jun 18, 2018

I was unable to reproduce this on master - can you check there and see if that resolves your issue?

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jun 18, 2018
@Liam3851
Contributor

@WillAyd Master works for me on Python 3.6, but I can reproduce the issue under Python 2.7.15 (OP is using 2.7.12).

@WillAyd
Member

WillAyd commented Jun 18, 2018

Thanks @Liam3851. In that case this comes from a recursive function call, and I think the compatibility difference is that sys.getrecursionlimit() returns 3000 for Python 3 but only 1000 for Python 2. Tracing this call with a DataFrame of this size requires 2,223 recursive calls on my end, hence the failure.

Can you see if increasing that limit in Python2 resolves the issue?
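For anyone who just needs to unblock a Python 2 run, raising the limit can be done in a scoped way. A minimal sketch (the frame is scaled down from the report's 70 columns x 20,000 rows, and the 5000 value is a guess to be tuned for your frame size):

```python
import sys

import numpy as np
import pandas as pd

# Scaled-down version of the frame from the report
# (the original used 70 columns x 20,000 rows).
df = pd.DataFrame({'col_{0:02d}'.format(i): np.random.randint(0, 1000, 2000)
                   for i in range(10)})

# Temporarily raise the interpreter's recursion limit so the recursive
# loop inside get_group_index can finish, then restore the old value.
old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(max(old_limit, 5000))  # 5000 is a guess; scale with frame size
try:
    dupes = df.T.duplicated()
finally:
    sys.setrecursionlimit(old_limit)

print(dupes.sum())  # random columns are almost surely all unique
```

The try/finally makes sure the original limit is restored even if the call still blows up at a larger size.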

@WillAyd WillAyd added 2/3 Compat and removed Needs Info Clarification about behavior needed to assess issue labels Jun 18, 2018
@Liam3851
Contributor

Liam3851 commented Jun 18, 2018

@WillAyd OP's test case works on Python 2 as written once the system recursion limit is raised to 3,000.

If I increase the column length in OP's test case from 20,000 to 30,000 (i.e. 30,000 columns after the transpose), that re-breaks Python 2 with the maximum-recursion-depth error, which is perhaps expected.

However, at least on my box (Win7, 40 GB RAM free for the process), the same increase from 20,000 to 30,000 causes Python 3 to crash completely, which I did not expect.
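Until a fix lands, one way to sidestep both the recursion depth and the very wide transposed frame is to avoid `df.T` entirely and compare each column's raw bytes. A sketch under stated assumptions: `duplicated_columns` is a hypothetical helper, it assumes numeric columns and unique column names, and columns with equal values but different dtypes hash differently here, unlike `df.T.duplicated()`:

```python
import numpy as np
import pandas as pd

def duplicated_columns(df):
    # Hypothetical alternative to df.T.duplicated(): mark a column True
    # if an earlier column holds byte-identical values, mirroring
    # duplicated()'s default keep='first' behaviour.
    seen = set()
    out = pd.Series(False, index=df.columns)
    for col in df.columns:
        key = df[col].values.tobytes()  # raw bytes of the column's ndarray
        if key in seen:
            out[col] = True
        else:
            seen.add(key)
    return out

df = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [3, 4]})
print(duplicated_columns(df))  # 'b' duplicates 'a'
```

This does one pass over the columns and never builds the 20,000-column transposed frame, so memory stays proportional to the original DataFrame.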

@Templarrr
Author

Templarrr commented Jun 18, 2018

Mostly I was wondering why it's written this way. Tail recursion is a beautiful thing, but in a language without tail-call optimization it can (imho, should...) be replaced with a simple loop. That would be just as fast, consume fewer resources, and not depend on the recursion limit.
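The transformation being suggested is mechanical: since the recursive call in `loop` is the last thing the function does, its arguments can simply be rebound and the body re-entered. A toy sketch (not the actual pandas code; the shrink step is made up for illustration):

```python
def process_recursive(labels, shape):
    # Toy stand-in for the recursive `loop` in pandas/core/sorting.py:
    # each pass shrinks the problem, then tail-calls itself, so stack
    # depth grows with the input size.
    if len(shape) <= 1:
        return labels
    labels, shape = [x + 1 for x in labels], shape[1:]  # hypothetical shrink step
    return process_recursive(labels, shape)

def process_iterative(labels, shape):
    # Same computation as a while loop: rebind the arguments instead of
    # recursing, so stack depth stays constant regardless of input size.
    while len(shape) > 1:
        labels, shape = [x + 1 for x in labels], shape[1:]
    return labels
```

With a `shape` of length 5,000, `process_iterative` finishes without touching the recursion limit, while the recursive form would exceed CPython 2's default limit of 1,000.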

@Templarrr
Author

Sorry for the multiple edits of my previous comment; phone autocorrect decided it knows best what I'm trying to say :)

@WillAyd
Member

WillAyd commented Jun 18, 2018

I can't speak to the history of that code, but if you have a way to optimize it to perform better, use fewer resources, avoid the recursion limit, etc., then PRs are always welcome!

@Templarrr
Author

@WillAyd Well, I've never made a PR to pandas before (and only a one-liner to numpy...), so I thought someone with more pandas-specific experience would be better suited here :) but I can certainly try

@Templarrr
Author

So, I've opened a PR; I'm not sure it 100% follows pandas best practices, but I've tried.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 19, 2018
@jreback jreback added this to the 0.23.2 milestone Jun 19, 2018