"maximum recursion depth exceeded" when calculating duplicates in big DataFrame (regression comparing to the old version) #21524
Comments
I was unable to reproduce this on master - can you check there and see if that resolves your issue?
@WillAyd Master works for me on Python 3.6, but I can reproduce the issue under Python 2.7.15 (OP is using 2.7.12).
Thanks @Liam3851. Well, in that case this is a recursive function call, and I think the compatibility difference is in how the recursion limit is hit. Can you see if increasing that limit in Python 2 resolves the issue?
@WillAyd OP's test case works on Python 2 with the recursion limit increased to 3,000, as written. If I increase the number of columns in OP's test case from 20,000 to 30,000, that re-breaks Python 2 with the maximum recursion depth error (perhaps as expected). However, at least on my box (Win7, 40 GB RAM free for the process), increasing the number of columns from 20,000 to 30,000 causes Python 3 to crash completely (I think this was not expected, or at least I didn't expect it).
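For reference, the workaround tested above amounts to raising CPython's frame limit before the call (a sketch; 3,000 matches the value used above, and the right value scales with the number of columns):

```python
import sys

# CPython defaults to a 1000-frame recursion limit; raising it lets the
# recursive helper behind duplicated() finish on wider frames.
# This is a stopgap for the workaround discussed above, not a fix.
sys.setrecursionlimit(3000)
```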
Mostly I was wondering why it's written this way? Tail recursion is a beautiful thing, but in a language without tail-call optimization it can (imho, should...) be replaced with a simple loop. That will be just as fast, consume fewer resources, and will not depend on the recursion limit.
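A generic illustration of that point (not pandas' actual code): when the recursive call is the last thing a function does, it can be rewritten mechanically as a loop, which uses constant stack space:

```python
def walk_recursive(node):
    """Tail-recursive: follow .next links; every hop costs a stack frame,
    so long chains hit Python's recursion limit."""
    if node.get("next") is None:
        return node["value"]
    return walk_recursive(node["next"])


def walk_loop(node):
    """Same logic as a loop: O(1) stack, immune to the recursion limit."""
    while node.get("next") is not None:
        node = node["next"]
    return node["value"]
```

With the default 1000-frame limit, `walk_recursive` fails on a chain of a few thousand links while `walk_loop` handles it without issue.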
Sorry for the multiple edits of my previous comment, phone autocorrect decided it knows best what I'm trying to say :)
I can't speak to the history of that code, but if you have a way to optimize it to perform better, use less resources, avoid the recursion limit, etc... then PRs are always welcome!
@WillAyd Well, I've never made a PR to pandas before (and only one one-liner to numpy...) so I thought someone with more pandas-specific experience would be better suited here :) but I can certainly try
So, I've opened a PR; not sure it 100% follows best pandas practices, but I've tried
Code Sample, a copy-pastable example if possible
I'm currently in the middle of upgrading an old system from old pandas (0.12) to the new version (0.23.0). One part of the system is detection of duplicate columns in medium-sized DataFrames (~100 columns, a few thousand rows). We were detecting it like this:
dupes = df.T.duplicated()
and previously it worked, but after the upgrade it started failing. Simplest snippet to reproduce this locally:
Problem description
Contrary to the note below, this issue isn't resolved by upgrading to the newest pandas. On the contrary, it is caused by that upgrade :) The old version I've copied below from 0.12 works on the snippet above
but the new one now fails with a "maximum recursion depth exceeded" error, which is obviously a regression.
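The reproducing snippet referenced above did not survive; a sketch consistent with the sizes discussed in the comments (a wide frame of identical columns, transposed and checked for duplicates) would be:

```python
import numpy as np
import pandas as pd

# ~20,000 identical columns, matching the scale discussed in the thread.
df = pd.DataFrame(np.zeros((10, 20000)))

# Raised "maximum recursion depth exceeded" on pandas 0.23.0 / Python 2.7;
# the expected result is simply a boolean Series over the 20,000 columns.
dupes = df.T.duplicated()
```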
Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!
Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you. If possible, you may want to check if master addresses this issue, but that is not necessary. For documentation-related issues, you can check the latest versions of the docs on master here: https://pandas-docs.github.io/pandas-docs-travis/
If the issue has not been resolved there, go ahead and file it in the issue tracker.
Expected Output
I expect no exception and a bool Series to be returned. The example above on old pandas outputs this
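The pasted expected output did not survive; a small-scale illustration of the return type (hypothetical toy data, not the OP's frame):

```python
import pandas as pd

# Columns "a" and "b" hold identical data, "c" differs.
df = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [3, 4]})

# duplicated() on the transpose flags later copies of an earlier column:
dupes = df.T.duplicated()
# a → False, b → True ("b" repeats "a"), c → False; dtype is bool
```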
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.21
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.5.5
patsy: 0.2.1
dateutil: 2.7.3
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.3.2.dr2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None