
BUG: break reference cycle in Index._engine #27607


Merged: 1 commit merged into pandas-dev:master from crepererum:fix/27585 on Aug 8, 2019

Conversation

Opened by @crepererum (Contributor).

@jreback left a comment:

This is much better done by removing @cache_readonly and then using a weakref, with the same type of caching (because cache_readonly does this in a very specific way).

@jreback added the Performance (Memory or execution speed performance) label on Jul 26, 2019
@crepererum (Author) commented:

@jreback I don't really understand what you mean. The issue is not the cache but that self._engine_type (e.g. Int64Engine) holds a callable that refers to self (i.e. the Index instance). Putting the result of self._engine_type(...) into a weakref would clear it too quickly and wouldn't cache it. I could use a callable for self._engine_type that uses a weakref to self instead of the partial solution that I am using now, but I don't see how this relates to @cache_readonly. Can you please be more specific about your suggestion? Thanks in advance.
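
(For context, the pattern being described looks roughly like the minimal stand-in below; MyIndex and Engine are illustrative names rather than the actual pandas classes. The engine's callable closes over self, the engine is cached on self, and the resulting cycle means the index can only be reclaimed by the cyclic garbage collector.)

```python
import gc
import weakref


class Engine:
    """Stand-in for pandas' hash-table engines (e.g. Int64Engine)."""

    def __init__(self, values_callable, n):
        # The engine keeps the callable so it can fetch the values lazily.
        self._values_callable = values_callable
        self._n = n


class MyIndex:
    """Stand-in for pandas.Index, showing how the cached engine creates a cycle."""

    _engine_type = Engine

    def __init__(self, values):
        self._values = values
        self._cache = {}

    @property
    def _engine(self):
        # Mimics @cache_readonly: the result is stored on the instance.
        if "_engine" not in self._cache:
            # The lambda closes over `self` and the engine is cached on
            # `self`, so the index and its engine keep each other alive.
            self._cache["_engine"] = self._engine_type(
                lambda: self._values, len(self._values)
            )
        return self._cache["_engine"]


idx = MyIndex(list(range(3)))
idx._engine               # build and cache the engine
ref = weakref.ref(idx)
del idx
print(ref() is not None)  # True: the cycle keeps the index alive
gc.collect()
print(ref() is None)      # True: only the garbage collector reclaims it
```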

jreback commented Jul 26, 2019


I am not really sure what you suggested actually does anything, so let's revisit: what is the actual problem? Having a cyclic reference is not a bug.


crepererum commented Jul 26, 2019


You're right, I should have labeled it as a performance optimization. The issue with the reference cycle (similar to the one we had some time ago between the indexer and the dataframe) is that indices are not cleared from memory without running the GC. Under notebook / low-load conditions this might not be an issue. For high-load systems like dask / dask.distributed it can be, since the Python interpreter requires more memory than it should (peak memory consumption). I have observed exactly this issue under dask.distributed: depending on the load pattern, users can see a temporary overhead of more than a gigabyte per worker (i.e. until the GC finds the Index instances). So what I would like to achieve with this PR is that the index does not have a reference cycle and is cleared instantly when it is no longer used.

If we agree on that change, I'll change the changelog entry and commit message before merging to make sure this is not labeled as a bug but as an enhancement.
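
(A rough way to see the peak-memory effect, using the MyIndex stand-in from the sketch above; the loop count and sizes are arbitrary, the point is only that peak memory grows with the number of dropped-but-uncollected indices.)

```python
import gc
import tracemalloc

# With the cyclic garbage collector disabled (to mimic a long stretch between
# GC passes on a busy worker), every index created in the loop stays in memory
# even though it is no longer reachable, so peak memory grows with the number
# of iterations instead of staying flat.
tracemalloc.start()
gc.disable()
try:
    for _ in range(50):
        idx = MyIndex(list(range(10_000)))
        idx._engine  # build and cache the engine, creating the cycle
        del idx      # unreachable, but not freed until the GC runs
finally:
    gc.enable()
_, peak = tracemalloc.get_traced_memory()
print(f"peak: {peak / 1e6:.1f} MB")  # grows roughly linearly with the loop count
gc.collect()  # only now is the accumulated memory actually released
```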


jreback commented Jul 27, 2019


Not objecting to the change in principle, but I'm not sure how adding an indirection different from the lambda helps here.

@crepererum (Author) commented:


I've changed it to a more elegant solution and added a comment explaining why this is helpful.
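
(In terms of the illustrative MyIndex from the earlier sketch, rather than the actual pandas diff, "not passing self into the lambda" amounts to something like the following; jreback's weakref suggestion would instead pass weakref.ref(self) to the engine and dereference it when the values are needed.)

```python
import weakref


class MyIndexFixed(MyIndex):
    """Variant of the MyIndex stand-in with the reference cycle broken."""

    @property
    def _engine(self):
        if "_engine" not in self._cache:
            # Bind the values to a local variable so the lambda closes over
            # `values` rather than `self`; the cached engine then holds no
            # reference back to the index, so there is no cycle.
            values = self._values
            self._cache["_engine"] = self._engine_type(
                lambda: values, len(values)
            )
        return self._cache["_engine"]


idx = MyIndexFixed(list(range(3)))
idx._engine
ref = weakref.ref(idx)
del idx
print(ref() is None)  # True: freed immediately, no gc.collect() needed
```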


jreback commented Jul 31, 2019

Can you add an asv benchmark for this as well (the memory kind)?

@crepererum (Author) commented:


Did so. Not sure if this is the right place, but it demonstrates the issue as well:

Before (current master):

$ asv run -e -E existing --bench gc
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Benchmarking existing-py_...
[100.00%] ··· index_object.GC.peakmem_gc_instances
[100.00%] ··· ======== ======
               param1
              -------- ------
                 1      125M
                 2      131M
                 5      155M
              ======== ======

After (this PR):

$ asv run -e -E existing --bench gc
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Benchmarking existing-py_...
[100.00%] ··· index_object.GC.peakmem_gc_instances
[100.00%] ··· ======== ======
               param1
              -------- ------
                 1      125M
                 2      125M
                 5      125M
              ======== ======
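
(For reference, an asv benchmark of the `peakmem_` kind has roughly the shape below; names and sizes are illustrative and may differ from the benchmark actually added in this PR. asv reports the peak memory of the process while the method runs, which stays flat once the cycle is broken.)

```python
# Sketch for asv_bench/benchmarks/index_object.py
import gc

import numpy as np

from pandas import Index


class GC:
    # Number of index instances created (and dropped) during one measurement.
    params = [1, 2, 5]

    def peakmem_gc_instances(self, N):
        # Disable the cyclic garbage collector so that any Index kept alive
        # only by a reference cycle survives for the whole measurement.
        try:
            gc.disable()
            for _ in range(N):
                idx = Index(np.arange(1_000_000))
                idx._engine  # build and cache the engine
                del idx
        finally:
            gc.enable()
```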

@jreback left a comment:

Tiny comment, otherwise LGTM. Don't push yet though; the CI is acting up.

@jreback modified the milestones: 1.0, 0.25.1 on Aug 1, 2019
@crepererum force-pushed the fix/27585 branch 2 times, most recently from 86814d2 to 2923512 on August 5, 2019 11:10

jreback commented Aug 5, 2019

LGTM. You have some lint issues, I think. Ping on green.

@crepererum (Author) commented:

@jreback ping :)

@TomAugspurger merged commit 8b6942f into pandas-dev:master on Aug 8, 2019
@TomAugspurger commented:

Thanks @crepererum!

Labels: Performance (Memory or execution speed performance)

Merging this pull request may close the following issue: Index._engine creates cyclic reference

4 participants