
BUG: break reference cycle in Index._engine #27607


Merged: 1 commit merged into pandas-dev:master from crepererum:fix/27585 on Aug 8, 2019

Conversation

Opened by @crepererum (Contributor).

@jreback left a comment:

This is much better done by removing @cache_readonly and then using a weakref, with the same type of caching (because cache_readonly does this in a very specific way).

@jreback added the Performance (Memory or execution speed performance) label on Jul 26, 2019
@crepererum (Author) commented:

@jreback I don't really understand what you mean. The issue is not the cache but that self._engine_type (e.g. Int64Engine) holds a callable that refers to self (i.e. the Index instance). Putting the result of self._engine_type(...) into a weakref would clear it too quickly and wouldn't cache it. I could use a callable for self._engine_type that uses a weakref to self instead of the partial solution that I am using now, but I don't see how this relates to @cache_readonly. Can you please be more specific about your suggestion? Thanks in advance.
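
(For context, the pattern being described looks roughly like the minimal stand-in below; MyIndex and Engine are illustrative names rather than the actual pandas classes. The engine's callable closes over self, the engine is cached on self, and the resulting cycle means the index can only be reclaimed by the cyclic garbage collector.)

```python
import gc
import weakref


class Engine:
    """Stand-in for pandas' hash-table engines (e.g. Int64Engine)."""

    def __init__(self, values_callable, n):
        # The engine keeps the callable so it can fetch the values lazily.
        self._values_callable = values_callable
        self._n = n


class MyIndex:
    """Stand-in for pandas.Index, showing how the cached engine creates a cycle."""

    _engine_type = Engine

    def __init__(self, values):
        self._values = values
        self._cache = {}

    @property
    def _engine(self):
        # Mimics @cache_readonly: the result is stored on the instance.
        if "_engine" not in self._cache:
            # The lambda closes over `self` and the engine is cached on
            # `self`, so the index and its engine keep each other alive.
            self._cache["_engine"] = self._engine_type(
                lambda: self._values, len(self._values)
            )
        return self._cache["_engine"]


idx = MyIndex(list(range(3)))
idx._engine               # build and cache the engine
ref = weakref.ref(idx)
del idx
print(ref() is not None)  # True: the cycle keeps the index alive
gc.collect()
print(ref() is None)      # True: only the garbage collector reclaims it
```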

jreback commented Jul 26, 2019


I am not really sure what you suggested actually does anything, so let's revisit: what is the actual problem? Having a cyclic reference is not a bug.


crepererum commented Jul 26, 2019


You're right, I should have labeled it as a performance optimization. The issue with the reference cycle (similar to the one we had some time ago between the indexer and the dataframe) is that indices are not cleared from memory without running the GC. Under notebook / low-load conditions this might not be an issue. For high-load systems like dask / dask.distributed it can be, since the Python interpreter requires more memory than it should (peak memory consumption). I have observed exactly this issue under dask.distributed: depending on the load pattern, users can see a temporary overhead of more than a gigabyte per worker (i.e. until the GC finds the Index instances). So what I would like to achieve with this PR is that the index does not have a reference cycle and is cleared instantly when it is no longer used.

If we agree on that change, I'll change the changelog entry and commit message before merging to make sure this is not labeled as a bug but as an enhancement.
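
(A rough way to see the peak-memory effect, using the MyIndex stand-in from the sketch above; the loop count and sizes are arbitrary, the point is only that peak memory grows with the number of dropped-but-uncollected indices.)

```python
import gc
import tracemalloc

# With the cyclic garbage collector disabled (to mimic a long stretch between
# GC passes on a busy worker), every index created in the loop stays in memory
# even though it is no longer reachable, so peak memory grows with the number
# of iterations instead of staying flat.
tracemalloc.start()
gc.disable()
try:
    for _ in range(50):
        idx = MyIndex(list(range(10_000)))
        idx._engine  # build and cache the engine, creating the cycle
        del idx      # unreachable, but not freed until the GC runs
finally:
    gc.enable()
_, peak = tracemalloc.get_traced_memory()
print(f"peak: {peak / 1e6:.1f} MB")  # grows roughly linearly with the loop count
gc.collect()  # only now is the accumulated memory actually released
```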


jreback commented Jul 27, 2019


Not objecting to the change in principle, but I'm not sure how adding an indirection different from the lambda helps here.

@crepererum (Author) commented:


I've changed it to a more elegant solution and added a comment explaining why this is helpful.
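
(In terms of the illustrative MyIndex from the earlier sketch, rather than the actual pandas diff, "not passing self into the lambda" amounts to something like the following; jreback's weakref suggestion would instead pass weakref.ref(self) to the engine and dereference it when the values are needed.)

```python
import weakref


class MyIndexFixed(MyIndex):
    """Variant of the MyIndex stand-in with the reference cycle broken."""

    @property
    def _engine(self):
        if "_engine" not in self._cache:
            # Bind the values to a local variable so the lambda closes over
            # `values` rather than `self`; the cached engine then holds no
            # reference back to the index, so there is no cycle.
            values = self._values
            self._cache["_engine"] = self._engine_type(
                lambda: values, len(values)
            )
        return self._cache["_engine"]


idx = MyIndexFixed(list(range(3)))
idx._engine
ref = weakref.ref(idx)
del idx
print(ref() is None)  # True: freed immediately, no gc.collect() needed
```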


jreback commented Jul 31, 2019

Can you add an asv benchmark for this as well (the memory kind)?

@crepererum (Author) commented:


Did so. Not sure if this is the right place, but it demonstrates the issue as well:

Before (current master):

$ asv run -e -E existing --bench gc
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Benchmarking existing-py_...
[100.00%] ··· index_object.GC.peakmem_gc_instances
[100.00%] ··· ======== ======
               param1
              -------- ------
                 1      125M
                 2      131M
                 5      155M
              ======== ======

After (this PR):

$ asv run -e -E existing --bench gc
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Benchmarking existing-py_...
[100.00%] ··· index_object.GC.peakmem_gc_instances
[100.00%] ··· ======== ======
               param1
              -------- ------
                 1      125M
                 2      125M
                 5      125M
              ======== ======
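
(For reference, an asv benchmark of the `peakmem_` kind has roughly the shape below; names and sizes are illustrative and may differ from the benchmark actually added in this PR. asv reports the peak memory of the process while the method runs, which stays flat once the cycle is broken.)

```python
# Sketch for asv_bench/benchmarks/index_object.py
import gc

import numpy as np

from pandas import Index


class GC:
    # Number of index instances created (and dropped) during one measurement.
    params = [1, 2, 5]

    def peakmem_gc_instances(self, N):
        # Disable the cyclic garbage collector so that any Index kept alive
        # only by a reference cycle survives for the whole measurement.
        try:
            gc.disable()
            for _ in range(N):
                idx = Index(np.arange(1_000_000))
                idx._engine  # build and cache the engine
                del idx
        finally:
            gc.enable()
```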

@jreback left a comment:

Tiny comment, otherwise LGTM. Don't push yet though; the CI is acting up.

@jreback modified the milestones: 1.0, 0.25.1 on Aug 1, 2019
@crepererum force-pushed the fix/27585 branch 2 times, most recently from 86814d2 to 2923512 on August 5, 2019 11:10

jreback commented Aug 5, 2019

LGTM. You have some lint issues, I think. Ping on green.

@crepererum (Author) commented:

@jreback ping :)

@TomAugspurger merged commit 8b6942f into pandas-dev:master on Aug 8, 2019
@TomAugspurger commented:

Thanks @crepererum!

Labels: Performance (Memory or execution speed performance)

Merging this pull request may close the following issue: Index._engine creates cyclic reference

4 participants