ENH: avoid creating reference cycle on indexing (#15746) #17956

pitrou · 2017-10-23T15:49:15Z

closes CLN: make _Indexer.obj a weak ref #15746
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff

pitrou · 2017-10-23T15:52:01Z

I wonder if I should add a whatsnew entry for this, and if so, in which file?

jreback

no whatsnew needed here. this is not user visible.

jreback · 2017-10-23T16:06:38Z

pandas/_libs/lib.pyx

@@ -1839,5 +1839,27 @@ cdef class BlockPlacement:
        return self._as_slice




make a new cython module. indexing.pyx (and add to setup.py).

jreback · 2017-10-23T16:07:56Z

though you could add in the perf section if you want (but you will need to wait for us to create the 0.22 notes); releasing 0.21.0 shortly.

pitrou · 2017-10-23T16:23:04Z

The latest changes should address your comments.

jreback · 2017-10-23T17:33:52Z

pandas/core/indexing.py

-        self.obj = obj
-        self.ndim = obj.ndim
-        self.name = name
-
    def __call__(self, axis=None):
        # we need to return a copy of ourselves


did you reverse these on purpose?

Yes, so that we can use functools.partial(indexer, name) to shave a few % more. If that's not important, I can revert to the original order.

ok this is not a big deal to reverse.
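
The argument-order point above can be illustrated with a minimal sketch. It assumes a hypothetical `Indexer` whose constructor takes `(name, obj)` and a hypothetical `Frame` container; neither is the actual pandas code. With `name` first, `functools.partial` can bind it positionally, and the resulting callable can serve directly as a property getter that receives the instance as `obj`.

```python
import functools


class Indexer:
    """Hypothetical indexer whose constructor takes (name, obj) in that order."""

    def __init__(self, name, obj):
        self.name = name
        self.obj = obj


class Frame:
    """Hypothetical container exposing indexers as properties."""


for _name in ("loc", "iloc"):
    # partial binds `name`; the property machinery then calls the partial
    # with the instance, which lands in the `obj` parameter.
    setattr(Frame, _name, property(functools.partial(Indexer, _name)))


frame = Frame()
indexer = frame.loc  # equivalent to Indexer("loc", frame)
```

If the constructor took `(obj, name)` instead, a small wrapper function would be needed to put the arguments in the right order, which is the extra call the reordering avoids.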

codecov · 2017-10-23T22:15:16Z

Codecov Report

Merging #17956 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17956      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50103      -10     
==========================================
- Hits        45723    45704      -19     
- Misses       4390     4399       +9

Flag	Coverage Δ
#multiple	`89.03% <100%> (-0.01%)`	⬇️
#single	`40.3% <71.42%> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexing.py	`92.8% <100%> (-0.02%)`	⬇️
pandas/core/generic.py	`92.51% <100%> (-0.03%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e1dabf3...28c5056.

codecov · 2017-10-23T22:15:21Z

Codecov Report

Merging #17956 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17956      +/-   ##
==========================================
- Coverage   91.24%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50165    50155      -10     
==========================================
- Hits        45775    45756      -19     
- Misses       4390     4399       +9

Flag	Coverage Δ
#multiple	`89.04% <100%> (-0.01%)`	⬇️
#single	`40.32% <71.42%> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexing.py	`92.8% <100%> (-0.02%)`	⬇️
pandas/core/generic.py	`92.42% <100%> (-0.03%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 79498e3...efe021d.

jreback · 2017-10-23T23:11:25Z

pandas/core/generic.py

            setattr(cls, name, property(_indexer, doc=indexer.__doc__))

-            # add to our internal names set


hmm I think you need to leave this. but only an asv will tell us. can you run the asv's for indexing (and add one which replicates the issue that initiated this PR).

Ok, I ran the benchmarks. The only significant difference is that .ix becomes slower, because the deprecation warning is emitted at each instantiation. Since it's deprecated anyway, I'm not sure performance is important.
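
A rough sketch of why the warning cost now recurs, assuming (as the comment above implies) that a fresh indexer is built on every attribute access. `DeprecatedIndexer`, `Frame`, and `count_warnings` are hypothetical names for illustration only, not pandas code.

```python
import warnings


class DeprecatedIndexer:
    """Hypothetical indexer that warns whenever it is instantiated."""

    def __init__(self, obj):
        warnings.warn("this indexer is deprecated", DeprecationWarning,
                      stacklevel=2)
        self.obj = obj


class Frame:
    """Hypothetical container that builds a fresh indexer on every access."""

    @property
    def ix(self):
        return DeprecatedIndexer(self)


def count_warnings(n_accesses):
    """Access Frame.ix n_accesses times and count the warnings issued."""
    frame = Frame()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not suppress repeated warnings
        for _ in range(n_accesses):
            frame.ix
    return len(caught)
```

Each access goes through `warnings.warn`, so the overhead is paid per lookup rather than once per object.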

pitrou · 2017-10-24T18:13:15Z

One of the CircleCI jobs died mysteriously; I don't know why: https://circleci.com/gh/pandas-dev/pandas/5803#tests/containers/1

jreback

small comments. actually will have you add a release note in perf section, but for 0.22, but it doesn't exist yet, will be created after release (shortly).

jreback · 2017-10-25T10:39:47Z

asv_bench/benchmarks/indexing.py

+    goal_time = 0.2
+
+    def setup(self):
+        self.s = Series(range(10))


can you add for .loc (and might as well for .ix)

bluenote10 · 2017-10-25T12:00:57Z

pandas/tests/indexing/test_indexing.py

@@ -881,6 +882,14 @@ def test_partial_boolean_frame_indexing(self):
                                columns=list('ABC'))
        tm.assert_frame_equal(result, expected)

+    def test_no_reference_cycle(self):
+        df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
+        for name in ('loc', 'iloc', 'ix', 'at', 'iat'):


Would it make sense to do a sys.getrefcount(df) before and after and assert that it doesn't change? Or is this assertion too strong?

There is no need for that, the weakref-based test below already tests that no leak happens.

I was only wondering if the weakref can become None "by chance" if the garbage collector happens to run exactly between the del and the assertion. This is of course highly unlikely, but my understanding was that the None result does not directly imply that there wasn't a cyclic reference. But I could be wrong about that.

Well, in that case the test would fail most of the time.

Why most of the time? It is very unlikely, so it would fail rarely, right?

You can check that this assertion erroneously passes with pandas 0.20.3 where there are definitely cyclic references by forcing an "accidental" GC:

In [8]: def test_no_reference_cycle(happens_to_run_gc):
   ...:     import weakref, gc, sys, pandas as pd
   ...:     df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
   ...:     refcount_before = sys.getrefcount(df)
   ...:     for name in ('loc', 'iloc', 'ix', 'at', 'iat'):
   ...:         getattr(df, name)
   ...:     refcount_after = sys.getrefcount(df)
   ...:     print("ref counts {} -> {}".format(refcount_before, refcount_after))
   ...:     wr = weakref.ref(df)
   ...:     del df
   ...:     if happens_to_run_gc:
   ...:         gc.collect()
   ...:     assert wr() is None
   ...:

You said « the weakref can become None "by chance" ». If that's the only reason the test succeeds, then it would fail most of the time, since the chance of a GC happening between two consecutive opcodes is slim.

I'm not sure what your code snippet is supposed to prove.

What I'm trying to say: Checking for wr() is None does not guarantee that the operations are ref-cycle-free, which I thought was the purpose of the test. That's why I would have preferred refcount_before == refcount_after, but due to the low probability of an accidental success it doesn't really matter.
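
The disagreement above can be made concrete with a small standard-library sketch. It relies on CPython's reference counting and uses hypothetical `Owner` and `Indexer` classes rather than pandas objects. The cyclic collector is disabled for the duration of the check, which rules out the "accidental" collection being discussed.

```python
import gc
import weakref


class Owner:
    """Hypothetical stand-in for an object that may cache an indexer."""


class Indexer:
    """Hypothetical stand-in for an indexer holding a reference to its owner."""

    def __init__(self, obj):
        self.obj = obj


def dies_on_del(make_cycle):
    """Return True if the owner is freed by refcounting alone, right after del."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # no cyclic collection can run between `del` and the check
    try:
        owner = Owner()
        indexer = Indexer(owner)
        if make_cycle:
            owner.cached_indexer = indexer  # owner -> indexer -> owner
        wr = weakref.ref(owner)
        del owner, indexer
        freed = wr() is None
        gc.collect()  # clean up the cycle, if one was created
        return freed
    finally:
        if gc_was_enabled:
            gc.enable()
```

Without the cycle the weak reference is dead immediately after `del`; with the cycle it stays alive until a collection runs. With the collector enabled, the weakref check on its own could in principle pass by chance, which is the rare case raised above.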

jreback

can you add a note in 0.22.0 perf section

jreback · 2017-10-27T20:33:46Z

thanks @pitrou very nice!

ENH: avoid creating reference cycle on indexing (pandas-dev#15746)

9600371

jreback requested changes Oct 23, 2017

Use a dedicated indexing.pyx module in pandas._libs

ce2c6b5

jreback reviewed Oct 23, 2017

gfyoung added the Clean and Internals (Related to non-user accessible pandas implementation) labels Oct 23, 2017

Fix failed test on Python 2.7

28c5056

Actually (hopefully) fix test failure on 2.7

39feabd

pitrou force-pushed the no_index_refcycle branch from bf8dea0 to 39feabd Compare October 23, 2017 22:42

jreback reviewed Oct 23, 2017

Add index lookup benchmark

8e28ff8

jreback approved these changes Oct 25, 2017

jreback added this to the 0.22.0 milestone Oct 25, 2017

bluenote10 reviewed Oct 25, 2017

bluenote10 mentioned this pull request Oct 25, 2017

CLN: make _Indexer.obj a weak ref #15746

Closed

Add similar benchmarks for .iloc and .ix

8f07a17

jreback requested changes Oct 27, 2017

pitrou added 2 commits October 27, 2017 13:03

Merge branch 'master' of https://github.com/pandas-dev/pandas into no…

01ce596

…_index_refcycle

Add whatsnew entry

efe021d

jreback approved these changes Oct 27, 2017

jreback merged commit 37c9cea into pandas-dev:master Oct 27, 2017

pitrou deleted the no_index_refcycle branch October 27, 2017 20:34

pitrou mentioned this pull request Oct 27, 2017

Fail to garbage collect Pandas dataframes dask/distributed#956

Closed

peterpanmj pushed a commit to peterpanmj/pandas that referenced this pull request Oct 31, 2017

ENH: avoid creating reference cycle on indexing (pandas-dev#15746) (p…

d2e8287

…andas-dev#17956)

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

ENH: avoid creating reference cycle on indexing (pandas-dev#15746) (p…

5232f97

…andas-dev#17956)

kayibal mentioned this pull request Jun 1, 2018

Support for pandas 0.23.0 datarevenue-berlin/sparsity#45

Closed
