Skip to content

ENH: avoid creating reference cycle on indexing (#15746) #17956

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 27, 2017

Conversation

pitrou
Copy link
Contributor

@pitrou pitrou commented Oct 23, 2017

@pitrou
Copy link
Contributor Author

pitrou commented Oct 23, 2017

I wonder if I should add a whatsnew entry for this, and if so, in which file?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no whatsnew needed here. this is not user visible.

@@ -1839,5 +1839,27 @@ cdef class BlockPlacement:
return self._as_slice


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a new cython module. indexing.pyx (and add to setup.py).

@jreback
Copy link
Contributor

jreback commented Oct 23, 2017

though you could add in the perf section if you want (but you will need to wait for us to create the 0.22 notes); releasing 0.21.0 shortly.

@pitrou
Copy link
Contributor Author

pitrou commented Oct 23, 2017

The latest changes should address your comments.

self.obj = obj
self.ndim = obj.ndim
self.name = name

def __call__(self, axis=None):
# we need to return a copy of ourselves
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you reverse these on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so that we can use functools.partial(indexer, name) to shave a few % more. If that's not important, I can revert to the original order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is not a big deal to reverse.

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Oct 23, 2017
@codecov
Copy link

codecov bot commented Oct 23, 2017

Codecov Report

Merging #17956 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17956      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50103      -10     
==========================================
- Hits        45723    45704      -19     
- Misses       4390     4399       +9
Flag Coverage Δ
#multiple 89.03% <100%> (-0.01%) ⬇️
#single 40.3% <71.42%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexing.py 92.8% <100%> (-0.02%) ⬇️
pandas/core/generic.py 92.51% <100%> (-0.03%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e1dabf3...28c5056. Read the comment docs.

@codecov
Copy link

codecov bot commented Oct 23, 2017

Codecov Report

Merging #17956 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17956      +/-   ##
==========================================
- Coverage   91.24%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50165    50155      -10     
==========================================
- Hits        45775    45756      -19     
- Misses       4390     4399       +9
Flag Coverage Δ
#multiple 89.04% <100%> (-0.01%) ⬇️
#single 40.32% <71.42%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexing.py 92.8% <100%> (-0.02%) ⬇️
pandas/core/generic.py 92.42% <100%> (-0.03%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79498e3...efe021d. Read the comment docs.

@pitrou pitrou force-pushed the no_index_refcycle branch from bf8dea0 to 39feabd Compare October 23, 2017 22:42
setattr(cls, name, property(_indexer, doc=indexer.__doc__))

# add to our internal names set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I think you need to leave this. but only an asv will tell us. can you run the asv's for indexing (and add one which replicates the issue that initiated this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I ran the benchmarks. The only significant difference is that .ix becomes slower, because the deprecation warning is emitter at each instantiation. Since it's deprecated anyway, I'm not sure performance is important.

@pitrou
Copy link
Contributor Author

pitrou commented Oct 24, 2017

One of the Circle-Ci jobs died mysteriously, I don't know why: https://circleci.com/gh/pandas-dev/pandas/5803#tests/containers/1

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments. actually will have you add a release note in perf section, but for 0.22, but it doesn't exist yet, will be created after release (shortly).

goal_time = 0.2

def setup(self):
self.s = Series(range(10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add for .loc (and might as well for .ix)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

@jreback jreback added this to the 0.22.0 milestone Oct 25, 2017
@@ -881,6 +882,14 @@ def test_partial_boolean_frame_indexing(self):
columns=list('ABC'))
tm.assert_frame_equal(result, expected)

def test_no_reference_cycle(self):
df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
for name in ('loc', 'iloc', 'ix', 'at', 'iat'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to do a sys.getrefcount(df) before and after and assert that it doesn't change? Or is this assertion too strong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for that, the weakref-based test below already tests that no leak happens.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was only wondering if the weakref can become None "by chance" if the garbage collector happens to run exactly between the del and the assertion. This is of course highly unlikely, but my understanding was that the None result does not directly imply that there wasn't a cyclic reference. But I could be wrong about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, In that case the test would fail most of the time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why most of the time? It is very unlikely, so it would fail rarely, right?

You can check that this assertion erroneously passes with pandas 0.20.3 where there are definitely cyclic references by forcing an "accidental" GC:

In [8]:     def test_no_reference_cycle(happens_to_run_gc):
   ...:         import weakref, gc, sys, pandas as pd
   ...:         df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
   ...:         refcount_before = sys.getrefcount(df)
   ...:         for name in ('loc', 'iloc', 'ix', 'at', 'iat'):
   ...:             getattr(df, name)
   ...:         refcount_after = sys.getrefcount(df)
   ...:         print("ref counts {} -> {}".format(refcount_before, refcount_after))
   ...:         wr = weakref.ref(df)
   ...:         del df
   ...:         if happens_to_run_gc:
   ...:             gc.collect()
   ...:         assert wr() is None
   ...:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said « the weakref can become None "by chance" ». If that's the only reason the test succeeds, then it would fail most of the time, since the chance of a GC happening between two consecutive opcodes is slim.

I'm not sure what your code snippet is supposed to prove.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm trying to say: Checking for wr() is None does not guarantee that the operations are ref-cycle-free, which I though was the purpose of the test. That's why I would have preferred refcount_before == refcount_after but due to the low probability of an accidental success it doesn't really matter.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a note in 0.22.0 perf section

@jreback jreback merged commit 37c9cea into pandas-dev:master Oct 27, 2017
@jreback
Copy link
Contributor

jreback commented Oct 27, 2017

thanks @pitrou very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CLN: make _Indexer.obj a weak ref
4 participants