-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: avoid creating reference cycle on indexing (#15746) #17956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I wonder if I should add a whatsnew entry for this, and if so, in which file? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no whatsnew needed here. this is not user visible.
pandas/_libs/lib.pyx
Outdated
@@ -1839,5 +1839,27 @@ cdef class BlockPlacement: | |||
return self._as_slice | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a new cython module. indexing.pyx (and add to setup.py).
though you could add in the perf section if you want (but you will need to wait for us to create the 0.22 notes); releasing 0.21.0 shortly. |
The latest changes should address your comments. |
self.obj = obj | ||
self.ndim = obj.ndim | ||
self.name = name | ||
|
||
def __call__(self, axis=None): | ||
# we need to return a copy of ourselves |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you reverse these on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so that we can use functools.partial(indexer, name)
to shave a few % more. If that's not important, I can revert to the original order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is not a big deal to reverse.
Codecov Report
@@ Coverage Diff @@
## master #17956 +/- ##
==========================================
- Coverage 91.23% 91.22% -0.02%
==========================================
Files 163 163
Lines 50113 50103 -10
==========================================
- Hits 45723 45704 -19
- Misses 4390 4399 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17956 +/- ##
==========================================
- Coverage 91.24% 91.22% -0.02%
==========================================
Files 163 163
Lines 50165 50155 -10
==========================================
- Hits 45775 45756 -19
- Misses 4390 4399 +9
Continue to review full report at Codecov.
|
bf8dea0
to
39feabd
Compare
setattr(cls, name, property(_indexer, doc=indexer.__doc__)) | ||
|
||
# add to our internal names set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I think you need to leave this. but only an asv will tell us. can you run the asv's for indexing (and add one which replicates the issue that initiated this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I ran the benchmarks. The only significant difference is that .ix
becomes slower, because the deprecation warning is emitter at each instantiation. Since it's deprecated anyway, I'm not sure performance is important.
One of the Circle-Ci jobs died mysteriously, I don't know why: https://circleci.com/gh/pandas-dev/pandas/5803#tests/containers/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments. actually will have you add a release note in perf section, but for 0.22, but it doesn't exist yet, will be created after release (shortly).
goal_time = 0.2 | ||
|
||
def setup(self): | ||
self.s = Series(range(10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add for .loc
(and might as well for .ix
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done.
@@ -881,6 +882,14 @@ def test_partial_boolean_frame_indexing(self): | |||
columns=list('ABC')) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_no_reference_cycle(self): | |||
df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]}) | |||
for name in ('loc', 'iloc', 'ix', 'at', 'iat'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to do a sys.getrefcount(df)
before and after and assert that it doesn't change? Or is this assertion too strong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for that, the weakref
-based test below already tests that no leak happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was only wondering if the weakref
can become None
"by chance" if the garbage collector happens to run exactly between the del
and the assertion. This is of course highly unlikely, but my understanding was that the None
result does not directly imply that there wasn't a cyclic reference. But I could be wrong about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, In that case the test would fail most of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why most of the time? It is very unlikely, so it would fail rarely, right?
You can check that this assertion erroneously passes with pandas 0.20.3 where there are definitely cyclic references by forcing an "accidental" GC:
In [8]: def test_no_reference_cycle(happens_to_run_gc):
...: import weakref, gc, sys, pandas as pd
...: df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
...: refcount_before = sys.getrefcount(df)
...: for name in ('loc', 'iloc', 'ix', 'at', 'iat'):
...: getattr(df, name)
...: refcount_after = sys.getrefcount(df)
...: print("ref counts {} -> {}".format(refcount_before, refcount_after))
...: wr = weakref.ref(df)
...: del df
...: if happens_to_run_gc:
...: gc.collect()
...: assert wr() is None
...:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You said « the weakref can become None "by chance" ». If that's the only reason the test succeeds, then it would fail most of the time, since the chance of a GC happening between two consecutive opcodes is slim.
I'm not sure what your code snippet is supposed to prove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm trying to say: Checking for wr() is None
does not guarantee that the operations are ref-cycle-free, which I though was the purpose of the test. That's why I would have preferred refcount_before == refcount_after
but due to the low probability of an accidental success it doesn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note in 0.22.0 perf section
thanks @pitrou very nice! |
git diff upstream/master -u -- "*.py" | flake8 --diff