Skip to content

ENH: avoid creating reference cycle on indexing (#15746) #17956

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 27, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions asv_bench/benchmarks/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -287,3 +287,19 @@ def setup(self):

def time_subset(self):
self.p.ix[(self.inds, self.inds, self.inds)]


class IndexerLookup(object):
goal_time = 0.2

def setup(self):
self.s = Series(range(10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add for .loc (and might as well for .ix)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.


def time_lookup_iloc(self):
self.s.iloc

def time_lookup_ix(self):
self.s.ix

def time_lookup_loc(self):
self.s.loc
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.22.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

-
- Indexers on Series or DataFrame no longer create a reference cycle (:issue:`17956`)
-
-

Expand Down
22 changes: 22 additions & 0 deletions pandas/_libs/indexing.pyx
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# cython: profile=False

cdef class _NDFrameIndexerBase:
'''
A base class for _NDFrameIndexer for fast instantiation and attribute
access.
'''
cdef public object obj, name, _ndim

def __init__(self, name, obj):
self.obj = obj
self.name = name
self._ndim = None

@property
def ndim(self):
# Delay `ndim` instantiation until required as reading it
# from `obj` isn't entirely cheap.
ndim = self._ndim
if ndim is None:
ndim = self._ndim = self.obj.ndim
return ndim
16 changes: 2 additions & 14 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# pylint: disable=W0231,E1101
import collections
import functools
import warnings
import operator
import weakref
Expand Down Expand Up @@ -1796,23 +1797,10 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True,
@classmethod
def _create_indexer(cls, name, indexer):
"""Create an indexer like _name in the class."""

if getattr(cls, name, None) is None:
iname = '_%s' % name
setattr(cls, iname, None)

def _indexer(self):
i = getattr(self, iname)
if i is None:
i = indexer(self, name)
setattr(self, iname, i)
return i

_indexer = functools.partial(indexer, name)
setattr(cls, name, property(_indexer, doc=indexer.__doc__))

# add to our internal names set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I think you need to leave this. but only an asv will tell us. can you run the asv's for indexing (and add one which replicates the issue that initiated this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I ran the benchmarks. The only significant difference is that .ix becomes slower, because the deprecation warning is emitter at each instantiation. Since it's deprecated anyway, I'm not sure performance is important.

cls._internal_names_set.add(iname)

def get(self, key, default=None):
"""
Get item from object for given key (DataFrame column, Panel slice,
Expand Down
16 changes: 6 additions & 10 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
from pandas.core.common import (is_bool_indexer, _asarray_tuplesafe,
is_null_slice, is_full_slice,
_values_from_object)
from pandas._libs.indexing import _NDFrameIndexerBase


# the supported indexers
Expand Down Expand Up @@ -85,19 +86,14 @@ class IndexingError(Exception):
pass


class _NDFrameIndexer(object):
class _NDFrameIndexer(_NDFrameIndexerBase):
_valid_types = None
_exception = KeyError
axis = None

def __init__(self, obj, name):
self.obj = obj
self.ndim = obj.ndim
self.name = name

def __call__(self, axis=None):
# we need to return a copy of ourselves
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you reverse these on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so that we can use functools.partial(indexer, name) to shave a few % more. If that's not important, I can revert to the original order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is not a big deal to reverse.

new_self = self.__class__(self.obj, self.name)
new_self = self.__class__(self.name, self.obj)

if axis is not None:
axis = self.obj._get_axis_number(axis)
Expand Down Expand Up @@ -1321,7 +1317,7 @@ class _IXIndexer(_NDFrameIndexer):

"""

def __init__(self, obj, name):
def __init__(self, name, obj):

_ix_deprecation_warning = textwrap.dedent("""
.ix is deprecated. Please use
Expand All @@ -1332,8 +1328,8 @@ def __init__(self, obj, name):
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated""") # noqa

warnings.warn(_ix_deprecation_warning,
DeprecationWarning, stacklevel=3)
super(_IXIndexer, self).__init__(obj, name)
DeprecationWarning, stacklevel=2)
super(_IXIndexer, self).__init__(name, obj)

def _has_valid_type(self, key, axis):
if isinstance(key, slice):
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/indexing/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@

import pytest

import weakref
from warnings import catch_warnings
from datetime import datetime

Expand Down Expand Up @@ -881,6 +882,14 @@ def test_partial_boolean_frame_indexing(self):
columns=list('ABC'))
tm.assert_frame_equal(result, expected)

def test_no_reference_cycle(self):
df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
for name in ('loc', 'iloc', 'ix', 'at', 'iat'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to do a sys.getrefcount(df) before and after and assert that it doesn't change? Or is this assertion too strong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for that, the weakref-based test below already tests that no leak happens.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was only wondering if the weakref can become None "by chance" if the garbage collector happens to run exactly between the del and the assertion. This is of course highly unlikely, but my understanding was that the None result does not directly imply that there wasn't a cyclic reference. But I could be wrong about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, In that case the test would fail most of the time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why most of the time? It is very unlikely, so it would fail rarely, right?

You can check that this assertion erroneously passes with pandas 0.20.3 where there are definitely cyclic references by forcing an "accidental" GC:

In [8]:     def test_no_reference_cycle(happens_to_run_gc):
   ...:         import weakref, gc, sys, pandas as pd
   ...:         df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
   ...:         refcount_before = sys.getrefcount(df)
   ...:         for name in ('loc', 'iloc', 'ix', 'at', 'iat'):
   ...:             getattr(df, name)
   ...:         refcount_after = sys.getrefcount(df)
   ...:         print("ref counts {} -> {}".format(refcount_before, refcount_after))
   ...:         wr = weakref.ref(df)
   ...:         del df
   ...:         if happens_to_run_gc:
   ...:             gc.collect()
   ...:         assert wr() is None
   ...:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said « the weakref can become None "by chance" ». If that's the only reason the test succeeds, then it would fail most of the time, since the chance of a GC happening between two consecutive opcodes is slim.

I'm not sure what your code snippet is supposed to prove.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm trying to say: Checking for wr() is None does not guarantee that the operations are ref-cycle-free, which I though was the purpose of the test. That's why I would have preferred refcount_before == refcount_after but due to the low probability of an accidental success it doesn't really matter.

getattr(df, name)
wr = weakref.ref(df)
del df
assert wr() is None


class TestSeriesNoneCoercion(object):
EXPECTED_RESULTS = [
Expand Down
2 changes: 2 additions & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -335,6 +335,7 @@ class CheckSDist(sdist_class):
'pandas/_libs/index.pyx',
'pandas/_libs/algos.pyx',
'pandas/_libs/join.pyx',
'pandas/_libs/indexing.pyx',
'pandas/_libs/interval.pyx',
'pandas/_libs/hashing.pyx',
'pandas/_libs/testing.pyx',
Expand Down Expand Up @@ -519,6 +520,7 @@ def pxd(name):
'depends': _pxi_dep['join']},
'_libs.reshape': {'pyxfile': '_libs/reshape',
'depends': _pxi_dep['reshape']},
'_libs.indexing': {'pyxfile': '_libs/indexing'},
'_libs.interval': {'pyxfile': '_libs/interval',
'pxdfiles': ['_libs/hashtable'],
'depends': _pxi_dep['interval']},
Expand Down