Skip to content

Bug Ensure cache_readonly is only applied on instance #40329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Mar 9, 2021

The property cache was using the attribute _cache which occasionally is used by other classes as a class attribute. Most prominently this is true for all subclasses of PandasExtensionDtype, see

_cache: Dict[str_type, PandasExtensionDtype] = {}

This would cause the same cash to be used for the property caching as for the extension dtype. The dtype cache is a class attribute whereas the property cache is supposed to be instance only. I fixed this by using a longer name for the property caching attribute, hoping this would lead to less collisions.

Introducing a new attribute leads to failures for Cython extension types which previously were handled with an AttributeError. Firstly, I believe these cases should not silently pass and secondly the previous behaviour was wrong by not executing the function w/out cache but it returned the instance of the CachedProperty class instead. I addressed this but enabling the caching for extension classes requires defining the attribute explicitly. I went ahead and created a mixin class for this but I'm open to other suggestions as well.

  • closes N/A
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

assert instanceA.expensive_property == "A"
assert instanceA.counter == 1

assert instanceB.expensive_property == "B"
Copy link
Member Author

@fjetter fjetter Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on master this returns A

@fjetter fjetter force-pushed the cache_readonly_instance branch from b3255dd to 1e41bac Compare March 9, 2021 16:22
@fjetter fjetter added the Bug label Mar 9, 2021
@fjetter fjetter mentioned this pull request Mar 9, 2021
4 tasks
@fjetter fjetter changed the title [Bug] Ensure cache_readonly is only applied on instance Bug Ensure cache_readonly is only applied on instance Mar 9, 2021
@fjetter fjetter force-pushed the cache_readonly_instance branch 3 times, most recently from 1e0e6da to 7ccc9ca Compare March 9, 2021 16:41
@@ -601,6 +601,7 @@ ExtensionArray

- Bug in :meth:`DataFrame.where` when ``other`` is a :class:`Series` with :class:`ExtensionArray` dtype (:issue:`38729`)
- Fixed bug where :meth:`Series.idxmax`, :meth:`Series.idxmin` and ``argmax/min`` fail when the underlying data is :class:`ExtensionArray` (:issue:`32749`, :issue:`33719`, :issue:`36566`)
- Bug where some properties of subclasses of :class:``PandasExtensionDtype`` where improperly cached :issue:`40329`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where -> were

parens around the issue tag

single-ticks around PandasExtensionDtype

@jbrockmendel
Copy link
Member

Wouldn't it be simpler to just not instantiate _cache on the class?

@fjetter
Copy link
Member Author

fjetter commented Mar 10, 2021

Wouldn't it be simpler to just not instantiate _cache on the class?

We don't need a cache per instance not per class.

@fjetter fjetter force-pushed the cache_readonly_instance branch 2 times, most recently from fb5661f to d0fbc0b Compare March 10, 2021 12:53
@fjetter
Copy link
Member Author

fjetter commented Mar 10, 2021

I believe renaming the property cache is not the best idea since this is used all over the place. The actual problem is that these extension type caches

_cache: Dict[str_type, PandasExtensionDtype] = {}

are class attributes but share the same name as the commonly used instance attribute and I do not know how best to distinguish them. I would propose to rename the dtypes cache but I'm not sure if this is perceived as a breaking change or not

@jbrockmendel
Copy link
Member

I would propose to rename the dtypes cache but I'm not sure if this is perceived as a breaking change or not

It would not. Go for it.

@jbrockmendel
Copy link
Member

Why not go the other direction and change PandasExtensionDtype._cache to e.g. PandasExtensionDtype._instance_cache

@fjetter fjetter force-pushed the cache_readonly_instance branch from d0fbc0b to 628a509 Compare April 5, 2021 10:17
@fjetter
Copy link
Member Author

fjetter commented Apr 5, 2021

Sorry for not responding. I was out of office for a while.

I added a runtime check to the property cache implementation which is supposed to ensure that we're not accessing any class wide state. I'm also ensuring that we're actually using the cache for ExtensionTypes and raise an appropriate exception if we do not. This led me down the serialization implementation of the Offset classes where most implement a custom serialization function although there is one reduce implemented which should serve all and ensures that the classes are properly instantiated. Otherwise we run into this runtime check I implemented.

@jbrockmendel

@fjetter fjetter force-pushed the cache_readonly_instance branch from 628a509 to abe7603 Compare April 5, 2021 13:46
@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels Apr 5, 2021
@@ -895,13 +895,6 @@ cdef class Tick(SingleConstructorOffset):

raise ApplyTypeError(f"Unhandled type: {type(other).__name__}")

# --------------------------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what were these hitting? or were they at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly do not know if these were used or not. I ran into the issue that a lot of the BaseOffset children were not properly initialized. I ended up initializing _cache in every class but noticed they should all do fine using the __reduce__ in

def __reduce__(self):
# This __reduce__ implementation is for all BaseOffset subclasses
# except for RelativeDeltaOffset
# np.busdaycalendar objects do not pickle nicely, but we can reconstruct
# from attributes that do get pickled.
tup = tuple(
getattr(self, attr) if attr != "calendar" else None
for attr in self._attributes
)
return type(self), tup

They are all tested in

def test_pickles(current_pickle_data, legacy_pickle):
if not is_platform_little_endian():
pytest.skip("known failure on non-little endian")
version = os.path.basename(os.path.dirname(legacy_pickle))
with catch_warnings(record=True):
simplefilter("ignore")
compare(current_pickle_data, legacy_pickle, version)

with the data of

@@ -23,12 +23,29 @@ cdef class CachedProperty:
return self

# Get the cache or set a default one if needed
cache = getattr(obj, '_cache', None)
cache = instance_cache = getattr(obj, '_cache', None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is any of this necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really know how to test this and implemented this runtime check. I think it's not absolutely necessary if one follows the convention properly but I believe it is very easy to make a mistake here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


def test_no_cache_attribute_on_class():
   for cls in list_of_all_relevant_classes_somehow():
       assert not hasattr(cls, "_cache")

@@ -86,7 +86,7 @@ class PandasExtensionDtype(ExtensionDtype):
base: Optional[DtypeObj] = None
isbuiltin = 0
isnative = 0
_cache: Dict[str_type, PandasExtensionDtype] = {}
_cache_dtypes: Dict[str_type, PandasExtensionDtype] = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes in this file and in pandas/tests/dtypes/test_dtypes.py make sense, the rest i think are unnecessary and in the case of cache_readonly harmful

@fjetter
Copy link
Member Author

fjetter commented Apr 5, 2021

I get that the complexity in properties.pyx is not necessary. I wanted to have some kind of protection that this is not happening again and this runtime check is the only way I could come up with. Code runs either way with the changes in dtype but I could not come up with another way of testing this behaviour. If this is fine without a dedicated test, I can revert the cython code

@fjetter fjetter force-pushed the cache_readonly_instance branch from b46e5ee to 3cd66a2 Compare April 6, 2021 11:27
@fjetter
Copy link
Member Author

fjetter commented Apr 6, 2021

I reverted the changes in the cython code base. What's left is the renaming of the ExtensionDtype._cache

@jreback jreback added this to the 1.3 milestone Apr 8, 2021
@jreback
Copy link
Contributor

jreback commented Apr 8, 2021

@fjetter looks fine, can you merge master and ping on green

@fjetter fjetter force-pushed the cache_readonly_instance branch from 3cd66a2 to cf30dc7 Compare April 9, 2021 08:49
@fjetter
Copy link
Member Author

fjetter commented Apr 12, 2021

This was closed by #40193

@fjetter fjetter closed this Apr 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants