ENH: Added public accessor registrar #18827

TomAugspurger · 2017-12-18T21:17:00Z

Adds new methods for registering custom accessors on pandas objects.

This will be helpful for implementing #18767
outside of pandas.

If we accept this, I'm not sure it belongs in the top-level namespace, where I have it currently. What would be a good home for it? pd.api.extensions?

Closes #14781


pep8speaks · 2017-12-18T21:17:02Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 15, 2018 at 12:34 Hours UTC

gfyoung · 2017-12-18T21:30:56Z

@TomAugspurger : A couple of points:

Re: your question, it depends on how much importance you put on the resource. Personally, I'm fine with exposing it in the top-level namespace. xarray does this, so one could argue that we should do the same for consistency. Unless you're using this solely for internal stuff, then I don't see why not.
Could you explain the differences / changes you needed to make to port this over to pandas? That would be very useful for understanding why you needed to add this to pandas.

gfyoung · 2017-12-18T21:33:53Z

doc/source/internals.rst

+      >>> ds.geo.center
+      (5.0, 10.0)
+      >>> ds.geo.plot()
+      # plots data on a map


To make this less abstract, you probably will need plots and some more concrete code to actually plot these points.

I'm OK with this being abstract. The point is the accessor, not the functionality provided by it, if that makes sense.

jbrockmendel · 2017-12-18T22:55:44Z

There might be some useful things to pull out of the tests for the now-abandoned #17042. I wrote a moderately fleshed out example for that.

codecov · 2017-12-19T01:21:37Z

Codecov Report

Merging #18827 into master will decrease coverage by 0.01%.
The diff coverage is 93.91%.

@@            Coverage Diff             @@
##           master   #18827      +/-   ##
==========================================
- Coverage   91.55%   91.53%   -0.02%     
==========================================
  Files         147      148       +1     
  Lines       48812    48837      +25     
==========================================
+ Hits        44690    44704      +14     
- Misses       4122     4133      +11

Flag	Coverage Δ
#multiple	`89.91% <93.91%> (-0.02%)`	⬇️
#single	`41.7% <57.39%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.62% <100%> (ø)`	⬆️
pandas/api/extensions/__init__.py	`100% <100%> (ø)`
pandas/core/accessor.py	`98.7% <100%> (+4.95%)`	⬆️
pandas/core/series.py	`94.61% <100%> (ø)`	⬆️
pandas/core/indexes/base.py	`96.46% <100%> (ø)`	⬆️
pandas/errors/__init__.py	`100% <100%> (ø)`	⬆️
pandas/core/categorical.py	`95.78% <100%> (ø)`	⬆️
pandas/core/base.py	`96.77% <100%> (ø)`	⬆️
pandas/core/indexes/accessors.py	`90% <88%> (+0.63%)`	⬆️
pandas/core/strings.py	`98.17% <92.85%> (-0.3%)`	⬇️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 787ab55...fd40244. Read the comment docs.

shoyer · 2017-12-19T02:16:07Z

pandas/core/accessor.py

+
+
+# Ported with modifications from xarray
+# https://github.com/pydata/xarray/blob/master/xarray/core/extensions.py


Indeed, please do let me know exactly what you needed to change :)

Here's the relevant diff between xarray and pandas:

@TomAugspurger : For better or worse, that's a much less significant change than I expected. Maybe we should add a comment about that.

jreback · 2017-12-19T11:10:18Z

this should not be in the main public namespace.

TomAugspurger · 2017-12-19T15:28:31Z

Here's the relevant diff between xarray and pandas:

diff --git a/xarray.py b/pandas.py
index 48b2cba..fc3dd6b 100644
--- a/xarray.py
+++ b/pandas.py
@@ -11,9 +11,10 @@ class _CachedAccessor(object):
         try:
             accessor_obj = self._accessor(obj)
         except AttributeError:
-            # __getattr__ on data object will swallow any AttributeErrors raised
-            # when initializing the accessor, so we need to raise as something
-            # else (GH933):
+            # TODO
+            # __getattr__ on data object will swallow any AttributeErrors
+            # raised when initializing the accessor, so we need to raise
+            # as something else (GH933):
             msg = 'error initializing %r accessor.' % self._name
             if PY2:
                 msg += ' Full traceback:\n' + traceback.format_exc()
@@ -30,9 +31,9 @@ def _register_accessor(name, cls):
     def decorator(accessor):
         if hasattr(cls, name):
             warnings.warn(
-                'registration of accessor %r under name %r for type %r is '
-                'overriding a preexisting attribute with the same name.'
-                % (accessor, name, cls),
+                'registration of accessor {!r} under name {!r} for type '
+                '{!r} is overriding a preexisting attribute with the same '
+                'name.'.format(accessor, name, cls),
                 AccessorRegistrationWarning,
                 stacklevel=2)
         setattr(cls, name, _CachedAccessor(name, accessor))

(My TODO marks the open question of whether that matters for pandas.) Otherwise, the only change is that we're trying to use new-style string formatting.
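For readers who have not seen the xarray module, the pattern in the diff above can be sketched without pandas at all: a non-data descriptor builds the accessor on first access and caches it on the instance, and a decorator warns when the name is already taken. This is only an illustration under assumptions: `Frame`, `Geo`, and the public-looking names `CachedAccessor` and `register_accessor` are hypothetical stand-ins, and plain `UserWarning` is used as the warning category.

```python
import warnings


class CachedAccessor:
    """Non-data descriptor: build the accessor once per instance, then cache it."""

    def __init__(self, name, accessor):
        self._name = name
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            # Accessed on the class itself, e.g. Frame.geo
            return self._accessor
        accessor_obj = self._accessor(obj)
        # Store on the instance so later lookups bypass this descriptor.
        object.__setattr__(obj, self._name, accessor_obj)
        return accessor_obj


def register_accessor(name, cls):
    """Return a decorator that attaches `accessor` to `cls` under `name`."""

    def decorator(accessor):
        if hasattr(cls, name):
            warnings.warn(
                "registration of accessor {!r} under name {!r} for type "
                "{!r} is overriding a preexisting attribute with the same "
                "name.".format(accessor, name, cls),
                UserWarning,
                stacklevel=2,
            )
        setattr(cls, name, CachedAccessor(name, accessor))
        return accessor

    return decorator


class Frame:  # hypothetical stand-in for a pandas object
    def __init__(self, data):
        self.data = data


@register_accessor("geo", Frame)
class Geo:
    def __init__(self, obj):
        self._obj = obj

    @property
    def total(self):
        return sum(self._obj.data)


frame = Frame([1, 2, 3])
assert frame.geo is frame.geo  # same cached object on repeated access
assert frame.geo.total == 6
```

Because the descriptor defines only `__get__`, the value written into the instance dictionary takes precedence on every later lookup, which is what makes the accessor "cached".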

jreback · 2017-12-19T23:42:53Z

pandas/core/accessor.py

+# https://github.com/pydata/xarray/blob/master/xarray/core/extensions.py
+
+
+class _CachedAccessor(object):


Is there a reason you are not using this for the internal accessors?

jreback · 2017-12-19T23:43:52Z

pandas/extensions/__init__.py

@@ -0,0 +1,4 @@
+"""Public API for extending panadas objects."""


this is better, but still not a fan of extending the top-level namespaces (e.g. with extensions). api.extensions would be better

jreback · 2017-12-19T23:44:02Z

pandas/tests/api/test_api.py

@@ -41,6 +41,13 @@ class TestPDApi(Base):
    # misc
    misc = ['IndexSlice', 'NaT']

+    # extension points


jreback · 2017-12-21T19:46:44Z

pandas/tests/test_register_accessor.py

+from pandas.errors import AccessorRegistrationWarning
+
+
+@contextlib.contextmanager


you can just use the pytest monkeypatch fixture for this
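For context, the helper under discussion wraps a test so that a freshly registered accessor does not leak into other tests. The thread shows only the decorator line, so the following is a guess at the shape of such a helper; the name `ensure_removed` and the `Dummy` class are assumptions made for illustration.

```python
import contextlib


@contextlib.contextmanager
def ensure_removed(obj, attr):
    """Delete `attr` from `obj` on exit, even if the body raised."""
    try:
        yield
    finally:
        try:
            delattr(obj, attr)
        except AttributeError:
            pass  # never set, or already removed


class Dummy:  # hypothetical stand-in for Series / DataFrame / Index
    pass


with ensure_removed(Dummy, "mine"):
    Dummy.mine = "registered"
    assert Dummy.mine == "registered"

assert not hasattr(Dummy, "mine")
```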

TomAugspurger · 2018-01-02T17:41:17Z

Updated.

xarray had to catch AttributeError raised when creating the accessor because of how their __getattr__ works (pydata/xarray#933, "rc 0.8: register_dataset_accessor silently ignores AttributeError"). I don't think pandas needs to do this.
The extra level for pd.api.extensions rather than pd.extensions seems unnecessary (flat is better than nested and all that), but OK either way.
pytest.monkeypatch seems to be for functions, not instances of objects, unless I'm mistaken.
I have a slight preference for focusing this PR on just adding the register_*_accessor functions, but I can also refactor the internals to use them. Really though, these are independent. The only real change would be to use pd.api.extensions.register_series_accessor("str")(StringMethods) instead of str = accessor.AccessorProperty(strings.StringMethods). The register* decorators are just for registering, not for implementing accessors themselves.
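The AttributeError point in the first item can be reproduced in plain Python. When a class defines `__getattr__`, any AttributeError raised during normal lookup, including one raised inside a descriptor while it constructs an accessor, is replaced by whatever `__getattr__` does. The classes below are hypothetical stand-ins, not xarray or pandas code.

```python
class BrokenAccessor:
    def __init__(self, obj):
        # Simulated bug inside the accessor's constructor.
        raise AttributeError("missing 'lat' column")


class Accessor:
    """Descriptor that builds the accessor on each access (no caching)."""

    def __init__(self, accessor):
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            return self._accessor
        return self._accessor(obj)


class Plain:
    geo = Accessor(BrokenAccessor)


class WithGetattr:
    geo = Accessor(BrokenAccessor)

    def __getattr__(self, name):
        # Called whenever normal lookup raises AttributeError,
        # including one raised inside the descriptor above.
        raise AttributeError(
            "{!r} object has no attribute {!r}".format(type(self).__name__, name)
        )


def error_message(obj):
    try:
        obj.geo
    except AttributeError as exc:
        return str(exc)


print(error_message(Plain()))        # the original message survives
print(error_message(WithGetattr()))  # replaced by the generic fallback
```

Re-raising the constructor's failure as a different exception type, as the xarray code does with RuntimeError, keeps the real cause visible in the second case.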

jreback · 2018-01-04T00:45:41Z

I have a slight preference to focusing this PR on just adding the register_*_accessor, but I can also refactor the internals to use it. Really though, these are independent. The only real change would be to pd.api.extensions.register_series_accessor("str")(StringMethods) instead of str = accessor.AccessorProperty(strings.StringMethods). The register* decorators are just for registering, not for implementing accessors themselves.

I disagree. The best use of this is to actually implement the internals. If it works, great, if not then you have found a bug / feature lacking.

TomAugspurger · 2018-01-04T20:37:52Z

Refactored to have our accessors use it.

All the accessors are registered in pandas.core.register_accessors after the classes have been initialized
I ran into a subtle issue around caching. The accessor from xarray is cached such that x.foo is the same object every time. With our old AccessorProperty, we recreated the accessor every time. This caused some issues when the object itself was mutable, but the accessor cached some state. Notably, this would occur:

In [9]: s = pd.Series(pd.TimedeltaIndex(['5S', '10S']))

In [10]: s.dt.values.isna()
Out[10]: array([False, False], dtype=bool)

In [11]: s[0] = float('nan')

# isna has been cached on the accessor's .values, but not invalidated
# when s was mutated
In [12]: s.dt.values.isna()
Out[12]: array([ False, False], dtype=bool)

To work around this, I added a cache keyword argument (@shoyer would xarray be interested in this keyword?)

Moved the new docs to "developer.rst" instead of internals, since it's for "downstream applications" (we should probably move the subclassing docs there)
Removed the now unused AccessorProperty class
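The stale-state problem described above does not depend on pandas. In this sketch, `Box`, `SnapshotAccessor`, and `Cached` are hypothetical stand-ins: the accessor computes derived state once in its constructor, the descriptor caches the accessor on the instance, and a later mutation of the parent goes unnoticed.

```python
class Box:  # hypothetical stand-in for a mutable Series
    def __init__(self, values):
        self.values = list(values)


class SnapshotAccessor:
    """Computes derived state once, in __init__."""

    def __init__(self, obj):
        self._missing = [v is None for v in obj.values]

    def isna(self):
        return self._missing


class Cached:
    """Descriptor that stores the accessor on the instance after first use."""

    def __init__(self, name, accessor):
        self._name = name
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            return self._accessor
        accessor_obj = self._accessor(obj)
        object.__setattr__(obj, self._name, accessor_obj)
        return accessor_obj


Box.dt = Cached("dt", SnapshotAccessor)

box = Box([5, 10])
before = box.dt.isna()
box.values[0] = None   # mutate the parent after the accessor was cached
after = box.dt.isna()  # same accessor object, same stale snapshot
assert after == [False, False]
```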

shoyer · 2018-01-04T20:52:58Z

I ran into a subtle issue around caching. The accessor from xarray is cached such that x.foo is the same object every time. With our old AccessorProperty, we recreated the accessor every time. This caused some issues when the object itself was mutable, but the accessor cached some state.

To work around this, I added a cache keyword argument (@shoyer would xarray be interested in this keyword?)

This seems like an inadvertent detail of the implementation of Series.dt.values (why do we even have that property?). The datetime accessor creates a DatetimeIndex internally for method/properties. DatetimeIndex.values is fixed, because pandas.Index objects are immutable.

So the other fix would be to change how the .dt accessor is implemented so that DatetimeIndex objects are created on demand when a method/property is called, instead of being created once and saved on the accessor.

I don't think we need a cache=False option for xarray since nobody has written any accessors relying on this behavior.
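The on-demand alternative can be sketched as follows, again with hypothetical stand-ins (`Box`, `OnDemandAccessor`) rather than the real `.dt` machinery. The accessor keeps only a reference to its parent and rebuilds any derived object inside each method call, so caching the accessor itself stays safe.

```python
class Box:  # hypothetical stand-in for a mutable Series
    def __init__(self, values):
        self.values = list(values)


class OnDemandAccessor:
    """Keeps only a reference to the parent; derives everything when asked."""

    def __init__(self, obj):
        self._obj = obj

    def _get_values(self):
        # Rebuilt on every call, so it always reflects the parent's current data.
        return tuple(self._obj.values)

    def isna(self):
        return [v is None for v in self._get_values()]


box = Box([5, 10])
accessor = OnDemandAccessor(box)  # imagine this object being cached on `box`
assert accessor.isna() == [False, False]
box.values[0] = None
assert accessor.isna() == [True, False]
```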

jbrockmendel · 2018-01-04T20:55:15Z

Making it easy to extend functionality is great, but the new commit that changes where the current str/cat/dt attributes are defined I'm very much -1 on. Inheritance notwithstanding, the class definition should be WYSIWYG. Newbies shouldn't have to track down where e.g. DataFrame.pivot_table is defined.

TomAugspurger · 2018-01-04T21:01:36Z

This seems like an inadvertent detail of the implementation of Series.dt.values (why do we even have that property?).

.values isn't actually in dir(s.dt). Should probably be ._values but that's how it was before. Regardless, I'll see how difficult it is to create them on demand.

the class definition should be WYSIWYG

That's my preference as well, we just won't be able to / have to use the register_*_accessor functions, since those rely on the classes being defined.

1. Removed optional caching 2. Refactored `Properties` to create the indexes it uses on demand 3. Moved accessor definitions to classes for clarity

TomAugspurger · 2018-01-04T22:40:04Z

doc/source/whatsnew/v0.23.0.txt

@@ -119,6 +119,56 @@ Current Behavior

    s.rank(na_option='top')

+
+Extending Pandas Objects with New Accessors


Thoughts on removing this section in the whatsnew? This may be too niche for most users to care about.

jreback

looks pretty good. some api consistency questions.

jreback · 2018-01-05T13:51:42Z

pandas/core/accessor.py

+# 1. We don't need to catch and re-raise AttributeErrors as RuntimeErrors
+
+
+class _CachedAccssor(object):


name this CachedAccessor; we use this internally, so private is not appropriate

jreback · 2018-01-05T13:52:54Z

pandas/core/accessor.py

+        # NDFrame
+        object.__setattr__(obj, self._name, accessor_obj)
+        return accessor_obj
+


it might make sense to have a few methods here which are AbstractMethodError to guide the api, eg. validate

I'd prefer to leave that open to the person implementing the accessor. Perhaps for some accessors, there's no need for validation.

this is too general; we need to be opinionated. if there is no validation then that should be explicit.

Not sure what you mean by "too general". The purpose of this is to give an officially blessed way to do "DataFrame.foo = MyAccessor"; what you do past that point is up to the library. It's not like CachedAccessor ever has a chance to see the data so that it can validate things.

The only restriction on the accessor is that its init method be __init__(self, pandas_obj). I'll document that, but I'm not sure what else there is to be done.
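To make the contract concrete, here is a sketch of two accessor classes that satisfy it: each takes the parent object as the single argument to `__init__`, one validates and one does not. `Table` is a hypothetical stand-in for a DataFrame, and the `lat`/`lon` column names and the error message are assumptions chosen to echo the `geo` example from the docs.

```python
class Table:  # hypothetical stand-in for a DataFrame
    def __init__(self, columns):
        self.columns = dict(columns)


class GeoAccessor:
    """Validates on construction; an accessor may equally skip this step."""

    def __init__(self, pandas_obj):
        self._validate(pandas_obj)
        self._obj = pandas_obj

    @staticmethod
    def _validate(obj):
        if "lat" not in obj.columns or "lon" not in obj.columns:
            raise AttributeError("Must have 'lat' and 'lon'.")

    @property
    def center(self):
        lat = self._obj.columns["lat"]
        lon = self._obj.columns["lon"]
        return (sum(lon) / len(lon), sum(lat) / len(lat))


class PlotAccessor:
    """Nothing to check, so __init__ only stores the parent object."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj


table = Table({"lat": [0.0, 20.0], "lon": [0.0, 10.0]})
assert GeoAccessor(table).center == (5.0, 10.0)
```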

jreback · 2018-01-05T13:53:16Z

pandas/core/categorical.py

-        self.index = index
-        self.name = name
+    def __init__(self, data):
+        self._validate(data)


see my comment above, these can be non-private I think as these are not user exposed

jreback · 2018-01-05T13:54:30Z

pandas/core/indexes/accessors.py

-    data : Series
-    copy : boolean, default False
-           copy the input data
+    def _get_values(self):


generally would like these to be non-private

I think it's best for helper methods and attributes on the accessor to be private. Imagine if we wanted to delegate a method like .get_values, so foo.bar.get_values. A public get_values helper on the accessor would collide with the delegated method and break it.
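The name-collision concern can be shown with a toy delegating accessor. This sketch delegates through `__getattr__`, which is a simplification chosen for brevity and not necessarily how pandas wires up delegation; `Values` and both accessor classes are hypothetical.

```python
class Values:  # hypothetical object the accessor delegates to
    def get_values(self):
        return "delegated result"


class PublicHelperAccessor:
    def __init__(self, obj):
        self._target = obj

    def get_values(self):
        # Internal helper with a public name: it wins over delegation.
        return "internal helper"

    def __getattr__(self, name):
        # Only reached when normal lookup fails.
        return getattr(self._target, name)


class PrivateHelperAccessor:
    def __init__(self, obj):
        self._target = obj

    def _get_values(self):
        return "internal helper"

    def __getattr__(self, name):
        return getattr(self._target, name)


assert PublicHelperAccessor(Values()).get_values() == "internal helper"
assert PrivateHelperAccessor(Values()).get_values() == "delegated result"
```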

jreback · 2018-01-05T13:55:38Z

pandas/core/indexes/accessors.py


-    @classmethod
-    def _make_accessor(cls, data):
+    def __new__(cls, data):


can we be consistent for validation, IOW either do it in __new__ or in __init__?

I had to put this in __new__ since we don't instantiate a CombinedDatetimelikeProperties, just one of its parents. I think this is unavoidable, but it would be nice to make it more consistent with the others. I'll see what I can do.

One option is to write a Properties._validate to do most of the work here. But it feels weird to have a base class decide which child class should be instantiated. I think having it in __new__ (with a comment for why it's there) is clearest.
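The reason `__new__` is the natural place can be seen in a reduced example. When `__new__` returns an object that is not an instance of the class being called, Python skips that class's `__init__`, so any dispatch or validation has to happen inside `__new__`. The class names below are simplified stand-ins, and the type checks use the standard `datetime` module instead of pandas dtypes.

```python
import datetime


class Properties:
    def __init__(self, data):
        self._data = data


class DatetimeProperties(Properties):
    kind = "datetime"


class TimedeltaProperties(Properties):
    kind = "timedelta"


class CombinedProperties(DatetimeProperties, TimedeltaProperties):
    """Never instantiated itself; __new__ picks one of the parents."""

    def __new__(cls, data):
        # Validation and dispatch live here because the returned object is
        # not an instance of `cls`, so `cls.__init__` would never run.
        if all(isinstance(v, datetime.datetime) for v in data):
            return DatetimeProperties(data)
        if all(isinstance(v, datetime.timedelta) for v in data):
            return TimedeltaProperties(data)
        raise AttributeError("Can only use .dt accessor with datetimelike values")


props = CombinedProperties([datetime.timedelta(seconds=5)])
assert type(props) is TimedeltaProperties
```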

TomAugspurger · 2018-01-08T11:50:57Z

@jreback @jbrockmendel thoughts on the refactored version? Anything else you'd like to see changed?

jbrockmendel · 2018-01-08T18:11:34Z

thoughts on the refactored version?

Completely addresses my WYSIWYG complaint, thanks.

TomAugspurger · 2018-01-09T20:28:49Z

this is too general; we need to be opinionated. if there is no validation then that should be explicit.

@jreback could you clarify your thoughts here? I've documented that the only requirement on the user-defined accessor is that its __init__ will be called with pandas_obj. I don't think there's anything the _CachedAccessor can / should do w.r.t. validation. With an accessor like .plot there's nothing to be done, so it doesn't make sense for accessors to implement a method that does nothing.

jreback

ok, looks fine. one question on the warning.

jreback · 2018-01-10T00:31:18Z

pandas/errors/__init__.py

@@ -65,3 +65,7 @@ class MergeError(ValueError):
    Error raised when problems arise during merging due to problems
    with input data. Subclass of `ValueError`.
    """
+
+
+class AccessorRegistrationWarning(Warning):


do we really need a special warning here?

Switched to a UserWarning, which matches what we do with df.x = [1, 2, 3].

ENH: Added public accessor registrar

6ae52d4

Adds new methods for registering custom accessors on pandas objects. This will be helpful for implementing pandas-dev#18767 outside of pandas. Closes pandas-dev#14781

TomAugspurger added API Design Enhancement labels Dec 18, 2017

PEP8

998bb28

gfyoung reviewed Dec 18, 2017


shoyer reviewed Dec 19, 2017


TomAugspurger added 3 commits December 19, 2017 09:57

Moved to extensions

9b20a5c

More docs

9005e1c

Fix see also

33a9f3f

jreback requested changes Dec 19, 2017


Merge remote-tracking branch 'upstream/master' into accessor-decorator

27c6af0

jreback requested changes Dec 21, 2017


TomAugspurger added 6 commits January 2, 2018 10:38

Merge remote-tracking branch 'upstream/master' into accessor-decorator

35db58d

DOC: Added whatsnew

11edc42

Move to api

682bb84

Update post review

964356f

flake8

ec505e4

Raise the underlying error instead of a RuntimeError

e76cecf

TomAugspurger added 4 commits January 4, 2018 07:10

str validate

19e9fa0

DOC: Moved to developer

c1c498c

REF: Use public registrars for accessors

ecc1cd7

Merge remote-tracking branch 'upstream/master' into accessor-decorator

663542e

TomAugspurger added 5 commits January 4, 2018 12:49

Document cache

d910a0f

Tests passing

8bcd412

Use for plot

ded3513

Fix autodoc

632f097

Fix the class instantiation

5dc4d05

TomAugspurger added 3 commits January 4, 2018 15:39

Refactor again.

28865d7

1. Removed optional caching 2. Refactored `Properties` to create the indexes it uses on demand 3. Moved accessor definitions to classes for clarity

Fix API files

3bf4889

Remove stale comment

dea5d17

TomAugspurger commented Jan 4, 2018


TomAugspurger added 3 commits January 5, 2018 06:03

Tests pass

9559f12

DOC: some cleanup

b00b0f8

No need to assign doc

f03777f

jreback requested changes Jan 5, 2018


TomAugspurger added 2 commits January 5, 2018 08:28

Rename, shared docs

1bedf9f

Doc __new__

018facd

jreback requested changes Jan 10, 2018


TomAugspurger added 4 commits January 10, 2018 07:19

Use UserWarning

a308a2e

Update test

66e2207

Merge remote-tracking branch 'upstream/master' into accessor-decorator

1c75879

Merge remote-tracking branch 'upstream/master' into accessor-decorator

fd40244

TomAugspurger merged commit eee83e2 into pandas-dev:master Jan 16, 2018

TomAugspurger deleted the accessor-decorator branch January 16, 2018 00:43

chris-b1 mentioned this pull request Jan 18, 2018

Subclass pandas DataFrame with required argument #19300

Closed


