Skip to content

ENH: Added public accessor registrar #18827

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Jan 16, 2018

Conversation

TomAugspurger
Copy link
Contributor

Adds new methods for registing custom accessors to pandas objects.

This will be helpful for implementing #18767
outside of pandas.

If we accept this I'm not sure it belongs in the top-level namespace where I have it currently. What would a good home be for this be? pd.api.extensions?

Closes #14781

Adds new methods for registing custom accessors to pandas objects.

This will be helpful for implementing pandas-dev#18767
outside of pandas.

Closes pandas-dev#14781
@pep8speaks
Copy link

pep8speaks commented Dec 18, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 15, 2018 at 12:34 Hours UTC

@gfyoung
Copy link
Member

gfyoung commented Dec 18, 2017

@TomAugspurger : A couple of points:

  1. Re: your question, it depends on how much importance you put on the resource. Personally, I'm fine with exposing it in the top-level namespace. xarray does this, so one could argue that we should do the same for consistency. Unless you're using this solely for internal stuff, then I don't see why not.

  2. Could you explain the differences / changes you needed to make for this to port over to pandas ? That would be very useful for understanding why you needed to add this to pandas.

>>> ds.geo.center
(5.0, 10.0)
>>> ds.geo.plot()
# plots data on a map
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this less abstract, you probably will need plots and some more concrete code to actually plot these points.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this being abstract. The point is the accessor, not the functionality provided by it, if that makes sense.

@jbrockmendel
Copy link
Member

There might be some useful things to pull out of the tests for the now-abandoned #17042. I wrote a moderately fleshed out example for that.

@codecov
Copy link

codecov bot commented Dec 19, 2017

Codecov Report

Merging #18827 into master will decrease coverage by 0.01%.
The diff coverage is 93.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18827      +/-   ##
==========================================
- Coverage   91.55%   91.53%   -0.02%     
==========================================
  Files         147      148       +1     
  Lines       48812    48837      +25     
==========================================
+ Hits        44690    44704      +14     
- Misses       4122     4133      +11
Flag Coverage Δ
#multiple 89.91% <93.91%> (-0.02%) ⬇️
#single 41.7% <57.39%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.62% <100%> (ø) ⬆️
pandas/api/extensions/__init__.py 100% <100%> (ø)
pandas/core/accessor.py 98.7% <100%> (+4.95%) ⬆️
pandas/core/series.py 94.61% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.46% <100%> (ø) ⬆️
pandas/errors/__init__.py 100% <100%> (ø) ⬆️
pandas/core/categorical.py 95.78% <100%> (ø) ⬆️
pandas/core/base.py 96.77% <100%> (ø) ⬆️
pandas/core/indexes/accessors.py 90% <88%> (+0.63%) ⬆️
pandas/core/strings.py 98.17% <92.85%> (-0.3%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 787ab55...fd40244. Read the comment docs.



# Ported with modifications from xarray
# https://github.com/pydata/xarray/blob/master/xarray/core/extensions.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, please do let me know exactly what you needed to change :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the relevant diff between xarray and pandas:

@TomAugspurger : For better or worse, that's a much less significant change than I expected. Maybe should add a comment about that.

@jreback
Copy link
Contributor

jreback commented Dec 19, 2017

this should not be in the main public namespace.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 19, 2017

Here's the relevant diff between xarray and pandas:

diff --git a/xarray.py b/pandas.py
index 48b2cba..fc3dd6b 100644
--- a/xarray.py
+++ b/pandas.py
@@ -11,9 +11,10 @@ class _CachedAccessor(object):
         try:
             accessor_obj = self._accessor(obj)
         except AttributeError:
-            # __getattr__ on data object will swallow any AttributeErrors raised
-            # when initializing the accessor, so we need to raise as something
-            # else (GH933):
+            # TODO
+            # __getattr__ on data object will swallow any AttributeErrors
+            # raised when initializing the accessor, so we need to raise
+            # as something else (GH933):
             msg = 'error initializing %r accessor.' % self._name
             if PY2:
                 msg += ' Full traceback:\n' + traceback.format_exc()
@@ -30,9 +31,9 @@ def _register_accessor(name, cls):
     def decorator(accessor):
         if hasattr(cls, name):
             warnings.warn(
-                'registration of accessor %r under name %r for type %r is '
-                'overriding a preexisting attribute with the same name.'
-                % (accessor, name, cls),
+                'registration of accessor {!r} under name {!r} for type '
+                '{!r} is overriding a preexisting attribute with the same '
+                'name.'.format(accessor, name, cls),
                 AccessorRegistrationWarning,
                 stacklevel=2)
         setattr(cls, name, _CachedAccessor(name, accessor))

(my TODO is for whether or not that matters for pandas). Otherwise, we're trying to use new-style string formatting.

# https://github.com/pydata/xarray/blob/master/xarray/core/extensions.py


class _CachedAccessor(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you are not using this for the internal accessors?

@@ -0,0 +1,4 @@
"""Public API for extending panadas objects."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is better, but still not a fan of extending the top-level namespaces (e.g. with extensions). api.extensions would be better

@@ -41,6 +41,13 @@ class TestPDApi(Base):
# misc
misc = ['IndexSlice', 'NaT']

# extension points
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

from pandas.errors import AccessorRegistrationWarning


@contextlib.contextmanager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use the pytest monkeypatch fixture for this

@TomAugspurger
Copy link
Contributor Author

Updated.

  1. xarray had to catch AttributeError raised when creating the accessor due to how their __getitem__ works. I don't think pandas needs to do this (rc 0.8: register_dataset_accessor silently ignores AttributeError pydata/xarray#933)
  2. The extra level for pd.api.extensions rather than pd.extensions seems unnecessary (flat is better than nested and all that), but OK either way
  3. pytest.monkeypatch seems to be for functions, not instances of objects, unless I'm mistaken
  4. I have a slight preference to focusing this PR on just adding the register_*_accessor, but I can also refactor the internals to use it. Really though, these are independent. The only real change would be to pd.api.extensions.resgister_series_accessor("str")(StringMethods) instead of str = accessor.AccessorProperty(strings.StringMethods). The register* decorators are just for registering, not for implementing accessors themselves.

@jreback
Copy link
Contributor

jreback commented Jan 4, 2018

I have a slight preference to focusing this PR on just adding the register__accessor, but I can also refactor the internals to use it. Really though, these are independent. The only real change would be to pd.api.extensions.resgister_series_accessor("str")(StringMethods) instead of str = accessor.AccessorProperty(strings.StringMethods). The register decorators are just for registering, not for implementing accessors themselves.

I disagree. The best use of this is to actually implement the internals. If it works, great, if not then you have found a bug / feature lacking.

@TomAugspurger
Copy link
Contributor Author

Refactored to have our accessors use it.

  1. All the accessors are registered in pandas.core.register_accessors after the classes have been initialized
  2. I ran into a subtle issue around caching. The accessor from xarray is cached such that x.foo is the same object everytime. With our old AccessorProperty, we recreated the accessor every time. This caused some issues when the object itself was mutable, but the accessor cached some state. Notably, this would occur:
In [9]: s = pd.Series(pd.TimedeltaIndex(['5S', '10S']))

In [10]: s.dt.values.isna()
Out[10]: array([False, False], dtype=bool)

In [11]: s[0] = float('nan')

# isna has been cached on the accessor's .values, but not invalidated
# when s was mutated
In [12]: s.dt.values.isna()
Out[12]: array([ False, False], dtype=bool)

To work around this, I added a cache keyword argument (@shoyer would xarray be interested in this keyword?)

  1. Moved the new docs to "developer.rst" instead of internals, since it's for "downstream applications" (we should probably move the subclassing docs there)
  2. Removed the now unused AccessorProperty class

@shoyer
Copy link
Member

shoyer commented Jan 4, 2018

I ran into a subtle issue around caching. The accessor from xarray is cached such that x.foo is the same object everytime. With our old AccessorProperty, we recreated the accessor every time. This caused some issues when the object itself was mutable, but the accessor cached some state.

To work around this, I added a cache keyword argument (@shoyer would xarray be interested in this keyword?)

This seems like an inadvertent detail of the implementation of Series.dt.values (why do we even have that property?). The datetime accessor creates a DatetimeIndex internally for method/properties. DatetimeIndex.values is fixed, because pandas.Index objects are immutable.

So the other fix would be to change how the .dt accessor implemented so that DatetimeIndex objects are created on demand when a method/property is called, instead of being created once and saved on the acccessor.

I don't think we need a cache=False option for xarray since nobody has written any accessors relying on this behavior.

@jbrockmendel
Copy link
Member

Making it easy to extend functionality is great, but the new commit that changes where the current str/cat/dt attributes are defined I'm very much -1 on. Inheritance notwithstanding, the class definition should be WYSIWYG. Newbies shouldn't have to track down where e.g. DataFrame.pivot_table is defined.

@TomAugspurger
Copy link
Contributor Author

This seems like an inadvertent detail of the implementation of Series.dt.values (why do we even have that property?).

.values isn't actually in dir(s.dt). Should probably be ._values but that's how it was before. Regardless, I'll see how difficult it is to create them on demand.

the class definition should be WYSIWYG

That's my preference as well, we just won't be able to / have to use the register_*_accessor functions, since those rely on the classes being defined.

1. Removed optional caching
2. Refactored `Properties` to create the indexes it uses on demand
3. Moved accessor definitions to classes for clarity
@@ -119,6 +119,56 @@ Current Behavior

s.rank(na_option='top')


Extending Pandas Objects with New Accessors
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on removing this section in the whatsnew? This may be too niche for most users to care about.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. some api consistency questions.

# 1. We don't need to catch and re-raise AttributeErrors as RuntimeErrors


class _CachedAccssor(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name this: CachedAccessor we use this internally so private is not appropriate

# NDFrame
object.__setattr__(obj, self._name, accessor_obj)
return accessor_obj

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make sense to have a few methods here which are AbstractMethodError to guide the api, eg. validate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to leave that open to the person implementing the accessor. Perhaps for some accessors, there's no need for validation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too general we need to be opionated, if there is no validation then that should be explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean be "too general". The purpose of this is to give an officially blessed way to do "DataFrame.foo = MyAccessor"; what you do past that point is up to the library. It's not like CachedAccessor ever has a chance to see the data so that it can validate things.

The only restriction on the accessor is that its init method be __init__(self, pandas_obj). I'll document that, but I'm not sure what else there is to be done.

self.index = index
self.name = name
def __init__(self, data):
self._validate(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my commet above, these can be non-private I think as these are not user exposed

data : Series
copy : boolean, default False
copy the input data
def _get_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally would like these to be non-private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best for methods and attributes on the accessor to be private. Imagine if we wanted to delegate a method like .get_values, so foo.bar.get_values. This would break the accessor, since its get_values would be broken.


@classmethod
def _make_accessor(cls, data):
def __new__(cls, data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be consistent for validation, IOW either do it in __new__ or in __init__?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to put this in __new__ since we don't instantiate a CombinedDatetimelikeProperties, just one of its parents. I think this is unavoidable, but it would be nice to make it more consistent with the others. I'll see what I can do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is to write a Properties._validate do most of the work here. But it feels weird to have a base class decide which child class should be instantiated. I think having it in __new__ (with a comment for why it's there) is clearest.

@TomAugspurger
Copy link
Contributor Author

@jreback @jbrockmendel thoughts on the refactored version? Anything else you'd like to see changed?

@jbrockmendel
Copy link
Member

thoughts on the refactored version?

Completely addresses my WYSIWYG complaint, thanks.

@TomAugspurger
Copy link
Contributor Author

this is too general we need to be opionated, if there is no validation then that should be explicit.

@jreback could you clarify your thoughts here? I've documented that the only requirement on the user-defined accessor is that it's __init__ will be called with pandas_obj. I don't think there's anything the _CachedAccessor can / should do w.r.t. validation. With an accessor like .plot there's nothing to be done, so it doesn't make sense for accessors to implement a method that does nothing.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, looks fine. one question on the warning.

@@ -65,3 +65,7 @@ class MergeError(ValueError):
Error raised when problems arise during merging due to problems
with input data. Subclass of `ValueError`.
"""


class AccessorRegistrationWarning(Warning):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need a special warning here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to a UserWarning, which matches what we do with df.x = [1, 2, 3].

@TomAugspurger TomAugspurger merged commit eee83e2 into pandas-dev:master Jan 16, 2018
@TomAugspurger TomAugspurger deleted the accessor-decorator branch January 16, 2018 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants