Skip to content

API: interaction between subclasses (implementing __finalize__) and attrs/flags handling in __finalize__ #37099

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorisvandenbossche opened this issue Oct 13, 2020 · 5 comments
Labels
API Design metadata _metadata, .attrs Subclassing Subclassing pandas objects

Comments

@jorisvandenbossche
Copy link
Member

Pandas provides the mechanism for subclasses to customize how metadata propagation is handled through _metadata and __finalize__. However, this got more complicated with the introduction of attrs and flags, which are also handled in __finalize__.

Concrete example: GeoPandas implements a custom GeoDataFrame.__finalize__ to be able to have custom logic about how metadata is propagated in methods.
See here for the current implementation: https://github.com/geopandas/geopandas/blob/7044aa479dbae72c14dfb647a5a782a01cff81d1/geopandas/geodataframe.py#L1148-L1162 (it specifically checks for method="merge"|"concat" to ensure we pass through _metadata of the first (concat) or left (merge) object).

For a long time, pandas only add a very simple __finalize__ implementation (which just propagated _metadata unconditionally):

    def __finalize__(self, other, method=None, **kwargs):
        """
        Propagate metadata from other to self.
        Parameters
        ----------
        other : the object from which to get the attributes that we are going
            to propagate
        method : optional, a passed method name ; possibly to take different
            types of propagation actions based on this
        """
        if isinstance(other, NDFrame):
            for name in self._metadata:
                object.__setattr__(self, name, getattr(other, name, None))
        return self

However, with the recent changes, the default __finalize__ now also does (code link to current finalize impl):

  • propagate attrs
  • propagate flags.allows_duplicate_labels and have custom logic for method="concat" for this

As a result of GeoPandas overriding __finalize__, the attrs and flags features currently don't work for GeoDataFrames. See geopandas/geopandas#1654 for the bug report on the GeoPandas side (cc @martinfleis)
While we certainly want to enable those features for GeoDataFrames as well.


So the main question is: given this additional logic in the default __finalize__, how can we enable that subclasses can still override a part of that logic (eg only _metadata).

GeoPandas could in principle copy the implementation of pandas (and adds its own customizations to it) and keep up to date with it. But, that doesn't feel like the ideal solution, certainly if pandas is going to add more functionality to it (eg more flags).

cc @TomAugspurger

@jorisvandenbossche jorisvandenbossche added Bug Needs Triage Issue that has not been reviewed by a pandas team member API Design and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 13, 2020
@jorisvandenbossche
Copy link
Member Author

Hmm, for some reason I thought we couldn't do a "simple" super().__finalize__(other, method=method, **kwargs) in geopandas to combine pandas' logic with our own (I was thinking that both parts of the logic could not be done after each other, but would really need to be combined / interleaved). But in hindsight, calling super() first and then adding our custom logic for merge/concat migth actually work.

A patch to GeoPandas like

--- a/geopandas/geodataframe.py
+++ b/geopandas/geodataframe.py
@@ -894,6 +894,8 @@ class GeoDataFrame(GeoPandasBase, DataFrame):
 
     def __finalize__(self, other, method=None, **kwargs):
         """propagate metadata from other to self """
+        self = super().__finalize__(other, method=method, **kwargs)
+
         # merge operation: using metadata of the left object
         if method == "merge":
             for name in self._metadata:
@@ -902,9 +904,6 @@ class GeoDataFrame(GeoPandasBase, DataFrame):
         elif method == "concat":
             for name in self._metadata:
                 object.__setattr__(self, name, getattr(other.objs[0], name, None))
-        else:
-            for name in self._metadata:
-                object.__setattr__(self, name, getattr(other, name, None))
 
         return self

seems to still pass all our tests, and enables preserving attrs:

In [1]: df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities"))

In [2]: df.attrs['name']= 'my_name'

In [3]: df[:5].attrs
Out[3]: {'name': 'my_name'}

(so until there all my effort to clearly describe the issue .. ;-))

@TomAugspurger
Copy link
Contributor

Hehe, yeah that was my initial reaction. Sometimes it takes writing up an issue to clarify the solution though.

As long as the super() doesn't do anything you don't want done (and can't easily undo), then things should be OK.

@jorisvandenbossche
Copy link
Member Author

As long as the super() doesn't do anything you don't want done (and can't easily undo), then things should be OK.

Yes, that's probably the tricky part long term.
But currently, this seems OK. Pandas is passing through attrs when other is an NDFrame (not something geopandas wants to change), setting the duplicate flag (again not something geopandas wants to change), and propagating _metadata (something geopandas also did, but in the case of other being an NDFrame, we didn't have any custom logic for it, so we can defer this to pandas).
At the moment we only were adding logic (for merge and concat), but this can currently be cleanly done after pandas' logic.

@mroeschke mroeschke added metadata _metadata, .attrs Subclassing Subclassing pandas objects labels Aug 13, 2021
@rhshadrach
Copy link
Member

@jorisvandenbossche @TomAugspurger - can this be closed?

@TomAugspurger
Copy link
Contributor

I'm OK with closing for now. We could potentially split __finalize__ into a few different methods, but I'm concerned about the performance of calling too many functions here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design metadata _metadata, .attrs Subclassing Subclassing pandas objects
Projects
None yet
Development

No branches or pull requests

4 participants