series __finalize__ not correctly called in merge? #6923

wcbeard · 2014-04-22T13:39:38Z

I got some help from Jeff on Stack Overflow, but either I'm misunderstanding the way __finalize__ works, or there's a bug in how it's called. My intent was to preserve series metadata after two DataFrames are merged, and I believe __finalize__ should be able to handle this.

I define a couple dataframes, and assign metadata values to all the series:

import numpy as np
import pandas as pd
np.random.seed(10)
df1 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd'])
df1
     a  b
  0  1  1
  1  0  3
  2  0  1
df2
     c  d
  0  3  0
  1  1  1
  2  0  1

Then I assign the metadata field filename to each series:

pd.Series._metadata = ['name', 'filename']

for c1 in df1:
    df1[c1].filename = 'fname1.csv'
for c2 in df2:
    df2[c2].filename = 'fname2.csv'

Now I define __finalize__ for Series, which I understand can propagate metadata from one series to another, for example during a merge. But when I define a __finalize__ that prints the metadata I've already assigned, the metadata appears to be gone by the time __finalize__ is called.

def finalize_ser(self, other, method=None, **kwargs):
    # Show which metadata is visible on both objects at finalize time
    print('Self meta: {}'.format(getattr(self, 'filename', None)))
    print('Other meta: {}'.format(getattr(other, 'filename', None)))

    # Copy each metadata field from `other` onto `self`
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, ''))
    return self

pd.Series.__finalize__ = finalize_ser

When I call merge, the correct metadata is never printed:

df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Out[5]:
     a  b  c  d
  0  1  1  1  1
  1  0  3  0  1
  2  0  1  0  1

It appears the metadata is lost before the __finalize__ call, though it's still present on the original series:

mgd = df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')
df1.a.filename  # => 'fname1.csv'
mgd.a.filename  # => AttributeError

Is this expected or is there a bug?


jreback · 2014-04-22T14:05:32Z

I think you are misunderstanding what I wrote on SO.
It is impossible to keep the attached metadata on a series when it's part of a frame. The metadata is simply not kept, as the data is stored as an ndarray.

With a patch (#6924), the following will work. It will propagate metadata on the combination ONLY:

In [1]: np.random.seed(10)

In [2]: df1 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b'])

In [3]: df2 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd'])

In [4]: DataFrame._metadata = ['filename']

In [5]: df1.filename = 'fname1.csv'

In [6]: df2.filename = 'fname2.csv'

In [7]: def finalize(self, other, method=None, **kwargs):
   ...:     for name in self._metadata:
   ...:         if method == 'merge':
   ...:             left, right = other.left, other.right
   ...:             value = getattr(left, name, '') + '|' + getattr(right, name, '')
   ...:             object.__setattr__(self, name, value)
   ...:         else:
   ...:             object.__setattr__(self, name, getattr(other, name, ''))
   ...:     return self

In [9]: DataFrame.__finalize__ = finalize

In [10]: result = df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')

In [11]: result.filename
Out[11]: 'fname1.csv|fname2.csv'

What I think you may want is a multi-level frame where the filename is a level.
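A minimal sketch of that multi-level suggestion, assuming the two frames are combined side by side with pd.concat rather than merged. The level names 'filename' and 'column' are illustrative choices, not something from the thread, and the frames are rebuilt from the values printed earlier:

```python
import pandas as pd

# Same values as df1 / df2 shown earlier in the thread
df1 = pd.DataFrame({'a': [1, 0, 0], 'b': [1, 3, 1]})
df2 = pd.DataFrame({'c': [3, 1, 0], 'd': [0, 1, 1]})

# keys= adds an outer column level that records the source filename
# (the level names here are illustrative)
combined = pd.concat([df1, df2], axis=1,
                     keys=['fname1.csv', 'fname2.csv'],
                     names=['filename', 'column'])

# Selecting on the outer level recovers the columns that came from one file
from_first = combined['fname1.csv']

# The filename of every column can be read back from the column index
origin = dict(zip(combined.columns.get_level_values('column'),
                  combined.columns.get_level_values('filename')))
```

Because the filename lives in the column index rather than in an attribute, it is ordinary data and stays with the frame without relying on __finalize__.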

wcbeard · 2014-04-22T14:15:37Z

Interesting, so when I assign metadata/attributes to each series within a dataframe there's no mechanism to keep it? Metadata is only kept on series when treated independently from a dataframe?

If so, I should still be able to keep dataframe-level metadata as a dict of {Series name: attribute value}, which should still be propagated, correct?
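One way to picture the {Series name: attribute value} idea is the sketch below. It keeps the mapping in a plain dict next to each frame instead of going through _metadata and __finalize__, so it only illustrates the data structure, not the propagation hook. The names meta1, meta2 and merged_meta are hypothetical:

```python
import pandas as pd

# Same values as df1 / df2 shown earlier in the thread
df1 = pd.DataFrame({'a': [1, 0, 0], 'b': [1, 3, 1]})
df2 = pd.DataFrame({'c': [3, 1, 0], 'd': [0, 1, 1]})

# Hypothetical side tables: column name -> source filename
meta1 = {col: 'fname1.csv' for col in df1.columns}
meta2 = {col: 'fname2.csv' for col in df2.columns}

merged = df1.merge(df2, left_on='a', right_on='c', how='inner')

# Combine the two mappings, keeping only columns present in the result
all_meta = {**meta1, **meta2}
merged_meta = {col: all_meta[col] for col in merged.columns}
```

This assumes the two frames have no overlapping column names. If they did, merge would add suffixes such as _x and _y, and the lookup would need to account for the renamed columns.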

jreback · 2014-04-22T14:19:48Z

@d10genes yes, you could do that if you wanted to. The entire problem arises from how to combine them.

Imagine we supported this:

s1.filename='a'
s2.filename='b'

What is (s1+s2).filename?

You might look at https://github.com/kjordahl/geopandas; they did a full subclass of DataFrame/Series and DO propagate metadata.

wcbeard · 2014-04-22T14:21:46Z

That makes sense, I'll take a look. Thanks.

jreback · 2014-04-22T14:23:20Z

gr8 thanks!

jreback mentioned this issue Apr 22, 2014

BUG/INT: Internal tests for patching __finalize__ / bug in merge not finalizing (GH6923) #6924

Merged

jreback added Bug labels Apr 22, 2014

jreback added this to the 0.14.0 milestone Apr 22, 2014

wcbeard closed this as completed Apr 22, 2014

wcbeard mentioned this issue Apr 22, 2014

DataFrame.__finalize__ not called in pd.concat #6927

Closed

jreback mentioned this issue Apr 27, 2014

Allow custom metadata to be attached to panel/df/series? #2485

Closed

ljchang mentioned this issue Feb 11, 2018

FexSeries doesn't inherit metadata from Fex cosanlab/py-feat#26

Closed
