
series __finalized__ not correctly called in merge? #6923


Closed
wcbeard opened this issue Apr 22, 2014 · 5 comments · Fixed by #6924
Labels: Bug; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode); Testing (pandas testing functions or related to the test suite)
Milestone: 0.14.0

Comments

@wcbeard
Contributor

wcbeard commented Apr 22, 2014

I got some help from Jeff on Stack Overflow, but either I'm misunderstanding the way __finalize__ works, or there's a bug in how it's called. My intent was to preserve series metadata after two dataframes are merged, and I believe __finalize__ should be able to handle this.

I define a couple dataframes, and assign metadata values to all the series:

import numpy as np
import pandas as pd
np.random.seed(10)
df1 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd'])
df1
     a  b
  0  1  1
  1  0  3
  2  0  1
df2
     c  d
  0  3  0
  1  1  1
  2  0  1

Then I assign metadata field filename to series

pd.Series._metadata = ['name', 'filename']

for c1 in df1:
    df1[c1].filename = 'fname1.csv'
for c2 in df2:
    df2[c2].filename = 'fname2.csv'

Now I define __finalize__ for series, which I understand can propagate metadata from one series to another, for example during a merge. But when I define a __finalize__ that prints the metadata I've already assigned, it looks like the series no longer has the metadata by the time __finalize__ is called.

def finalize_ser(self, other, method=None, **kwargs):
    # Show what metadata is visible on both objects when __finalize__ runs
    print('Self meta: {}'.format(getattr(self, 'filename', None)))
    print('Other meta: {}'.format(getattr(other, 'filename', None)))

    # Copy every registered metadata field from other onto self
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, ''))
    return self

pd.Series.__finalize__ = finalize_ser

When I call merge, I never see the correct metadata printed off

df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Out[5]:
     a  b  c  d
  0  1  1  1  1
  1  0  3  0  1
  2  0  1  0  1

It appears the metadata is lost before it gets to the __finalize__ call, though it's still on the original series:

mgd = df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')
df1.a.filename  # => 'fname1.csv'
mgd.a.filename  # => AttributeError

Is this expected or is there a bug?

@jreback
Contributor

jreback commented Apr 22, 2014

I think you are misunderstanding what I wrote on SO.
It is impossible to keep the attached metadata on a series when it's part of a frame. The metadata is simply not kept, because the column is stored as an ndarray.

With a patch (#6924), the following will work. It will propagate metadata on the combination ONLY:

In [1]: np.random.seed(10)

In [2]: df1 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b'])

In [3]: df2 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd'])

In [4]: DataFrame._metadata = ['filename']

In [5]: df1.filename = 'fname1.csv'

In [6]: df2.filename = 'fname2.csv'

In [7]: def finalize(self, other, method=None, **kwargs):
   ...:     for name in self._metadata:
   ...:         if method == 'merge':
   ...:             left, right = other.left, other.right
   ...:             value = getattr(left, name, '') + '|' + getattr(right, name, '')
   ...:             object.__setattr__(self, name, value)
   ...:         else:
   ...:             object.__setattr__(self, name, getattr(other, name, ''))
   ...:     return self
In [9]: DataFrame.__finalize__ = finalize

In [10]: result = df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')

In [11]: result.filename
Out[11]: 'fname1.csv|fname2.csv'

What I think you may want is a multi-level frame where the filename is a level.
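A minimal sketch of the multi-level idea, assuming the two frames are simply placed side by side with pd.concat rather than merged on a key. The level names 'filename' and 'column' are illustrative choices, not something from the thread:

```python
import numpy as np
import pandas as pd

# Same toy frames as in the report above.
np.random.seed(10)
df1 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd'])

# Store the source file as an outer column level instead of an attribute.
# The level names are assumptions made for this example.
combined = pd.concat(
    [df1, df2],
    axis=1,
    keys=['fname1.csv', 'fname2.csv'],
    names=['filename', 'column'],
)

# The origin of every column can be read back from the column index.
origin = dict(zip(combined.columns.get_level_values('column'),
                  combined.columns.get_level_values('filename')))
print(origin)

# Selecting on the outer level returns the columns that came from one file.
print(combined['fname1.csv'])
```

Because the filename lives in the column index, it is part of the data structure itself and does not depend on __finalize__ being called.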

@jreback jreback added this to the 0.14.0 milestone Apr 22, 2014
@wcbeard
Contributor Author

wcbeard commented Apr 22, 2014

Interesting, so when I assign metadata/attributes to each series within a dataframe there's no mechanism to keep it? Metadata is only kept on series when treated independently from a dataframe?

If so, I should still be able to keep dataframe-level metadata as a dict of {Series name: attribute value}, which should still be propagated, correct?
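One way to sketch the dict-of-column-metadata idea without relying on __finalize__ at all is to carry the dict next to the frame and combine it explicitly. The helper name merge_with_sources and its arguments are hypothetical, not a pandas API:

```python
import pandas as pd


def merge_with_sources(left, right, left_sources, right_sources, **merge_kwargs):
    """Merge two frames and return the result plus a {column: filename} dict.

    Hypothetical helper for illustration only; not part of pandas.
    """
    merged = left.merge(right, **merge_kwargs)
    # Combine both mappings; on a name clash the right-hand entry wins.
    sources = dict(left_sources)
    sources.update(right_sources)
    # Keep only entries for columns that are present in the merged frame.
    sources = {col: sources[col] for col in merged.columns if col in sources}
    return merged, sources


# Same values as the frames printed in the report above.
df1 = pd.DataFrame({'a': [1, 0, 0], 'b': [1, 3, 1]})
df2 = pd.DataFrame({'c': [3, 1, 0], 'd': [0, 1, 1]})

merged, sources = merge_with_sources(
    df1, df2,
    {col: 'fname1.csv' for col in df1},
    {col: 'fname2.csv' for col in df2},
    left_on=['a'], right_on=['c'], how='inner',
)
print(sources)
```

This sketch assumes the two frames have no column names in common. Overlapping names would be renamed by merge's suffixes and would then be missing from the returned dict.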

@jreback
Contributor

jreback commented Apr 22, 2014

@d10genes yes, you could do that if you wanted to. The entire problem arises from how to combine them.

Imagine we supported this:

s1.filename='a'
s2.filename='b'

what is (s1+s2).filename?

You might look at https://github.com/kjordahl/geopandas; they did a full subclass of DataFrame/Series and DO propagate metadata.
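As a rough sketch of the subclassing route, assuming a pandas version that supports the _metadata and _constructor subclassing hooks. The class name FileSeries is hypothetical, and this is not how geopandas itself is implemented:

```python
import pandas as pd


class FileSeries(pd.Series):
    """Hypothetical Series subclass that carries a `filename` attribute."""

    # Names listed in _metadata are copied by pandas' default __finalize__.
    _metadata = pd.Series._metadata + ['filename']

    @property
    def _constructor(self):
        # Make pandas operations return FileSeries instead of plain Series.
        return FileSeries


s1 = FileSeries([1, 2, 3])
s1.filename = 'a'
s2 = FileSeries([10, 20, 30])
s2.filename = 'b'

# A single-object operation has an unambiguous source for the metadata.
copied = s1.copy()
print(type(copied).__name__, copied.filename)

# For a binary operation the subclass has to decide what the result carries;
# whatever is printed here is just the default behaviour of the installed
# pandas version, not a deliberate combination rule.
total = s1 + s2
print(getattr(total, 'filename', None))
```

The last line illustrates the question raised above: nothing in the subclass says how 'a' and 'b' should be combined, so a real implementation would need an explicit policy for operations with two inputs.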

@wcbeard
Contributor Author

wcbeard commented Apr 22, 2014

That makes sense, I'll take a look. Thanks.

@wcbeard wcbeard closed this as completed Apr 22, 2014
@jreback
Contributor

jreback commented Apr 22, 2014

gr8 thanks!

2 participants