bug in dataframe.join() #2189

Closed
saroele opened this issue Nov 7, 2012 · 4 comments

saroele commented Nov 7, 2012

Hi all,

I have a strange issue with pandas 0.9 that I think is a bug. I'm trying to use DataFrame.join(), and it works well on a randomly generated dataframe, but not on a dataframe created from my simulation result. The code below shows that join() on the second dataframe blows up the index, and the result is completely wrong.

To run this code you need this file in your working folder: https://dl.dropbox.com/u/6200325/mydf.dataframe. The script below can also be downloaded here: https://dl.dropbox.com/u/6200325/BugJoin.py

This is the result I get:

In [17]: run -i 'C:\Workspace\Python\Tests\BugJoin.py'

Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Empty DataFrame

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)

After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)

Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Empty DataFrame

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Data columns:
SID0000 108355 non-null values
dtypes: float64(1)

After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4054807 entries, 2010-01-09 16:00:00 to 2010-05-17 15:55:42
Data columns:
SID0000 4054807 non-null values
dtypes: float64(1)

This is the code from the script:

import numpy as np
import pandas as pd
from scipy.integrate import cumtrapz

df1 = pd.DataFrame(np.random.rand(1e5),
                   index=pd.date_range('2012-01-01', freq='H', periods=1e5))

df2 = pd.load('mydf.dataframe')

for dataframe in [df1, df2]:

    cum = pd.DataFrame(index=dataframe.index)
    for c in dataframe.columns:
        # we need to remove the empty values for the cumtrapz function to work
        ts = dataframe[c].dropna()

        # cumulative trapezoidal integration over time; the index is
        # converted from nanoseconds to seconds
        tscum = pd.DataFrame(data=cumtrapz(ts.values, ts.index.asi8 / 1e9, initial=0),
                             index=ts.index,
                             columns=[c])
        print '\nBefore:'
        print cum, '\n'
        print tscum, '\n'

        cum = cum.join(tscum, how='left')

        print 'After:'
        print cum
@changhiskhan (Contributor)

I think that's caused by df2 having a non-unique index. When you join dataframes on non-unique indices, the result is a Cartesian product of the matching rows (see the docs).

For example:

In [8]: from numpy.random import rand

In [9]: from pandas import DataFrame

In [10]: idx = ['a', 'a', 'b', 'b']

In [11]: df1 = DataFrame(rand(len(idx), 2), index=idx, columns=['A', 'B'])

In [12]: df2 = DataFrame(rand(len(idx), 2), index=idx, columns=['X', 'Y'])

In [13]: df1.join(df2, how='left')
Out[13]: 
          A         B         X         Y
a  0.522150  0.275144  0.625306  0.502149
a  0.522150  0.275144  0.770039  0.441860
a  0.483269  0.450600  0.625306  0.502149
a  0.483269  0.450600  0.770039  0.441860
b  0.812180  0.448433  0.758224  0.661106
b  0.812180  0.448433  0.293415  0.762360
b  0.952718  0.799325  0.758224  0.661106
b  0.952718  0.799325  0.293415  0.762360

@saroele (Author)

saroele commented Nov 7, 2012

The non-unique time index arises because the timeseries has discontinuities (continuous time, but steps in the values); this is typical for on/off systems, for example. At each discontinuity the same timestamp appears twice, with two different values.
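
For illustration, a minimal hypothetical series with such a step discontinuity might look like this (timestamps and values made up):

import pandas as pd

# Hypothetical on/off step at 00:01: that timestamp occurs twice, once with
# the value just before the switch and once with the value just after.
idx = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 00:01',
                      '2012-01-01 00:01', '2012-01-01 00:02'])
ts = pd.Series([0.0, 0.0, 1.0, 1.0], index=idx)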

Is there an easy way to obtain the 'expected' behaviour on join? Does combine_first accept non-unique keys?

@changhiskhan (Contributor)

combine_first aligns the two frames first, so no (you can't reindex a non-unique index).

I'm assuming that by "expected" you want the duplicated entries to be joined in the order in which they appear, right? You'll have to do some contortion here, AFAIK. If you can assign an observation number per unique time, that would work (see the sketch below). Otherwise you can join the unique elements first, then group the duplicates by time and join each group ignoring the index. If most entries are duplicated, you might just group the whole thing.
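
A minimal sketch of the observation-number idea, using hypothetical frames; note that groupby(...).cumcount() only exists in later pandas versions:

import pandas as pd

# Hypothetical frames sharing duplicated timestamps.
idx = pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-02'])
left = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=idx)
right = pd.DataFrame({'B': [10.0, 20.0, 30.0]}, index=idx)

# Number each observation within a duplicated timestamp, then merge on
# (timestamp, occurrence) so duplicates pair up in order instead of
# producing a Cartesian product.
left['occ'] = left.groupby(level=0).cumcount()
right['occ'] = right.groupby(level=0).cumcount()

merged = (left.reset_index()
              .merge(right.reset_index(), on=['index', 'occ'], how='left')
              .set_index('index')
              .drop('occ', axis=1))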

@wesm probably knows a better way if he wants to chime in.

@wesm (Member)

wesm commented Nov 9, 2012

It would be nice to have an "as of join" where we have the option to match duplicate observations in the order they're observed.
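
For reference, later pandas versions added pd.merge_asof, which implements this kind of as-of join by matching each left row to the most recent right row at or before its key (it does not pair duplicates in order, though). A small sketch with made-up data:

import pandas as pd

trades = pd.DataFrame({'time': pd.to_datetime(['2012-11-09 10:00:01',
                                               '2012-11-09 10:00:03']),
                       'price': [51.95, 52.00]})
quotes = pd.DataFrame({'time': pd.to_datetime(['2012-11-09 10:00:00',
                                               '2012-11-09 10:00:02']),
                       'bid': [51.93, 51.97]})

# Each trade picks up the most recent quote at or before its time.
print(pd.merge_asof(trades, quotes, on='time'))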

The bug you reported here seems to have been fixed by #2197. I ran your test script and the output looks good now:

In [4]: cum
Out[4]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 229049 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Data columns:
SID0000    229049  non-null values
dtypes: float64(1)

@wesm wesm closed this as completed Nov 9, 2012