bug in dataframe.join() #2189

Closed
@saroele

Hi all,

I have a strange issue with pandas 0.9 that I think is a bug. I'm trying to use dataframe.join(). It works well on a random dataframe, but not on a dataframe created from my simulation result. The output below shows that join() on the second dataframe blows up the index from 108355 to 4054807 entries, and the result is completely wrong.

To run this code, you need this file in your working folder: https://dl.dropbox.com/u/6200325/mydf.dataframe. The script below can also be downloaded here: https://dl.dropbox.com/u/6200325/BugJoin.py

This is the result I get:

In [17]: run -i 'C:\Workspace\Python\Tests\BugJoin.py'

Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Empty DataFrame

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)

After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)

Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Empty DataFrame

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Data columns:
SID0000 108355 non-null values
dtypes: float64(1)

After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4054807 entries, 2010-01-09 16:00:00 to 2010-05-17 15:55:42
Data columns:
SID0000 4054807 non-null values
dtypes: float64(1)
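
Row growth of this kind is what an index-based join produces when the index holds repeated labels: a label that occurs m times on the left and n times on the right contributes m × n rows. Whether mydf.dataframe actually contains duplicate timestamps is an assumption here, not something this report confirms. A minimal sketch with made-up data, written for current pandas on Python 3 rather than 0.9:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one timestamp occurs three times, and the last row is NaN.
idx = pd.DatetimeIndex(['2010-01-01 00:00'] * 3 + ['2010-01-01 01:00'])
frame = pd.DataFrame({'SID0000': [1.0, 2.0, 3.0, np.nan]}, index=idx)

# Same shape as the script: an empty frame on the full index,
# joined with a column that had its NaN rows dropped.
cum = pd.DataFrame(index=frame.index)
tscum = frame['SID0000'].dropna().to_frame()

joined = cum.join(tscum, how='left')

print(cum.index.is_unique)  # False
print(len(cum))             # 4
print(len(joined))          # 10: 3 * 3 rows for the repeated label, 1 for the other
```

With a unique index on both sides, the same left join returns exactly len(cum) rows, which matches the first (random) dataframe in the output above.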

This is the code from the script:

import numpy as np
import pandas as pd
from scipy.integrate import cumtrapz

# use integer sizes; newer numpy/pandas versions reject a float such as 1e5 here
df1=pd.DataFrame(np.random.rand(100000),
     index=pd.date_range('2012-01-01', freq='H', periods=100000))

df2=pd.load('mydf.dataframe')

for dataframe in [df1, df2]:

    cum = pd.DataFrame(index=dataframe.index)
    for c in dataframe.columns:
        # we need to remove the empty values for the cumtrapz function to work
        ts = dataframe[c].dropna()

        tscum = pd.DataFrame(data=cumtrapz(ts.values, ts.index.asi8/1e9, initial=0),
                         index=ts.index, 
                         columns=[c])
        print '\nBefore:'
        print cum, '\n'
        print tscum, '\n'

        cum=cum.join(tscum, how='left')

        print 'After:'
        print cum
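
A sketch of how the index could be checked before the join, and one possible workaround, assuming repeated timestamps are the cause (not confirmed in this report). The helper name `dedupe_index` and the sample data are hypothetical, and the code targets current pandas on Python 3. Keeping the first row per timestamp is only one choice; averaging the duplicates may suit simulation output better.

```python
import pandas as pd

def dedupe_index(frame):
    """Keep only the first row for each repeated index label (hypothetical helper)."""
    return frame[~frame.index.duplicated(keep='first')]

# Made-up data: the first timestamp occurs twice.
idx = pd.DatetimeIndex(['2010-01-01 00:00', '2010-01-01 00:00',
                        '2010-01-01 01:00'])
df = pd.DataFrame({'SID0000': [1.0, 2.0, 3.0]}, index=idx)

# Diagnostics: is the index unique, and how many labels are repeats?
print(df.index.is_unique)                    # False
n_dupes = int(df.index.duplicated().sum())
print(n_dupes)                               # 1

# After deduplication, the left join keeps one row per timestamp.
clean = dedupe_index(df)
cum = pd.DataFrame(index=clean.index)
joined = cum.join(clean, how='left')
print(len(joined))                           # 2
```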
