Description
Hi all,
I have a strange issue with pandas 0.9, I think it's a bug. I'm trying to use dataframe.join() and it works well on a random dataframe, but not on a dataframe created from my simulation result. The code below shows that join() on the second dataframe blows up the index and the result is completely wrong.
To run this code you need this file: https://dl.dropbox.com/u/6200325/mydf.dataframe in your work folder. The script below can also be downloaded here: https://dl.dropbox.com/u/6200325/BugJoin.py
This is the result I get:
In [17]: run -i 'C:\Workspace\Python\Tests\BugJoin.py'
Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Empty DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)
After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)
Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Empty DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Data columns:
SID0000 108355 non-null values
dtypes: float64(1)
After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4054807 entries, 2010-01-09 16:00:00 to 2010-05-17 15:55:42
Data columns:
SID0000 4054807 non-null values
dtypes: float64(1)
This is the code from the script:
import numpy as np
import pandas as pd
from scipy.integrate import cumtrapz
df1=pd.DataFrame(np.random.rand(1e5),
index=pd.date_range('2012-01-01', freq='H', periods=1e5))
df2=pd.load('mydf.dataframe')
for dataframe in [df1, df2]:
cum = pd.DataFrame(index=dataframe.index)
for c in dataframe.columns:
# we need to remove the empty values for the cumtrapz function to work
ts = dataframe[c].dropna()
tscum = pd.DataFrame(data=cumtrapz(ts.values, ts.index.asi8/1e9, initial=0),
index=ts.index,
columns=[c])
print '\nBefore:'
print cum, '\n'
print tscum, '\n'
cum=cum.join(tscum, how='left')
print 'After:'
print cum