-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
bug in dataframe.join() #2189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think that's caused by df2 having a non-unique index. When you join non-unique indices the result is a cartesian product (see docs). For example:
|
The non-unique time-index is because the timeseries has discontinuities (continuous time, but steps in the values). This is typically for on/off systems for example. At these discontinuities, I have the same time-index, twice, and two different values. Is there an easy way to obtain the 'expected' behaviour on join? Does combine_first accept non-unique keys? |
combine_first will align the two first so no (can't reindex non-uq). I'm assuming by "expected", you want the duplicated entries to be joined in the order as they appear right? You'll have to do some contortion here AFAIK. If you can assign number of observations per unique time that would work. Otherwise you can do something like join the unique elements and then group the duplicates by time then join each group ignoring the index. If most things are duplicated you might even just group the whole thing. @wesm probably knows a better way if he wants to chime in. |
It would be nice to have an "as of join" where we have the option to match duplicate observations in the order they're observed. The bug you reported here seems to have been fixed by #2197. I ran your test script and the output looks good now:
|
Hi all,
I have a strange issue with pandas 0.9, I think it's a bug. I'm trying to use dataframe.join() and it works well on a random dataframe, but not on a dataframe created from my simulation result. The code below shows that join() on the second dataframe blows up the index and the result is completely wrong.
To run this code you need this file: https://dl.dropbox.com/u/6200325/mydf.dataframe in your work folder. The script below can also be downloaded here: https://dl.dropbox.com/u/6200325/BugJoin.py
This is the result I get:
In [17]: run -i 'C:\Workspace\Python\Tests\BugJoin.py'
Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Empty DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)
After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2012-01-01 00:00:00 to 2023-05-29 15:00:00
Freq: H
Data columns:
0 100000 non-null values
dtypes: float64(1)
Before:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Empty DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 108355 entries, 2010-01-01 00:00:00 to 2011-01-01 00:00:00
Data columns:
SID0000 108355 non-null values
dtypes: float64(1)
After:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4054807 entries, 2010-01-09 16:00:00 to 2010-05-17 15:55:42
Data columns:
SID0000 4054807 non-null values
dtypes: float64(1)
This is the code from the script:
The text was updated successfully, but these errors were encountered: