
PERF: groupby-fillna perf, implement in cython #11296


Closed

squeniart opened this issue Oct 12, 2015 · 10 comments · Fixed by #19673
Labels: Groupby, Missing-data, Performance

Comments

@squeniart

Hello,

I work on a dataframe with a MultiIndex (Date, InputTime), and this dataframe may contain NaN values in its columns (Value, Id). I want to fill values forward, but by Date only, and I can't find any way to do this efficiently. I'm using pandas 0.16.2 and numpy 1.9.2.

Here is the type of dataframe I have:

[image: example dataframe with a (Date, InputTime) MultiIndex and Value/Id columns containing NaNs]

And here is the result I want:

[image: the same dataframe with Value and Id forward-filled within each Date]

So, to fill forward properly by Date, I can use groupby(level=0). The groupby call itself is fast, but the forward fill applied to the grouped dataframe is far too slow.

Here is the code I use to compare a simple forward fill (which doesn't give the expected result but runs very quickly) against a forward fill by Date (which gives the expected result but is far too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(0,100000):
    listdata.append((d.date(), d, 2*i if i%3==1 else np.NaN, i if i%3==1 else np.NaN))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple Fill forward on index
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].ffill()
end = dt.datetime.now()
print "Time to fill forward on index = " + str((end-start).total_seconds()) + " s"

# Fill forward on Date (first level of index)
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].groupby(level=0).ffill()
end = dt.datetime.now()
print "Time to fill forward on Date only = " + str((end-start).total_seconds()) + " s"

Here are the time results I get:

[image: timing output; the simple forward fill takes milliseconds, the groupby forward fill takes orders of magnitude longer]

So, the forward fill on the grouped dataframe is 10,000 times slower than the simple forward fill. I cannot understand why pandas runs so slowly here. I need performance comparable to the simple forward fill, i.e. just a couple of milliseconds.

Could somebody address this performance issue, or suggest a way to perform this kind of operation very efficiently?

Thanks

@jreback
Contributor

jreback commented Oct 12, 2015

this is a dupe of #7895. .ffill is not implemented in cython for groupby operations (though it certainly could be); instead, Python-level code is called on each group.
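
Conceptually (a sketch of the behavior, not the actual internals), the slow path behaves like one Python-level fill per group:

# one Python-level ffill call per group: the grouping itself is fast,
# but the per-group Python work dominates when there are many small groups
result = df.groupby(level=0).apply(lambda g: g.ffill())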

here's an easy way to do this.

In [45]: df
Out[45]: 
                                 Value     Id
Date       InputTime                         
2001-10-06 2001-10-06 05:00:00     NaN    NaN
           2001-10-06 13:00:00       2      1
           2001-10-06 21:00:00     NaN    NaN
2001-10-07 2001-10-07 05:00:00     NaN    NaN
           2001-10-07 13:00:00       8      4
           2001-10-07 21:00:00     NaN    NaN
...                                ...    ...
2093-01-07 2093-01-07 13:00:00  199988  99994
           2093-01-07 21:00:00     NaN    NaN
2093-01-08 2093-01-08 05:00:00     NaN    NaN
           2093-01-08 13:00:00  199994  99997
           2093-01-08 21:00:00     NaN    NaN
2093-01-09 2093-01-09 05:00:00     NaN    NaN

[100000 rows x 2 columns]

# you MUST be sorted
In [46]: df.index.is_lexsorted()
Out[46]: True

In [47]: df3 = df.reset_index()

we take the indexer of the first element of each group. Note you cannot use .first here, as that would skip the NaNs.

In [48]: indexer = df3.groupby('Date',as_index=False).nth(0).index

In [49]: indexer
Out[49]: 
Int64Index([    0,     3,     6,     9,    12,    15,    18,    21,    24,    27,
            ...
            99972, 99975, 99978, 99981, 99984, 99987, 99990, 99993, 99996, 99999], dtype='int64', length=33334)

In [50]: df3.loc[indexer].head()
Out[50]: 
         Date           InputTime  Value  Id
0  2001-10-06 2001-10-06 05:00:00    NaN NaN
3  2001-10-07 2001-10-07 05:00:00    NaN NaN
6  2001-10-08 2001-10-08 05:00:00    NaN NaN
9  2001-10-09 2001-10-09 05:00:00    NaN NaN
12 2001-10-10 2001-10-10 05:00:00    NaN NaN

Set these to a non-NaN sentinel value that doesn't occur in your frame.

In [51]: df3.loc[indexer,'Value'] = -1

Pad the NaNs; since the frame is sorted and a -1 sits in the first row of each group, values will not propagate beyond group boundaries. Then replace the -1 with NaN again.

In [52]: df3.ffill().replace(-1,np.nan)
Out[52]: 
            Date           InputTime   Value     Id
0     2001-10-06 2001-10-06 05:00:00     NaN    NaN
1     2001-10-06 2001-10-06 13:00:00       2      1
2     2001-10-06 2001-10-06 21:00:00       2      1
3     2001-10-07 2001-10-07 05:00:00     NaN      1
4     2001-10-07 2001-10-07 13:00:00       8      4
5     2001-10-07 2001-10-07 21:00:00       8      4
...          ...                 ...     ...    ...
99994 2093-01-07 2093-01-07 13:00:00  199988  99994
99995 2093-01-07 2093-01-07 21:00:00  199988  99994
99996 2093-01-08 2093-01-08 05:00:00     NaN  99994
99997 2093-01-08 2093-01-08 13:00:00  199994  99997
99998 2093-01-08 2093-01-08 21:00:00  199994  99997
99999 2093-01-09 2093-01-09 05:00:00     NaN  99997

[100000 rows x 4 columns]

I applied the sentinel to only one of the two columns, so comparing them shows the difference.

This is completely vectorized and will be quite efficient.
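
Putting In [46]-In [52] together, here is a minimal sketch of the recipe as a reusable function (the key, column, and sentinel names are illustrative; the sentinel must not occur in your data):

import numpy as np

def sentinel_ffill(df, key='Date', col='Value', sentinel=-1):
    # requires a lexsorted index, with the group key as the first level
    out = df.reset_index()
    # positional labels of the first row of each group
    # (.nth(0), unlike .first(), does not skip NaN rows)
    first = out.groupby(key, as_index=False).nth(0).index
    # plant the sentinel so ffill cannot propagate across group boundaries,
    # then restore the NaNs
    out.loc[first, col] = sentinel
    return out.ffill().replace(sentinel, np.nan).set_index(df.index.names)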

@jreback jreback closed this as completed Oct 12, 2015
@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Oct 12, 2015
@jreback
Contributor

jreback commented Oct 12, 2015

actually we don't have a fillna-groupby perf issue on file... so will leave this one open

@jreback jreback reopened this Oct 12, 2015
@jreback jreback added this to the Next Major Release milestone Oct 12, 2015
@jreback jreback changed the title Pandas Fill forward Performance Issue PERF: groupby-fillna perf Oct 12, 2015
@jreback jreback changed the title PERF: groupby-fillna perf PERF: groupby-fillna perf, implement in cython Oct 12, 2015
@squeniart
Author

Hello,

I've tested your workaround, but it can only work on the specific example I gave, because it makes assumptions about where the NaN values are. My example was built to produce a big dataframe for this issue, but it doesn't exactly reflect my real data.

Suppose this very simple dataframe:

[image: small dataframe where the first value of a Date group is not always NaN]

If I apply your code to this example, here is what I get:

[image: workaround output; some non-NaN values have been replaced by NaN]

As you can see, some non-NaN values have been replaced by NaN, which is not expected. And the first value for a given Date is not always NaN, so the code doesn't behave as I would like.

Here is what I want to get after the forward fill:

[image: expected result, with values forward-filled within each Date only]

Furthermore, when I tested your code on the example, I measured times over a second, which is really too slow for my use case :-( Indeed, I need to perform this operation many times in my code.

Thanks for your help

@jreback
Contributor

jreback commented Oct 12, 2015

pls don't post pictures of frames, they are simply not helpful;
instead show code to repro

your frame is likely not sorted

@jreback
Contributor

jreback commented Oct 12, 2015

Fixing the error in the work-around:

In [60]: df
Out[60]: 
                                Value
Date       Input Date                
2015-01-31 2015-02-01 09:00:00    NaN
           2015-02-01 10:00:00   5.00
           2015-03-01 09:00:00   5.25
           2015-03-01 10:00:00    NaN
           2015-04-01 08:00:00   5.50
2015-02-28 2015-03-01 09:00:00   6.00
           2015-03-01 10:00:00    NaN
           2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   6.30
2015-03-01 2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   7.00

In [61]: filler(df)
Out[61]: 
                                Value
Date       Input Date                
2015-01-31 2015-02-01 09:00:00    NaN
           2015-02-01 10:00:00   5.00
           2015-03-01 09:00:00   5.25
           2015-03-01 10:00:00   5.25
           2015-04-01 08:00:00   5.50
2015-02-28 2015-03-01 09:00:00   6.00
           2015-03-01 10:00:00   6.00
           2015-04-01 07:00:00   6.00
           2015-04-01 08:00:00   6.30
2015-03-01 2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   7.00

This is what I mean by copy-pastable

import pandas as pd
import numpy as np

Timestamp = pd.Timestamp
df = pd.DataFrame({'Value' : [np.nan,5,5.25,np.nan,5.5,6,np.nan,np.nan,6.3,np.nan,7]})
df.index = pd.MultiIndex.from_tuples([ (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-02-01 09:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-02-01 10:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-03-01 09:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-03-01 10:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-03-01 09:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-03-01 10:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-04-01 07:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00')),
                                       (Timestamp('2015-03-01 00:00:00', offset='M'), Timestamp('2015-04-01 07:00:00')),
                                       (Timestamp('2015-03-01 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00'))],
                                     names=['Date','Input Date'])


def filler(x):
    names = x.index.names
    x = x.reset_index()
    # positional labels of the first row of each group
    indexer = x.groupby('Date', as_index=False).nth(0).index
    # only plant the sentinel where the first value is actually NaN,
    # so existing leading values are preserved
    indexer = indexer[x.loc[indexer, 'Value'].isnull()]
    x.loc[indexer, 'Value'] = -1
    return x.ffill().replace(-1, np.nan).set_index(names)

And on the original frame

In [75]: %timeit filler(df)
1 loops, best of 3: 634 ms per loop

In [77]: df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2001-10-06 00:00:00, 2001-10-06 05:00:00) to (2093-01-09 00:00:00, 2093-01-09 05:00:00)
Data columns (total 2 columns):
Value    33333 non-null float64
Id       33333 non-null float64
dtypes: float64(2)
memory usage: 3.3+ MB

@squeniart
Author

Sorry for the pics. I understand your point, and I will provide code to recreate the dataframes next time.

I confirm your latest workaround works well; I get the expected results. And the performance is better compared to:

  • a simple row-iteration algorithm that fills NAs except when the Date changes
  • ffill applied to the groupby result

But the performance is still 100 times slower than a simple forward fill and, unfortunately, this prevents me from using pandas in my project (~20-30 ms would be an acceptable time). Do you think this performance issue could be addressed in the near future?

Thank you a lot for your help!

@jreback
Contributor

jreback commented Oct 13, 2015

@squeniart well, pull-requests are accepted.

If you are constantly calling this in a time-sensitive way then you are simply doing it wrong. Use caching or other techniques, or roll your own.

This is very easy to do in numba/cython.
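
For reference, here is a minimal numba sketch of such a roll-your-own group-aware forward fill (this is a sketch, not pandas API; it assumes the frame is sorted by group and that group_ids comes from, e.g., pd.factorize on the Date level):

import numpy as np
from numba import njit

@njit
def group_ffill(group_ids, values):
    # forward-fill `values`, restarting at every change of group id
    out = values.copy()
    prev = np.nan
    for i in range(len(out)):
        if i > 0 and group_ids[i] != group_ids[i - 1]:
            prev = np.nan          # new group: don't fill across the boundary
        if np.isnan(out[i]):
            out[i] = prev          # NA: take the last value seen in this group
        else:
            prev = out[i]          # non-NA: remember it
    return out

It would be called as, e.g., group_ffill(pd.factorize(df.index.get_level_values('Date'))[0], df['Value'].values).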

@vitteloil

Hi, I'm from the same team as @squeniart.

I tried to see where the culprit is in pandas. It appears that when using groupby().fillna(method='ffill'), the cythonized pad_2d code (generated by generate_code.py) IS applied.

The problem is that, in the example from the first message of this issue, the pad_2d_XXX code is called 66668 times, on series with only 3 elements each. All the surrounding code is in Python and is also very slow. Roughly:

  1. grouping operations = 30% of the total time
  2. fillna operations = 30% of the total time
  3. merge&concat operations = 40% of the total time

Figures taken from cProfiling dataframe.ffill(); the profile output is available here: https://drive.google.com/file/d/0B3pyL0DQV74ic1p4dW5FaS12QVE/view?usp=sharing
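
For anyone who wants to reproduce this kind of profile, a minimal invocation over the grouped fill would look something like this (a sketch; the exact call profiled above may differ):

import cProfile

# profile the grouped forward fill, sorted by cumulative time
cProfile.runctx("df.groupby(level=0).ffill()", globals(), locals(), sort="cumtime")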

I wonder if it would be acceptable, in a pull request, to add a method that does what we want to do. This method, for example dataframe.ffill_reset(column), would take a column name as a parameter and would forward-fill all the other columns according to that column: every time the column's value changes, the forward fill stops and resets.

For example, this cython function would be the core of the new ffill_reset() function, and would be broken down for the different data types (from bool to floats, etc.):

from numpy cimport ndarray, uint64_t, float64_t

def xpropagate_float64(ndarray[uint64_t, ndim=1] vdate,
                       ndarray[float64_t, ndim=1] vdata):
    # Note: the `val != val` NaN test below only works for floating-point
    # data, so this is the float64 variant; the int/bool variants would
    # need a different missing-value marker.

    cdef uint64_t dateprev = 0
    cdef uint64_t date
    cdef float64_t valprev = 0
    cdef float64_t val
    cdef Py_ssize_t i
    cdef Py_ssize_t vsize = vdata.shape[0]

    # Go through the date axis and fill NA values forward, in place
    for i in range(vsize):
        date = vdate[i]
        val = vdata[i]

        # New date: remember this value, never fill across the boundary
        if date != dateprev:
            valprev = val
            dateprev = date
            continue

        # Same date as the previous row and the value is NA
        # => fill with the previous value (val != val tests for NaN)
        if val != val:
            vdata[i] = valprev
            continue

        # Same date and the value is not NA => just remember this value
        valprev = val
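
A hypothetical driver for the sketch above (the function name and the conversion are illustrative, not part of pandas): encode the Date level as uint64 nanoseconds, fill a copy of the column, and assign it back.

import numpy as np

# Date level as uint64 nanoseconds since the epoch (assumes datetime64 dates)
dates = df.index.get_level_values('Date').asi8.view(np.uint64)
# fill a copy so the original column is only replaced on success
values = df['Value'].values.copy()
xpropagate_float64(dates, values)
df['Value'] = values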

Thanks for your suggestions and help on the matter.
Henri

@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@WillAyd
Member

WillAyd commented Feb 10, 2018

@jreback while I'm working on Cython optimizations I can take a look at this one. Just curious whether we view the fact that ffill and bfill retain the grouping in its own column as a feature, or as something up for discussion. To illustrate:

In []: df = pd.DataFrame({'key': ['a']*5, 'val': range(5)})
In []: df.groupby('key').rank()
Out []:
   val
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

In []: df.groupby('key').ffill()  # retains key in output; same for bfill
Out []:
  key  val
0   a    0
1   a    1
2   a    2
3   a    3
4   a    4

If bfill and ffill didn't return the grouping in a fashion similar to rank, then we could leverage the same call signatures as the rest of the transformations.

@WillAyd WillAyd mentioned this issue Feb 13, 2018
@alonme
Contributor

alonme commented Mar 21, 2022

Hey @jreback - I found this issue through this SO answer: https://stackoverflow.com/a/43251227/7581507.

I suspected that the proposed optimization (based on your code, AFAIU) wouldn't make a big difference, since I can see that the related cythonization has been added to the code.

However, using the discussed method, I got a speedup from 8 minutes to ~2 seconds. Is this still expected to make such a big difference? Can we do anything to speed up groupby.ffill?
