
PERF: groupby-fillna perf, implement in cython #11296


Closed

squeniart opened this issue Oct 12, 2015 · 10 comments · Fixed by #19673
Labels: Groupby, Missing-data, Performance

Comments

@squeniart

Hello,

I work on a dataframe with a MultiIndex (Date, InputTime), and this dataframe may contain NaN values in its columns (Value, Id). I want to fill values forward, but by Date only, and I can't find any way to do this efficiently. I'm using pandas 0.16.2 and numpy 1.9.2.

Here is the type of dataframe I have:

[image: example dataframe with a (Date, InputTime) MultiIndex and Value/Id columns containing NaNs]

And here is the result I want:

[image: the same dataframe with Value and Id forward-filled within each Date]

So, to fill forward properly by Date, I can use groupby(level=0). The groupby call itself is fast, but the forward fill applied to the grouped dataframe is far too slow.

Here is the code I use to compare a simple forward fill (which doesn't give the expected result but runs very quickly) against a forward fill by Date (which gives the expected result but is far too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(0,100000):
    listdata.append((d.date(), d, 2*i if i%3==1 else np.NaN, i if i%3==1 else np.NaN))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple Fill forward on index
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].ffill()
end = dt.datetime.now()
print "Time to fill forward on index = " + str((end-start).total_seconds()) + " s"

# Fill forward on Date (first level of index)
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].groupby(level=0).ffill()
end = dt.datetime.now()
print "Time to fill forward on Date only = " + str((end-start).total_seconds()) + " s"

Here are the time results I get:

[image: timing output; the simple forward fill takes milliseconds, the groupby forward fill takes orders of magnitude longer]

So, the forward fill on the grouped dataframe is 10,000 times slower than the simple forward fill. I cannot understand why pandas runs so slowly here. I need performance comparable to the simple forward fill, i.e. just a couple of milliseconds.

Could somebody address this performance issue, or suggest a way to perform this kind of operation very efficiently?

Thanks

@jreback
Contributor

jreback commented Oct 12, 2015

this is a dupe of #7895. .ffill is not implemented in cython for groupby operations (though it certainly could be); instead, Python-level code is called on each group.
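
Conceptually (a sketch of the behavior, not the actual internals), the slow path behaves like one Python-level fill per group:

# one Python-level ffill call per group: the grouping itself is fast,
# but the per-group Python work dominates when there are many small groups
result = df.groupby(level=0).apply(lambda g: g.ffill())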

here's an easy way to do this.

In [45]: df
Out[45]: 
                                 Value     Id
Date       InputTime                         
2001-10-06 2001-10-06 05:00:00     NaN    NaN
           2001-10-06 13:00:00       2      1
           2001-10-06 21:00:00     NaN    NaN
2001-10-07 2001-10-07 05:00:00     NaN    NaN
           2001-10-07 13:00:00       8      4
           2001-10-07 21:00:00     NaN    NaN
...                                ...    ...
2093-01-07 2093-01-07 13:00:00  199988  99994
           2093-01-07 21:00:00     NaN    NaN
2093-01-08 2093-01-08 05:00:00     NaN    NaN
           2093-01-08 13:00:00  199994  99997
           2093-01-08 21:00:00     NaN    NaN
2093-01-09 2093-01-09 05:00:00     NaN    NaN

[100000 rows x 2 columns]

# you MUST be sorted
In [46]: df.index.is_lexsorted()
Out[46]: True

In [47]: df3 = df.reset_index()

we take the indexer of the first element of each group. Note you cannot use .first here, as that would skip the NaNs.

In [48]: indexer = df3.groupby('Date',as_index=False).nth(0).index

In [49]: indexer
Out[49]: 
Int64Index([    0,     3,     6,     9,    12,    15,    18,    21,    24,    27,
            ...
            99972, 99975, 99978, 99981, 99984, 99987, 99990, 99993, 99996, 99999], dtype='int64', length=33334)

In [50]: df3.loc[indexer].head()
Out[50]: 
         Date           InputTime  Value  Id
0  2001-10-06 2001-10-06 05:00:00    NaN NaN
3  2001-10-07 2001-10-07 05:00:00    NaN NaN
6  2001-10-08 2001-10-08 05:00:00    NaN NaN
9  2001-10-09 2001-10-09 05:00:00    NaN NaN
12 2001-10-10 2001-10-10 05:00:00    NaN NaN

Set these to a non-NaN sentinel value that doesn't occur in your frame.

In [51]: df3.loc[indexer,'Value'] = -1

Pad the NaNs; since the frame is sorted and a -1 sits in the first row of each group, values will not propagate beyond group boundaries. Then replace the -1 with NaN again.

In [52]: df3.ffill().replace(-1,np.nan)
Out[52]: 
            Date           InputTime   Value     Id
0     2001-10-06 2001-10-06 05:00:00     NaN    NaN
1     2001-10-06 2001-10-06 13:00:00       2      1
2     2001-10-06 2001-10-06 21:00:00       2      1
3     2001-10-07 2001-10-07 05:00:00     NaN      1
4     2001-10-07 2001-10-07 13:00:00       8      4
5     2001-10-07 2001-10-07 21:00:00       8      4
...          ...                 ...     ...    ...
99994 2093-01-07 2093-01-07 13:00:00  199988  99994
99995 2093-01-07 2093-01-07 21:00:00  199988  99994
99996 2093-01-08 2093-01-08 05:00:00     NaN  99994
99997 2093-01-08 2093-01-08 13:00:00  199994  99997
99998 2093-01-08 2093-01-08 21:00:00  199994  99997
99999 2093-01-09 2093-01-09 05:00:00     NaN  99997

[100000 rows x 4 columns]

I applied the sentinel to only one of the two columns, so comparing them shows the difference.

This is completely vectorized and will be quite efficient.
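
Putting In [46]-In [52] together, here is a minimal sketch of the recipe as a reusable function (the key, column, and sentinel names are illustrative; the sentinel must not occur in your data):

import numpy as np

def sentinel_ffill(df, key='Date', col='Value', sentinel=-1):
    # requires a lexsorted index, with the group key as the first level
    out = df.reset_index()
    # positional labels of the first row of each group
    # (.nth(0), unlike .first(), does not skip NaN rows)
    first = out.groupby(key, as_index=False).nth(0).index
    # plant the sentinel so ffill cannot propagate across group boundaries,
    # then restore the NaNs
    out.loc[first, col] = sentinel
    return out.ffill().replace(sentinel, np.nan).set_index(df.index.names)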

@jreback jreback closed this as completed Oct 12, 2015
@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Oct 12, 2015
@jreback
Contributor

jreback commented Oct 12, 2015

actually we don't have a fillna-groupby perf issue on file... so will leave this one open

@jreback jreback reopened this Oct 12, 2015
@jreback jreback added this to the Next Major Release milestone Oct 12, 2015
@jreback jreback changed the title Pandas Fill forward Performance Issue PERF: groupby-fillna perf Oct 12, 2015
@jreback jreback changed the title PERF: groupby-fillna perf PERF: groupby-fillna perf, implement in cython Oct 12, 2015
@squeniart
Author

Hello,

I've tested your workaround, but it can only work on the specific example I gave, because it makes assumptions about where the NaN values are. My example was built to produce a big dataframe for this issue, but it doesn't exactly reflect my real data.

Suppose this very simple dataframe:

[image: small dataframe where the first value of a Date group is not always NaN]

If I apply your code to this example, here is what I get:

[image: workaround output; some non-NaN values have been replaced by NaN]

As you can see, some non-NaN values have been replaced by NaN, which is not expected. And the first value for a given Date is not always NaN, so the code doesn't behave as I would like.

Here is what I want to get after the forward fill:

[image: expected result, with values forward-filled within each Date only]

Furthermore, when I tested your code on the example, I measured times over a second, which is really too slow for my use case :-( Indeed, I need to perform this operation many times in my code.

Thanks for your help

@jreback
Contributor

jreback commented Oct 12, 2015

pls don't post pictures of frames, they are simply not helpful;
instead show code to repro

your frame is likely not sorted

@jreback
Contributor

jreback commented Oct 12, 2015

Fixing the error in the work-around:

In [60]: df
Out[60]: 
                                Value
Date       Input Date                
2015-01-31 2015-02-01 09:00:00    NaN
           2015-02-01 10:00:00   5.00
           2015-03-01 09:00:00   5.25
           2015-03-01 10:00:00    NaN
           2015-04-01 08:00:00   5.50
2015-02-28 2015-03-01 09:00:00   6.00
           2015-03-01 10:00:00    NaN
           2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   6.30
2015-03-01 2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   7.00

In [61]: filler(df)
Out[61]: 
                                Value
Date       Input Date                
2015-01-31 2015-02-01 09:00:00    NaN
           2015-02-01 10:00:00   5.00
           2015-03-01 09:00:00   5.25
           2015-03-01 10:00:00   5.25
           2015-04-01 08:00:00   5.50
2015-02-28 2015-03-01 09:00:00   6.00
           2015-03-01 10:00:00   6.00
           2015-04-01 07:00:00   6.00
           2015-04-01 08:00:00   6.30
2015-03-01 2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   7.00

This is what I mean by copy-pastable

import pandas as pd
import numpy as np

Timestamp = pd.Timestamp
df = pd.DataFrame({'Value' : [np.nan,5,5.25,np.nan,5.5,6,np.nan,np.nan,6.3,np.nan,7]})
df.index = pd.MultiIndex.from_tuples([ (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-02-01 09:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-02-01 10:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-03-01 09:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-03-01 10:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-03-01 09:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-03-01 10:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-04-01 07:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00')),
                                       (Timestamp('2015-03-01 00:00:00', offset='M'), Timestamp('2015-04-01 07:00:00')),
                                       (Timestamp('2015-03-01 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00'))],
                                     names=['Date','Input Date'])


def filler(x):
    names = x.index.names
    x = x.reset_index()
    # positional labels of the first row of each group
    indexer = x.groupby('Date', as_index=False).nth(0).index
    # only plant the sentinel where the first value is actually NaN,
    # so existing leading values are preserved
    indexer = indexer[x.loc[indexer, 'Value'].isnull()]
    x.loc[indexer, 'Value'] = -1
    return x.ffill().replace(-1, np.nan).set_index(names)

And on the original frame

In [75]: %timeit filler(df)
1 loops, best of 3: 634 ms per loop

In [77]: df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2001-10-06 00:00:00, 2001-10-06 05:00:00) to (2093-01-09 00:00:00, 2093-01-09 05:00:00)
Data columns (total 2 columns):
Value    33333 non-null float64
Id       33333 non-null float64
dtypes: float64(2)
memory usage: 3.3+ MB

@squeniart
Author

Sorry for the pics. I understand your point, and I will provide code to recreate the dataframes next time.

I confirm your latest workaround works well; I get the expected results. And the performance is better compared to:

  • a simple row-iteration algorithm that fills NAs except when the Date changes
  • ffill applied to the groupby result

But the performance is still 100 times slower than a simple forward fill and, unfortunately, this prevents me from using pandas in my project (~20-30 ms would be an acceptable time). Do you think this performance issue could be addressed in the near future?

Thank you a lot for your help!

@jreback
Contributor

jreback commented Oct 13, 2015

@squeniart well, pull-requests are accepted.

If you are constantly calling this in a time-sensitive way then you are simply doing it wrong. Use caching or other techniques, or roll your own.

This is very easy to do in numba/cython.
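
For reference, here is a minimal numba sketch of such a roll-your-own group-aware forward fill (this is a sketch, not pandas API; it assumes the frame is sorted by group and that group_ids comes from, e.g., pd.factorize on the Date level):

import numpy as np
from numba import njit

@njit
def group_ffill(group_ids, values):
    # forward-fill `values`, restarting at every change of group id
    out = values.copy()
    prev = np.nan
    for i in range(len(out)):
        if i > 0 and group_ids[i] != group_ids[i - 1]:
            prev = np.nan          # new group: don't fill across the boundary
        if np.isnan(out[i]):
            out[i] = prev          # NA: take the last value seen in this group
        else:
            prev = out[i]          # non-NA: remember it
    return out

It would be called as, e.g., group_ffill(pd.factorize(df.index.get_level_values('Date'))[0], df['Value'].values).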

@vitteloil

Hi, I'm from the same team as @squeniart.

I tried to see where the culprit is in pandas. It appears that when using groupby().fillna(method='ffill'), the cythonized pad_2d code (generated by generate_code.py) IS applied.

The problem is that, in the example from the first message of this issue, the pad_2d_XXX code is called 66668 times, on series with only 3 elements each. All the surrounding code is in Python and is also very slow. Roughly:

  1. grouping operations = 30% of the total time
  2. fillna operations = 30% of the total time
  3. merge&concat operations = 40% of the total time

Figures taken from cProfiling dataframe.ffill(); the profile output is available here: https://drive.google.com/file/d/0B3pyL0DQV74ic1p4dW5FaS12QVE/view?usp=sharing
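
For anyone who wants to reproduce this kind of profile, a minimal invocation over the grouped fill would look something like this (a sketch; the exact call profiled above may differ):

import cProfile

# profile the grouped forward fill, sorted by cumulative time
cProfile.runctx("df.groupby(level=0).ffill()", globals(), locals(), sort="cumtime")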

I wonder if it would be acceptable, in a pull request, to add a method that does what we want to do. This method, for example dataframe.ffill_reset(column), would take a column name as a parameter and would forward-fill all the other columns according to that column: every time the column's value changes, the forward fill stops and resets.

For example, this cython function would be the core of the new ffill_reset() function, and would be broken down for the different data types (from bool to floats, etc.):

from numpy cimport ndarray, uint64_t, float64_t

def xpropagate_float64(ndarray[uint64_t, ndim=1] vdate,
                       ndarray[float64_t, ndim=1] vdata):
    # Note: the `val != val` NaN test below only works for floating-point
    # data, so this is the float64 variant; the int/bool variants would
    # need a different missing-value marker.

    cdef uint64_t dateprev = 0
    cdef uint64_t date
    cdef float64_t valprev = 0
    cdef float64_t val
    cdef Py_ssize_t i
    cdef Py_ssize_t vsize = vdata.shape[0]

    # Go through the date axis and fill NA values forward, in place
    for i in range(vsize):
        date = vdate[i]
        val = vdata[i]

        # New date: remember this value, never fill across the boundary
        if date != dateprev:
            valprev = val
            dateprev = date
            continue

        # Same date as the previous row and the value is NA
        # => fill with the previous value (val != val tests for NaN)
        if val != val:
            vdata[i] = valprev
            continue

        # Same date and the value is not NA => just remember this value
        valprev = val
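
A hypothetical driver for the sketch above (the function name and the conversion are illustrative, not part of pandas): encode the Date level as uint64 nanoseconds, fill a copy of the column, and assign it back.

import numpy as np

# Date level as uint64 nanoseconds since the epoch (assumes datetime64 dates)
dates = df.index.get_level_values('Date').asi8.view(np.uint64)
# fill a copy so the original column is only replaced on success
values = df['Value'].values.copy()
xpropagate_float64(dates, values)
df['Value'] = values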

Thanks for your suggestions and help on the matter.
Henri

@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@WillAyd
Member

WillAyd commented Feb 10, 2018

@jreback while I'm working on Cython optimizations I can take a look at this one. Just curious whether we view the fact that ffill and bfill retain the grouping in its own column as a feature, or as something up for discussion. To illustrate:

In []: df = pd.DataFrame({'key': ['a']*5, 'val': range(5)})
In []: df.groupby('key').rank()
Out []:
   val
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

In []: df.groupby('key').ffill()  # retains key in output; same for bfill
Out []:
  key  val
0   a    0
1   a    1
2   a    2
3   a    3
4   a    4

If bfill and ffill didn't return the grouping in a fashion similar to rank, then we could leverage the same call signatures as the rest of the transformations.

@WillAyd WillAyd mentioned this issue Feb 13, 2018
@alonme
Contributor

alonme commented Mar 21, 2022

Hey @jreback - I found this issue through this SO answer: https://stackoverflow.com/a/43251227/7581507.

I suspected that the proposed optimization (based on your code, AFAIU) wouldn't make a big difference, since I can see that the related cythonization has been added to the code.

However, using the discussed method, I got a speedup from 8 minutes to ~2 seconds. Is this still expected to make such a big difference? Can we do anything to speed up groupby.ffill?
