PERF: groupby-fillna perf, implement in cython #11296
This is a dupe of #7895. Here's an easy way to do this: take the indexer of the first element of each group, set those positions to a non-NaN value that you don't have in your frame, then pad the frame. I did this on both columns to show the difference. This is completely vectorized and will be quite efficient.
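A minimal sketch of the sentinel approach described above, on a made-up frame (the column name, values, and sentinel here are hypothetical, not from the thread):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: MultiIndex (Date, InputTime), NaNs scattered in 'Value'
df = pd.DataFrame(
    {"Value": [np.nan, 1.0, np.nan, np.nan, 2.0, np.nan]},
    index=pd.MultiIndex.from_product(
        [pd.to_datetime(["2015-01-01", "2015-01-02"]), [1, 2, 3]],
        names=["Date", "InputTime"],
    ),
)

SENTINEL = -9999.0  # any value known not to occur in the real data

# positions of the first row of each Date group
dates = df.index.get_level_values("Date")
first_pos = dates.searchsorted(dates.unique())

values = df["Value"].to_numpy().copy()
is_first = np.zeros(len(values), dtype=bool)
is_first[first_pos] = True

# mark NaNs at group starts with the sentinel so the pad cannot cross groups
values[is_first & np.isnan(values)] = SENTINEL
filled = pd.Series(values, index=df.index).ffill()
filled[filled == SENTINEL] = np.nan  # restore the NaNs at group starts
```

Only a single vectorized `ffill` runs over the whole frame; the sentinel rows act as fences between groups.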
Actually we don't have a fillna-groupby perf issue... so will leave this one open.
Hello, I've tested your workaround, but it only works on the specific example I gave because it makes assumptions about where the NaN values are. My example was built to produce a big dataframe for the issue, but it doesn't exactly reflect my real data.

Suppose the very simple dataframe: If I apply your code to this example, below is what I get: as you can see, some non-NaN values have been replaced by NaN values, which is not expected. And the first value for a given date is not always NaN, so the code doesn't behave as I would like. Here is what I want to get after the fill forward:

Furthermore, when I tested your code on the example, I measured times over a second, which is really too slow for my use case :-( Indeed, I need to perform this operation many times in my code. Thanks for your help.
Pls don't post pictures of frames, they are simply not helpful. Your frame is likely not sorted.
Error in the work-around
This is what I mean by copy-pastable
And on the original frame
Sorry for the pics. I understand your point and will provide the code to let you easily create the dataframes next time. I confirm your last workaround works well, since I get the expected results. And the performance is better compared to:
But the performance is still 100 times slower than a simple fill forward, and unfortunately this is preventing me from using pandas in my project (~20/30 ms would be an acceptable time). Do you think this performance issue could be addressed in the near future? Thank you a lot for your help!
@squeniart well, pull requests are accepted. If you are constantly calling this in a time-sensitive way then you are simply doing it wrong. Use caching or other techniques, or roll your own. This is very easy to do in numba/cython.
Hi, I'm from the same team as @squeniart. I tried to see where the culprit is in pandas. It appears that when using `groupby().fillna(method='ffill')`, the cythonized code `pad_2d` (generated by generate_code.py) IS applied. The problem is that, in the particular example in the first message of this issue, the `pad_2d_XXX` code is called 66668 times, on series with only 3 elements each. All the surrounding dispatch code is in Python and is also very slow.

Figures taken from cProfiling `dataframe.ffill()`; profile output available here: https://drive.google.com/file/d/0B3pyL0DQV74ic1p4dW5FaS12QVE/view?usp=sharing

I wonder if it would be acceptable, in a pull request, to add a method that does what we want. This method, for example `dataframe.ffill_reset(Column)`, would take a column name as a parameter and fill forward all the other columns according to this Column argument: every time the Column value changes, the fill forward stops and resets. For example, this cython function would be the core of the new `ffill_reset()` function and would be broken down for the different data types (from bool to floats, etc.):

```cython
def xpropagate_float64(ndarray[uint64_t, ndim=1] vdate,
                       ndarray[float64_t, ndim=1] vdata):
    # float64 variant shown: the `val != val` NaN test only works for floats
    cdef Py_ssize_t i
    cdef Py_ssize_t vsize = (<object> vdata).size
    cdef uint64_t date, dateprev = 0
    cdef float64_t val, valprev = NAN

    # Go through the date axis and fill NA values forward within each date
    for i in range(vsize):
        date = vdate[i]
        val = vdata[i]
        if date != dateprev:
            # new date: reset the fill and remember this value
            valprev = val
            dateprev = date
            continue
        # same date as the one before and the value is NA
        # (val != val is how we test for NaN) => fill with previous value
        if val != val:
            vdata[i] = valprev
            continue
        # same date as the one before and the value is not NA
        # => just keep this value in mind
        valprev = val
```

Thanks for your suggestions and help on the matter.
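As a rough check of the same logic, here is a pure-NumPy sketch of the reset-on-key-change forward fill (the helper name `propagate_by_key` is hypothetical, not a pandas API):

```python
import numpy as np

def propagate_by_key(keys, values):
    """Forward-fill NaNs in `values`, restarting whenever the key changes.

    keys and values are 1-D arrays of equal length; returns a new float array.
    """
    values = values.astype(float).copy()
    valid = ~np.isnan(values)

    # index of the most recent valid entry at or before each position
    idx = np.where(valid, np.arange(len(values)), 0)
    idx = np.maximum.accumulate(idx)

    # index of the first entry of the current key run
    new_run = np.empty(len(keys), dtype=bool)
    new_run[0] = True
    new_run[1:] = keys[1:] != keys[:-1]
    run_start = np.maximum.accumulate(
        np.where(new_run, np.arange(len(keys)), 0))

    # fill from the last valid entry, but only within the same key run
    filled = values[idx]
    filled[idx < run_start] = np.nan
    return filled
```

For `keys = [1, 1, 1, 2, 2, 2]` and `values = [nan, 1, nan, nan, 2, nan]` this yields `[nan, 1, 1, nan, 2, 2]`: the fill restarts at each new key, matching the cython sketch above.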
@jreback while I'm working on Cython optimizations I can take a look at this one. Just curious if we view the fact that

```python
In []: df = pd.DataFrame({'key': ['a']*5, 'val': range(5)})

In []: df.groupby('key').rank()
Out[]:
   val
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

In []: df.groupby('key').ffill()  # retains key in output; same for bfill
Out[]:
  key  val
0   a    0
1   a    1
2   a    2
3   a    3
4   a    4
```
Hey @jreback - I found this issue through this SO comment: https://stackoverflow.com/a/43251227/7581507. I suspected that the proposed optimization (based on your code AFAIU) wouldn't make a big difference, as I see here that the related cythonization has been added to the code. However, using the discussed method, I got a speedup from 8 minutes to ~2 seconds.
Hello,
I work on a dataframe with a multi-index (Date, InputTime), and this dataframe may contain some NaN values in the columns (Value, Id). I want to fill values forward, but by Date only, and I can't find any way to do this efficiently. I'm using pandas 0.16.2 and numpy 1.9.2.
Here is the type of dataframe I have:
Dataframe example

And here is the result I want:
So to properly fill forward by date I can use the groupby(level=0) function. The groupby call is fast, but the fill forward applied to the grouped dataframe is really too slow.
Here is the code I use to compare a simple fill forward (which doesn't give the expected result but runs very quickly) with the expected fill forward by date (which gives the expected result but is really too slow).
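The comparison snippet itself didn't survive in this thread; a hypothetical reconstruction (made-up data, group count reduced so the sketch runs quickly) might look like:

```python
import time
import numpy as np
import pandas as pd

# Hypothetical frame mimicking "many small Date groups with scattered NaNs"
dates = pd.date_range("2015-01-01", periods=2000)
idx = pd.MultiIndex.from_product([dates, [0, 1, 2]],
                                 names=["Date", "InputTime"])
rng = np.random.default_rng(42)
vals = rng.random(len(idx))
vals[rng.random(len(idx)) < 0.3] = np.nan
df = pd.DataFrame({"Value": vals}, index=idx)

t0 = time.perf_counter()
simple = df.ffill()                         # fast, but fills across Dates
t1 = time.perf_counter()
by_date = df.groupby(level="Date").ffill()  # resets at each Date, much slower
t2 = time.perf_counter()
print(f"simple: {(t1 - t0) * 1e3:.1f} ms  groupby: {(t2 - t1) * 1e3:.1f} ms")
```

The grouped variant leaves a NaN wherever a Date group starts with NaN, which is exactly the behavior the simple `ffill` gets wrong.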
Here are the time results I have:
So, the fill forward on the grouped dataframe is 10000 times slower than the simple fill forward. I cannot understand why pandas runs so slowly here. I need performance comparable to the simple fill forward, i.e. just a couple of milliseconds.
Could somebody address the performance issue? Or give me a solution to do this kind of action in a very efficient way?
Thanks