-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Allow interpolate() to fill backwards as well as forwards #10691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I also have a local change to the documentation that adds an example of using the new kwarg, but I'd like to verify first that people think the behavior implemented here is good. |
From an API perspective, it would be nice if |
I would rather see a |
Thanks for the feedback! @jreback if there's a forward/backward flag (or ffill/bfill), then the interpolation can only fill in one direction at a time? So I would have to do two interpolations to fill values around non-NaNs? (This is currently the case for fillna as well, just confirming that's what you have in mind.) It would be nice if I could fill "forward", "backward", or "around" with just one call, since I can currently use your suggestion from #9218 and just reverse the sequence and interpolate (or fill) if I'm going to interpolate twice anyway. @shoyer I agree that it would be nice to respect passing the "backward_limit" kwarg only ... this was just my first attempt to put some code in place. It seems like we ought to work out this forward/backward/around issue before getting too detailed on the backward_limit. |
|
Yes, totally agree, api consistency is a good thing! I like the idea of adding What if |
Sorry for the noise, I got my branch in a weird state and did a hard reset. I changed the code so that limit accepts either an int (preserves current behavior) or a tuple of (backward, forward) ints. |
Any comments on this version of the change? |
@shoyer what you do you think? |
To me, That said, we are here where we are today, and we do want to make some effort to preserve backwards compatibility. I'm not a really big fan of using tuples instead of another keyword argument. It seems a little awkward and not so Pythonic. I'm also not sure if there are legitimate use cases for limiting the interpolation only in one direction (maybe?). If we don't care about asymmetrical limits, we could add a new keyword argument |
@shoyer I get what you're saying about using a tuple, it's a bit odd considering the way the rest of the pandas api works (at least what I've seen of it). If we want to add a new kwarg, it could be something categorical like I'd also be fine with your |
"If we want to add a new kwarg, it could be something categorical like This seems like a very good way to me. You can preserve backwards compatibility by setting the default to "forward" which also makes it obvious to a new user in the documentation why interpolate is only extrapolating in one direction (and how to change that). It's also consistent with FWIW, I think allowing asymmetric behavior is really important overall for this command. I don't think having different limits is all that important though. That is, a user will often want to interpolate forwards or backwards (but not both). In cases where the user wants to go in both directions I'd expect a symmetric limit is fine. I.e. I think it would be fairly uncommon to want to set a limit of, say, 2 in one direction and 5 in the other. And on those rare occasions, I think just running 2 separate interpolations is not that onerous. |
I would be +1 on adding |
I've updated the change in my local branch to add a The change does not change the current behavior at all when I'm not such a big fan of I thought a bit about a |
@lmjohns3 Sorry about the Regarding extrapolation, I think things are still a little strange in this sense. If you do the simplest version But I don't mean to further complicate the current issue. One step at a time and I think this is a very good change you are making.
|
@johne13 This is a good point you bring up, and might require a bit more thought/testing (but I'm not sure!). It looks like interpolate() actually is a sort of wrapper for several different interpolation methods. The change as it currently stands only applies to the variant where |
I just checked further and the code is a bit wacky, but it looks like this change should be ok. There are basically two branches inside the There's also a The interpolate methods are called from a number of different places, but the main one I'm after is If anyone is more familiar with the API and this raises any red flags let me know! |
Any further comments on this PR? |
ser.interpolate(limit=2) | ||
|
||
By default, ``limit`` applies in a forward direction, so that only ``NaN`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would make a sub-section (under interpolation).
Thanks for the comments! I've addressed the test issues, and I've tried to update the docs in the way that I think you were describing. I made a new "interpolation limits" section in the "missing_data.rst" file, but I'm unsure if it's kosher -- I used ^^^^^ for the headline level, but there are no other headlines in any of the other docs that use that level. Maybe I should just make the section header bold? Likewise in the whatsnew file I added a small example, but if anything about the formatting etc. looks weird to you let me know. |
Any further comments on this? And suggestions for the ^^^^ header level? |
@lmjohns3 this is what I do, others just add everything in their own commit. Its sort of up to you when deving. Eventually I'll have you squash if you didn't, so this is sort of a continuous squash. |
lgtm. |
limit_direction='symmetric') | ||
assert_series_equal(result, expected) | ||
|
||
expected = Series([1., 3., np.nan, 7., 9., 11.]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we don't have a full tests for #10420. E.g. I think Series([np.nan.....])
is the idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good catch. I'll add that as soon as I can, but probably won't be until Sunday.
@lmjohns3 need validation on the |
OK, I've made a number of updates to the patch that address the comments above.
In dealing with the case of filling a series backward to the beginning, I had to muck around with the interpolate_1d function a good bit more than before. This makes the diff quite ugly ... I'd be happier with it if I could check in a separate PR first that does the code-rearranging part of things. Any opinions on how to do that? Or is everyone comfortable with an ugly change that is nonetheless pretty well tested? Thanks for everyone's feedback on this by the way! |
Redundant "both" and "symmetric" arguments seems like a bad idea. I would On Tue, Aug 18, 2015 at 11:11 AM, Leif Johnson [email protected]
|
Done -- it's just 'both' now. |
I was also a bit surprised by this change, because it turns out that both To get a true extrapolation you'd have to use something like this:
But at least it's possible now! |
hmm, can you show what you mean (e.g. using numpy/scipy) and the new method side-by-side in a small example? |
if not valid.any(): | ||
# have to call np.array(xvalues) since xvalues could be an Index | ||
# which cant be mutated | ||
result = np.empty_like(np.array(xvalues), dtype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do np.asarray(xvalues)
I think.
you need some tests where you have multiple groups of
and we need some cases for testing where you have multiple groups some of which satisfy the limit and some don't. I dont think your algorithm will handle these ATM. |
@lmjohns3 any luck with this? |
Sorry, I've had this on hold. I'm supposed to turn in a draft of my dissertation this week. I will have some more time to work on this next week. |
This change adds a new keyword argument for specifying the direction in which an interpolation limit ought to be applied. The default, when ``limit_direction`` is 'forward', fills in NaN values after non-NaNs, for up to ``limit`` consecutive fills. No interpolated values are used to fill NaNs before non-NaNs. This matches the current behavior for interpolate. If ``limit_direction`` is 'backward', the interpolation will fill in up to ``limit`` NaN values *before* non-NaN values, for up to ``limit`` consecutive fills. No interpolated values will be used to fill NaNs after non-NaNs. If ``limit_direction`` is another value, the interpolation will fill in NaN values both before and after non-NaNs, up to ``limit`` in each direction, so that up to ``2 * limit`` consecutive values will be filled in around each non-NaN segment.
I just checked in a few changes to address the comments above. It looks to me like the algorithm does work for gaps of any length. In the example you suggested here is the behavior:
In the first example, all values are filled, because none of the NaNs is further than 2 from a non-NaN value. In the second, the "5" value remains a NaN because that slot is further than 1 from all non-NaN values. I added these two examples as tests. Also, here's an example of the "filling" behavior of
To me this says, "I've observed value ( But if you think these values are actually sampled from some continuous underlying function, and you want to reconstruct this function even outside the [3, 5] interval, you have to use a spline interpolation:
This works pretty well in
You have to specify |
@lmjohns3 ok, I see you added those results as tests. gr8! ping me when green and I'll have another look. |
Looks like the tests have passed! |
Allow interpolate() to fill backwards as well as forwards
thanks @lmjohns3 nice change! |
Thanks for the guidance!! |
Looking at the "what's new" doc I see the following example:
Given the description of the method I would have expected:
Am I wrong? Why do we fill the first element? |
Your expectation seems correct to me -- the first element is further than 1 from a non-NaN value and doesn't seem like it should be filled. I'll look into it and see if something is going haywire in the code, or if it's just a documentation issue. |
Yes, this is buggy, excellent find! The method for computing the NaN mask got bogged down when the NaN occurs within |
closes #9218
closes #10420
Edit: changed approach, see the rest of the thread
Currently, interpolate() has a "limit" kwarg that, when set to n,
prevents interpolated values from propagating more than n rows forward.
This change adds a "backward_limit" kwarg that, when set to n, prevents
interpolated values from propagating more than n rows backward.
The behavior prior to this change is as though "backward_limit" existed
but was always set to 0. That is, interpolated values never filled in
NaNs immediately before a non-NaN value (unless those same NaNs happened
to follow a different, non-NaN value within "limit" rows).
In this change the "backward_limit" kwarg has no effect if "limit" is
not specified (i.e., None).