ENH/REF: More options for interpolation and fillna #4915
@@ -1164,6 +1164,42 @@ def backfill_2d(values, limit=None, mask=None): | |||
pass | |||
return values | |||
|
|||
|
I would move these routines to core/algorithms.py (now that we have more than one). In case you need helper functions, that will be easier.
separate the 1-d and 2-d cases: if you are doing 1-d interp, e.g. on a Series (which can also be done by applying interpolate on a frame), then you are just using 1-d data, with an index (x) and values (y), which produce the new values you return. The y have nans (if they don't, it should just pass thru, right?). 2-d is a completely different case: what does scipy do here? like a grid interp? (obviously these only work on frames). also... not sure you even need to fill at all, isn't that the point of interp? (or is it possible that nans are returned from these routines?) |
The real difference between 1d and 2d is that the function object returned by
The interpolated values shouldn't normally contain |
@TomAugspurger can you give an example in action, maybe am confused, but I thought the idea is to:
after interpolate
|
That would be correct. But I think I've seen questions on SO about how to do something like:
>>> s = Series([0, 1, 2, 3])
>>> s.interpolate([0.5, 1.5, 2.5]) # interpolate what the Series would be at these points.
0.5 0.5
1.5 1.5
2.5 2.5
dtype: float64 |
Do you see a use for that? If so, does it belong under interpolate or elsewhere? I guess the connection to filling
In [66]: s = Series([0, 1, 2, 3])
In [67]: s.reindex([0, .5, 1, 1.5, 2, 2.5, 3]).interpolate()
Out[67]:
0.0 0.0
0.5 0.5
1.0 1.0
1.5 1.5
2.0 2.0
2.5 2.5
3.0 3.0
dtype: float64
Maybe that's a better way to think about it. |
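The reindex-then-fill idea from this exchange can be sketched with calls that exist in current pandas. This is only an illustration: it assumes a pandas version where `interpolate` accepts `method="index"` (not something the snippet above uses), and the variable names are made up for the example.

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0, 3.0])

# Reindexing onto a finer grid introduces NaN at the new x-values.
new_index = [0, 0.5, 1, 1.5, 2, 2.5, 3]
expanded = s.reindex(new_index)

# method="index" interpolates against the index values rather than
# treating the rows as equally spaced.
filled = expanded.interpolate(method="index")

# Selecting only the new points gives "the Series at these x-values".
at_new_points = filled.loc[[0.5, 1.5, 2.5]]
print(at_new_points)
```

Seen this way, filling NaNs and evaluating at new points are the same operation applied to different indexes.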
so maybe a method signature of this
makes sense my example would be ? |
I don't your 2nd example (reindex and fill) can be done exactly like that if the user wants
|
actually.....this might be easy, if |
I would add one more argument to the function signature: a way to specify what values to use for the x-values. By default xvalues would be the index, like we've been assuming. But if you have a DataFrame where you want the x-values to be column
Gives something like
In [73]: df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [4, 5, 6, np.nan], 'C': [1, 2, 3, 5]})
In [74]: df
Out[74]:
A B C
0 1 4 1
1 2 5 2
2 NaN 6 3
3 4 NaN 5
In [75]: df.interpolate(xvalues='C', values='B')
Out[75]:
1 4
2 5
3 6
5 8 |
I'm liking the idea of reindexing and filling. It helps clarify why my two use cases (filling NaNs in an existing frame and interpolation at new values) are really the same. |
why don't we call it I think
so cases are:
so most of this is really just argument interpretation.... |
Yes on index for x-values, values for y-values. That will be nice for 2d interpolation too.
In [86]: s = Series([0, 1, 2, 3])
In [87]: s.interpolate(values=[1.5, 2.5])
1.5 1.5
2.5 2.5
dtype: float64
I'll need to think about how this one is implemented.
|
your example for 2) doesn't fit with reindexing (well, it's not really indexing, more of setting an index). I think you need to require the index to be the same length; how else would you map it? (and if you really did want to map it differently, then the user should fix the series first) |
Actually hold on, there's potentially another case
In [92]: df2 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [1, 4, 9, 16]})
In [93]: df2
Out[93]:
A B
0 2 1
1 4 4
2 6 9
3 8 16
In [94]: df2.interpolate(index='A', values='B', new_values=[2.5, 4.5, 6.5])
Out [94]:
2.5 1.75
4.5 5.25
6.5 10.75
This is getting a bit hairy. Maybe we should limit the scope for now. Add features as needed. |
This is related to your last comment @jreback so I guess your answer there applies to my last post. My disagreement with your 2.) came from me mixing up the |
@TomAugspurger not sure I buy that last case.....not even sure how that maps.....simple to start you could always support values as say a dict if you need something like that (but later) |
is |
Stepping down from the abstraction the arguments would be:
plus the Maybe I'll start with the nan-filling behavior first and not having a |
Tentative signature and docstring
def interpolate(self, index=None, values=None, new_values=None,
method='linear', inplace=False, limit=None, axis=0):
"""Interpolate values according to different methods.
Parameters
----------
index : arraylike. The domain of the interpolation. Uses the
Series' or DataFrame's index by default.
values : arraylike. The range of the interpolation. Uses the values
in a Series or DataFrame by default. Can also be a column name.
index and values *must* be of the same length.
new_values : arraylike or None.
If new_values is None, will fill NaNs.
If new_values is an array, will return a Series containing
the interpolated values and whose index is new_values.
method : str or int. One of {'linear', 'time', 'values', 'nearest',
'zero', 'slinear', 'quadratic', 'cubic'}. Or an integer
specifying the order of the spline interpolator to use. Is linear
by default. Some of the methods require scipy. TODO: Specify which ones.
inplace : bool, default False
limit : int, default None. Maximum number of NaNs to fill.
axis : int, default 0
Returns
-------
if new_values is None:
Series or Frame of same shape with NaNs filled
else:
Series with index new_values
See Also
--------
reindex, replace, fillna
Examples
--------
# Filling in NaNs:
>>> s = pd.Series([0, 1, np.nan, 3])
# index=s.index, values=s.values; new_values is None so filling NaNs
>>> s.interpolate()
0 0
1 1
2 2
3 3
dtype: float64
# Linear interpolation on Series at new values
>>> s = pd.Series([0, 1, 2, 3])
>>> s.interpolate(new_values=[0.5, 1.5, 2.5])
0.5 0.5
1.5 1.5
2.5 2.5
dtype: float64
# Using two columns from a DataFrame
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'Y': [1, 5, 9, np.nan]})
>>> df.interpolate(index='A', values='Y') # fill the NaN
A Y
0 1 1
1 2 5
2 3 9
3 4 13
""" |
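The two modes in the tentative docstring (fill NaNs when `new_values` is None, otherwise evaluate at `new_values`) can be sketched with `numpy.interp`. `interpolate_at` is a hypothetical helper written for illustration, not the PR's implementation; it assumes a numeric, increasing `index` and linear interpolation only.

```python
import numpy as np
import pandas as pd


def interpolate_at(index, values, new_values=None):
    """Hypothetical sketch of the index/values/new_values semantics."""
    x = np.asarray(index, dtype=float)
    y = np.asarray(values, dtype=float)
    if len(x) != len(y):
        raise ValueError("index and values must be of the same length")

    valid = ~np.isnan(y)
    if new_values is None:
        # Mode 1: keep the shape, fill only the NaN positions.
        filled = y.copy()
        filled[~valid] = np.interp(x[~valid], x[valid], y[valid])
        return pd.Series(filled, index=index)

    # Mode 2: evaluate the interpolant at the requested x-values.
    new_x = np.asarray(new_values, dtype=float)
    return pd.Series(np.interp(new_x, x[valid], y[valid]), index=new_x)


filled = interpolate_at([0, 1, 2, 3], [0, 1, np.nan, 3])
at_new = interpolate_at([2, 4, 6, 8], [1, 4, 9, 16], new_values=[2.5, 4.5, 6.5])
print(filled)
print(at_new)
```

Note that `numpy.interp` does not extrapolate: outside the range of valid points it returns the nearest boundary value, so a trailing NaN is filled with the last valid value rather than a projected one.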
what does uses the 'values' in a DataFrame mean? I wouldn't allow e.g. still fuzzy on the good docstring writing! |
how does your last example decide 13 is the value? (I mean looking at it I get it, but how programmatically is that the case)? |
Identical to applying Series.interpolate() to every (numeric) column of the DataFrame. Or a list of column names to apply it to.
I made the last example up, and after looking at current behavior I think it's wrong. Currently
Which is the easy way to do things! The scipy methods would do the same thing by default. |
so fyi...if method is |
I think
In [8]: s = Series([0, 1, 2, np.nan, np.nan, 5])
In [9]: s.interpolate()
Out[9]:
0 0
1 1
2 2
3 3
4 4
5 5
dtype: float64 |
ok..sounds ok then get basics working (e.g. linear), then adding scipy functions should be straightforward |
One nasty thing the way
In [8]: s = Series([0, 1, 2, np.nan, np.nan, 5])
In [9]: s.interpolate()
Out[9]:
0 0
1 1
2 2
3 3
4 4
5 5
dtype: float64
will return the same as
In [11]: s = Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])
In [12]: s
Out[12]:
1 0
2 1
4 2
7 NaN
11 NaN
16 5
dtype: float64
In [13]: s.interpolate()
Out[13]:
1 0
2 1
4 2
7 3
11 4
16 5
dtype: float64
i.e. it's treating each value as "equally spaced". If we are treating interpolation the way I envision, we'd expect
In [11]: s = Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])
In [12]: s.interpolate()
1 0
2 1
4 2
7 2.75
11 3.75
16 5
dtype: float64
This may mean we have to tweak the |
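The difference between the two behaviours can be reproduced in current pandas. A small sketch, assuming a pandas version where `method="linear"` ignores the index and `method="index"` uses it as the x-values:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])

# "linear" treats the rows as equally spaced and ignores the index.
equally_spaced = s.interpolate(method="linear")

# "index" interpolates against the actual index values:
# slope between x=4 (y=2) and x=16 (y=5) is 3/12 = 0.25.
index_aware = s.interpolate(method="index")

print(equally_spaced)
print(index_aware)
```

With the index taken into account, the gaps at 7 and 11 are filled with 2.75 and 3.75, matching the expected output written out by hand above.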
why don't you call the existing |
That sounds like a good idea. I'll think about the names. |
Ah never-mind! Wes already solved this for us. One of the possible values to give to the original |
"PCHIP interpolation.") | ||
|
||
interp1d_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic'] | ||
if method in interp1d_methods or isinstance(method, int): |
fix this to validate order
if method='spline'
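One way the requested validation could look is sketched here. `check_spline_order` is a hypothetical helper, not code from the PR, and the 1 to 5 bound is an assumption that the spline is backed by `scipy.interpolate.UnivariateSpline`, whose degree `k` must satisfy 1 <= k <= 5.

```python
def check_spline_order(method, order=None):
    """Hypothetical check: 'spline' needs an explicit, valid integer order."""
    if method != "spline":
        # Other methods do not use `order` in this sketch.
        return order
    if order is None:
        raise ValueError("method='spline' requires an order to be specified")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(order, bool) or not isinstance(order, int):
        raise ValueError("order must be an integer, got %r" % (order,))
    if not 1 <= order <= 5:
        raise ValueError("order must satisfy 1 <= order <= 5, got %d" % order)
    return order


print(check_spline_order("spline", order=3))
```

Failing early with a clear message avoids passing a missing or invalid order down to the underlying interpolator.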
btw @TomAugspurger i'm just being a hardass ... i think this is great stuff ... very useful! |
echo.. @cpcloud these are mostly just nitpicks in any event! |
I'll never turn down free advice on programming. There's plenty to learn. |
@jreback Your prompt to allow kwargs reminded me that I forget to wrap Sphinx doesn't seem to like that I added the kwargs to interpolate. It's claiming that |
Let me know when I should rebase and squash again. |
new_x = xvalues[invalid] | ||
|
||
if method == 'time': | ||
if not xvalues.is_all_dates: |
I would do it like this:
if not getattr(xvalues, 'is_all_dates', None):
because if for some reason xvalues is an ndarray
this would fail
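The point of the `getattr` guard can be shown without depending on any particular pandas version. In this sketch `FakeDateIndex` is a made-up stand-in for an index object that exposes an `is_all_dates` attribute, and `looks_like_dates` is a hypothetical helper name.

```python
import numpy as np


class FakeDateIndex:
    """Stand-in for an index type that exposes `is_all_dates`."""
    is_all_dates = True


def looks_like_dates(xvalues):
    # getattr with a default avoids AttributeError on objects
    # (such as a plain ndarray) that lack the attribute.
    return bool(getattr(xvalues, "is_all_dates", False))


plain_array = np.array([1.0, 2.0, 3.0])
print(looks_like_dates(FakeDateIndex()))
print(looks_like_dates(plain_array))
```

Direct attribute access (`plain_array.is_all_dates`) would raise `AttributeError`, which is the failure mode the review comment is guarding against.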
@TomAugspurger you can squash and rebase when you are ready. You said above that Sphinx is complaining? |
its not picking up the current version when you build the docs....(its using your 0.12 version) in doc/source/conf.py
at the top of the file (which I then also have to take out as its not part of the default) I think @cpcloud has a better way though |
@TomAugspurger how are you building them? with |
|
can u try |
ENH: the interpolate method argument can take more values for various types of interpolation
REF: Moves Series.interpolate to core/generic. DataFrame gets interpolate
CLN: clean up interpolate to use blocks
ENH: Add additonal 1-d scipy interpolaters.
DOC: examples for df interpolate and a plot
DOC: release notes
DOC: Scipy links and more expanation
API: Don't use fill_value
BUG: Raise on panels.
API: Raise on non monotonic indecies if it matters
BUG: Raise on only mixed types.
ENH/DOC: Add `spline` interpolation.
DOC: naming consistency
|
@TomAugspurger gr8...just make sure to take it out (or I can do it for you when we merge)..... I just put it in to build the docs (then take it out)... |
@@ -174,6 +174,8 @@ Improvements to existing features | |||
- :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table | |||
from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`) | |||
- ``DataFrame.from_records()`` will now accept generators (:issue:`4910`) | |||
- ``DataFrame.interpolate()`` and ``Series.interpolate()`` have been expanded to include | |||
interpolation methods from scipy. (:issue:`4915`) |
change this to issues 4434, 1892 as that's what this is actually closing
merged via aff7346 thanks @TomAugspurger awesome job! |
Thanks for all the guidance / patience. |
docs are up and look good! http://pandas.pydata.org/pandas-docs/dev/missing_data.html#interpolation
|
Mistake on my part. I renamed all the Should I make a new PR to fix that quick? |
sure |
closes #4434
closes #1892
I've basically just hacked out the Series interpolate and stuffed it under generic under a big `if` statement. Gonna make that much cleaner. Moved the interpolation and fillna specific tests in `test_series.py` to `test_generic.py`.
API question for you all. The interpolation procedures in Scipy take an array of x-values and an array of y-values that form the basis for the interpolation object. The interpolation object can then be evaluated wherever, but it maps X -> Y; f(x-values) == y-values. So we have 3 arrays to deal with:
Preferences for names? The other issue is for defaults. Right now I'm thinking
new values
if an array is given.
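The x-values / y-values / new-values pattern described in the PR description can be sketched without scipy. `make_interpolator` is a hypothetical numpy-only stand-in for the callable that scipy's 1-d interpolators (for example `scipy.interpolate.interp1d`) return; it is linear only and assumes increasing x-values.

```python
import numpy as np


def make_interpolator(x_values, y_values):
    """Hypothetical stand-in: build f from (x, y), then evaluate f anywhere."""
    x = np.asarray(x_values, dtype=float)
    y = np.asarray(y_values, dtype=float)

    def f(new_values):
        # Linear interpolation; x must be increasing for np.interp.
        return np.interp(np.asarray(new_values, dtype=float), x, y)

    return f


x_values = [0, 1, 2, 3]      # the domain (the index by default)
y_values = [0, 10, 20, 30]   # the range (the Series values)
f = make_interpolator(x_values, y_values)

# f reproduces the original points and can be evaluated at new ones.
print(f(x_values))
print(f([0.5, 2.5]))
```

The three arrays in the API question map onto `x_values`, `y_values`, and whatever is passed to `f`.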