BUG: reindex on index in a frame using a not-None method is buggy #5669

Open
jreback opened this issue Dec 9, 2013 · 9 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@jreback
Contributor

jreback commented Dec 9, 2013

reported here:

http://stackoverflow.com/questions/20459782/what-is-the-functionality-of-the-filling-method-when-reindexing

There do not appear to be any tests for it, nor is the fix trivial.

import pandas as pd
import numpy as np

hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
# .ix has been removed from pandas; .loc performs the same label-based slice
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
daily1 = hf.reindex(index=ind_daily, method='ffill')
@behzadnouri
Contributor

I do not think this is a bug, and in my opinion the current behavior makes more sense. 'nan' values can be valid "actual" values in some scenarios. An actual 'nan' value in the data should be treated differently from a 'nan' value that appears only because the index changed. If I have a dataframe like this:

       A      B      C
1  1.242    NaN  0.110
3    NaN -0.185 -0.209
5 -0.581  1.483    NaN

and i want to keep all nan as nan, it makes much more sense to have:

 df.reindex( [2, 4, 6], method='ffill' )
        A      B      C
2  1.242    NaN  0.110
4    NaN -0.185 -0.209
6 -0.581  1.483    NaN

just take whatever value there is ( nan or not nan ) and fill forward until the next available index. This is completely different from

df.reindex( [2, 4, 6], method=None )

which produces

    A   B   C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN

TLDR: in reindexing a dataframe, forward fill means just take whatever value there is ( nan or not nan ) and fill forward until the next available index. Otherwise, you have no choice but to fill 'nan' values simply because you want to reindex. Reindexing should not enforce a mandatory fillna on the data.
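
A minimal sketch of the index-based fill described above, using the frame from this comment. The outputs in the comments are what recent pandas releases produce; the behaviour at the time of this thread may have differed, which is the point under discussion.

```python
import numpy as np
import pandas as pd

# Frame from the comment above: NaN entries are treated as real values.
df = pd.DataFrame(
    {
        "A": [1.242, np.nan, -0.581],
        "B": [np.nan, -0.185, 1.483],
        "C": [0.110, -0.209, np.nan],
    },
    index=[1, 3, 5],
)

# Index-based fill: each new label takes the whole row of the closest
# preceding original label, NaN included.
index_filled = df.reindex([2, 4, 6], method="ffill")

# No fill: none of the new labels exist in the original index,
# so every cell is NaN.
not_filled = df.reindex([2, 4, 6], method=None)

print(index_filled)
print(not_filled)
```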

@jreback
Contributor Author

jreback commented Dec 9, 2013

df.reindex(new_index, method='ffill') should be equivalent to df.reindex(new_index).ffill()

your middle example can be done by taking a union of the existing indices and the new, then forward-filling

you can also specify a fill_value if you want something other than nan
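
For reference, a rough sketch of the three options mentioned in this comment, applied to the small frame from the previous comment. The variable names are illustrative only. Note that the union-then-ffill variant also fills NaN values that were already in the data, so it is value-based rather than index-based.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.242, np.nan, -0.581],
        "B": [np.nan, -0.185, 1.483],
        "C": [0.110, -0.209, np.nan],
    },
    index=[1, 3, 5],
)
new_index = [2, 4, 6]

# 1) Reindex first, then forward-fill values. No new label exists in the
#    original index, so there is nothing to propagate here.
after_reindex = df.reindex(new_index).ffill()

# 2) Reindex to the union of old and new labels, forward-fill values,
#    then select the new labels.
union = df.index.union(new_index)
via_union = df.reindex(union).ffill().loc[new_index]

# 3) Use a constant instead of NaN for labels that have no match.
zero_filled = df.reindex(new_index, fill_value=0.0)

print(via_union)
```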

@behzadnouri
Contributor

"your middle example can be done by taking a union of the existing indices and the new, then forward-filling"

why?! why this way?

just keep reindexing as reindexing, and fillna as fillna. a "nan" value can be an actual valid value, when you reindex with ffill you want to take all the actual values ( nan or not nan) and fill forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.

@jreback
Contributor Author

jreback commented Dec 9, 2013

I am not sure what you mean by "Reindexing should not enforce a mandatory fillna on the data."

by definition np.nan is the marker for missing data. you have the option to provide a fill-forward if you want (or not); you can also fill with a specific value (fill_value=). But reindexing will by definition possibly create missing values. Not sure that you can have both a missing value and a np.nan (unless you want to swap the nan to something else first).

FYI, this behavior has been there for as long as I can remember. It IS tested for in the context of a non-monotonic index (this is an error), but not in the general case.

Series works this way, the bug is on DataFrame which does not.

@behzadnouri
Contributor

If I want to reindex with ffill just to forward fill whatever value is in the original dataframe ( again nan or not nan ) until the next available index, but I do not want to fillna what should I do?

what you are saying is that, reindex will do the fillna for me and then I have to revert that.

Here is an example: np.nan can just mean not applicable; say i have hourly data, and on weekends some calculations are just not applicable. I will fill nan for those columns during the weekends. now if I reindex to finer index, say every minute, the reindex will pick the last value from Friday, and fill it out for the whole weekend. This is wrong.

If this is the behavior for the time series then maybe there should be a bug report there.

If I want to fillna I can always call fillna directly.
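
A small sketch of the scenario above, with made-up values and a NaN standing for "not applicable". It contrasts filling by index position with filling by value. The results in the comments are what recent pandas releases give; they are not a statement about the releases discussed in this thread.

```python
import numpy as np
import pandas as pd

# Hourly observations; the 10:00 NaN means "not applicable", not "unknown".
hourly = pd.Series(
    [1.0, np.nan, 3.0],
    index=pd.date_range("2013-05-10 09:00", periods=3, freq="60min"),
)
finer = pd.date_range("2013-05-10 09:00", "2013-05-10 11:00", freq="30min")

# Index-based: 10:30 copies the 10:00 observation, which is NaN.
index_based = hourly.reindex(finer, method="ffill")

# Value-based: the NaN at 10:00 and 10:30 is overwritten by the 09:00 value.
value_based = hourly.reindex(finer).ffill()

print(index_based)
print(value_based)
```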

@jreback
Contributor Author

jreback commented Dec 9, 2013

see here for the current docs: http://pandas.pydata.org/pandas-docs/dev/basics.html#filling-while-reindexing
Series currently does this, DataFrame does not

can you make your words into an example and show me what you mean?

@behzadnouri
Contributor

say you have trade data across markets, and you are measuring the correlations across these markets: Tokyo, London, New York, Chicago.

These markets open and close at different hours during the day, so for some periods you can measure corr( New York, London ) but you just have to fill nan for corr( New York, Tokyo ) at say 11am EST, simply because it is not possible to measure the correlation when the market is closed.

Now, if I reindex the time series to a different frequency (say every half hour), it should not fill in the nan values in the dataframe for the times the market is closed. It should just forward fill whatever is in the original dataframe.

is the example clear?

@dbew
Contributor

dbew commented Jan 30, 2014

We came across this as well. We expected behaviour similar to what @behzadnouri described, i.e. forward fill within buckets. Hopefully the examples below explain what I mean by this.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10.), index=range(10), columns=['A'])
df.iloc[2] = np.nan
df.iloc[5:9] = np.nan  # rows 5 through 8, matching the output shown below
df
#     A
# 0   0
# 1   1
# 2 NaN
# 3   3
# 4   4
# 5 NaN
# 6 NaN
# 7 NaN
# 8 NaN
# 9   9


# Straight reindex, no fill. Value for 2, 6 and 8 should be np.nan.
df.reindex(range(0, 10, 2))
#     A
# 0   0
# 2 NaN
# 4   4
# 6 NaN
# 8 NaN


# reindex with ffill - current behaviour - gets the wrong value for 2.
df.reindex(range(0, 10, 2), method='ffill')
#     A
# 0   0
# 2 NaN   # should not be NaN
# 4   4
# 6 NaN
# 8 NaN


# reindex with ffill - expected behaviour - fill within "buckets" - so we expect
# value for 2 to be 1 (ffilled from 1) but values for 6, 8 to be NaN (no data to ffill).
df.reindex(range(0, 10, 2), method='ffill')
#     A
# 0   0
# 2   1
# 4   4
# 6 NaN
# 8 NaN


# behaviour when reindexing then ffilling - note that this is different to reindex with
# method='ffill' because we ffill *after* the reindex instead of during the reindex.
# In particular the value for 2 is now 0, not 1, and for 6 and 8 we have the value 4.
df.reindex(range(0, 10, 2)).ffill()
#    A
# 0  0
# 2  0
# 4  4
# 6  4
# 8  4
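
One possible way to get the "fill within buckets" result for any sorted target index is sketched below. `bucketed_ffill_reindex` is a hypothetical helper written for illustration, not a pandas function. It assumes the targets are sorted ascending and that a bucket is the set of original labels in the half-open interval (previous target, target].

```python
import numpy as np
import pandas as pd


def bucketed_ffill_reindex(frame, targets):
    """Hypothetical helper: for each target label, take the last non-NaN
    value among original rows in (previous target, target]."""
    targets = np.asarray(targets)
    # Position of the first target >= each original label, i.e. its bucket.
    bucket = np.searchsorted(targets, frame.index.values, side="left")
    # Rows after the last target belong to no bucket and are dropped.
    in_range = bucket < len(targets)
    # GroupBy.last() skips NaN, so it returns the last valid value per bucket
    # and NaN when the bucket holds no valid value at all.
    last_valid = frame[in_range].groupby(bucket[in_range]).last()
    last_valid.index = targets[last_valid.index.values]
    # Targets whose bucket is empty come back as NaN.
    return last_valid.reindex(targets)


df = pd.DataFrame(np.arange(10.), index=range(10), columns=["A"])
df.iloc[2] = np.nan
df.iloc[5:9] = np.nan

result = bucketed_ffill_reindex(df, range(0, 10, 2))
print(result)
```

With the frame from this comment the helper returns 0, 1, 4, NaN, NaN for labels 0, 2, 4, 6, 8, which is the "expected behaviour" listed above.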

@jreback
Contributor Author

jreback commented Jan 30, 2014

Here's one way to do this; essentially your own groupby

In [36]: concat([df.loc[:4].ffill(),df.loc[5:].ffill()]).reindex(range(0,10,2))
Out[36]: 
    A
0   0
2   1
4   4
6 NaN
8 NaN

[5 rows x 1 columns]

Another way is to introduce a MultiIndex where your data 'breaks' and treat the groups separately.
(this is not exactly your result, but close); MultiIndex.from_product is new in 0.13.1

In [55]: df.index = MultiIndex.from_product([list('ab'),list(range(5))])

In [56]: df
Out[56]: 
      A
a 0   0
  1   1
  2 NaN
  3   3
  4   4
b 0 NaN
  1 NaN
  2 NaN
  3 NaN
  4   9

[10 rows x 1 columns]

In [54]: df.groupby(level=0).apply(lambda x: x.ffill().reset_index(drop=True).reindex(range(0,5,2))).reset_index(drop=True)
Out[54]: 
    A
0   0
1   1
2   4
3 NaN
4 NaN
5   9

[6 rows x 1 columns]

HTH

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@mroeschke mroeschke removed the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 11, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
4 participants