DateTimeIndex values are assigned across entire df when using .loc #9478

Closed
alan-wong opened this issue Feb 12, 2015 · 6 comments · Fixed by #9479
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

@alan-wong

I posted a 2-part answer to this question on SO: http://stackoverflow.com/questions/28482553/pandas-set-value-of-column-to-value-of-index-based-on-condtion

What I noticed is that if your index is a DatetimeIndex, the assignment does not respect the column selection and overwrites the values in every column and row.

I am using pandas 0.15.2 with NumPy 1.9.1 and Python 3.4 (64-bit).

Example:

In [46]:

import numpy as np
import pandas as pd

rows = 3
df = pd.DataFrame(np.random.randn(rows,2), columns=list('AB'), index=pd.date_range('1/1/2000', periods=rows, freq='1H'))
print(df)
df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = df.loc[df.A > 0.5].index
df
                            A         B
2000-01-01 00:00:00 -0.761643  0.969167
2000-01-01 01:00:00  0.050335 -1.346953
2000-01-01 02:00:00  0.663857 -0.272247
Out[46]:
                             A                             B  \
2000-01-01 00:00:00 1970-01-01           1970-01-01 00:00:00   
2000-01-01 01:00:00 1970-01-01 1969-12-31 23:59:59.999999999   
2000-01-01 02:00:00 1970-01-01           1970-01-01 00:00:00   

                    LAST_TIME_A_ABOVE_X  
2000-01-01 00:00:00                 NaT  
2000-01-01 01:00:00                 NaT  
2000-01-01 02:00:00 2000-01-01 02:00:00 

If the index is an Int64 type then this doesn't happen.

If you reset the index and then assign the values using .loc then it works correctly

@alan-wong changed the title from "DateTimeIndex values is assigned across entire df when using .loc" to "DateTimeIndex values are assigned across entire df when using .loc" on Feb 12, 2015
@jreback
Contributor

jreback commented Feb 12, 2015

This is kind of tricky. You are assigning to what is possibly a single value (or maybe a list, depending on what A is) with a list-like, in other words an index, which may have zero or more elements regardless of the value of A.
If it's a list on the right-hand side, pandas will try to broadcast it, which is why you get values in every column.

So I would say this is not really safe to do. Assigning a single element of the index works:

In [57]: df = pd.DataFrame(np.random.randn(rows,2), columns=list('AB'), index=pd.date_range('1/1/2000', periods=rows, freq='1H'))

In [58]: df
Out[58]: 
                            A         B
2000-01-01 00:00:00  0.117055  0.554529
2000-01-01 01:00:00 -1.587738 -0.913139
2000-01-01 02:00:00  1.439404 -0.521966

In [59]: df.loc[df.A>0.5].index[0]
Out[59]: Timestamp('2000-01-01 02:00:00', offset='H')

In [60]: df.loc[df.A>0.5,'new'] = (df.loc[df.A>0.5].index)[0]

In [61]: df
Out[61]: 
                            A         B                 new
2000-01-01 00:00:00  0.117055  0.554529                 NaT
2000-01-01 01:00:00 -1.587738 -0.913139                 NaT
2000-01-01 02:00:00  1.439404 -0.521966 2000-01-01 02:00:00

Here is a better, safer way to do this:

In [91]: df['new'] = df.index.to_series().where(df.A>0.5)

In [92]: df
Out[92]: 
                            A         B        new
2000-01-01 00:00:00  0.559047 -0.366489 2000-01-01
2000-01-01 01:00:00  0.249399 -0.780957        NaT
2000-01-01 02:00:00 -1.244441  1.364961        NaT
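
The `where` approach above can be reproduced with fixed numbers instead of random ones. This is a minimal sketch, not part of the original comment: the values in `A` and `B` are made up for illustration, and the index is built with `pd.to_datetime` only to avoid depending on a frequency alias.

```python
import pandas as pd

# Fixed values (assumed for illustration) so the result is reproducible.
idx = pd.to_datetime(["2000-01-01 00:00", "2000-01-01 01:00", "2000-01-01 02:00"])
df = pd.DataFrame({"A": [-0.76, 0.05, 0.66], "B": [0.97, -1.35, -0.27]}, index=idx)

mask = df["A"] > 0.5

# to_series() gives a Series whose values and index are both the timestamps.
# where(mask) keeps the values where mask is True and puts NaT elsewhere,
# so the right-hand side has one entry per row and lines up with df by label.
df["LAST_TIME_A_ABOVE_X"] = df.index.to_series().where(mask)

print(df)
```

Because the right-hand side already has the same index as `df`, there is nothing to broadcast, and only the new column is written.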

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 12, 2015
@alan-wong
Author

OK, but is there a reason why the DatetimeIndex behaves differently from an ordinary column of datetime64 dtype? For instance, if after creating df you then do:

df.reset_index(inplace=True)
temp = df.loc[df.A > 0.5, 'index']
df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = temp

then this works as expected.

@jreback
Contributor

jreback commented Feb 12, 2015

@alan-wong What you are posting is not the same thing: you are assigning with an aligned Series, and this makes all the difference. The purpose of the index is to align things. When you use df.loc[df.A>0.5].index you get a DatetimeIndex, which is not aligned to anything; it's just like a list of values.

In [131]: mask=df.A>0.5

In [132]: mask
Out[132]: 
2000-01-01 00:00:00     True
2000-01-01 01:00:00    False
2000-01-01 02:00:00    False
Freq: H, Name: A, dtype: bool

In [133]: df.loc[mask,'new'] = df.loc[mask].index.tolist()

In [134]: df
Out[134]: 
                            A         B        new
2000-01-01 00:00:00  1.321158 -1.546906 2000-01-01
2000-01-01 01:00:00 -0.202646 -0.655969        NaT
2000-01-01 02:00:00  0.193421  0.553439        NaT

I'll mark this as a bug. It seems the index (on the right-hand side) is not being treated as list-like; instead, maybe it's trying to align.
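
The difference between an aligned Series and a plain list of values on the right-hand side can be seen in a small sketch. It is not from the thread: the frame, the column names `by_label` and `by_position`, and the reversed ordering are all assumptions chosen to make the effect visible, and it assumes a pandas version that includes the fix.

```python
import pandas as pd

idx = pd.to_datetime(["2000-01-01 00:00", "2000-01-01 01:00", "2000-01-01 02:00"])
df = pd.DataFrame({"A": [0.9, -0.2, 0.7]}, index=idx)
mask = df["A"] > 0.5  # selects the first and third rows

# The two selected timestamps, deliberately listed in reverse row order.
reversed_values = [idx[2], idx[0]]

# A Series carries labels, so each value goes to the row with the same label.
df.loc[mask, "by_label"] = pd.Series(reversed_values, index=reversed_values)

# A plain list has no labels, so the values are written in positional order.
df.loc[mask, "by_position"] = reversed_values

print(df)
```

In `by_label` each selected row ends up holding its own timestamp, while in `by_position` the two timestamps are swapped, because the list is consumed in the order given.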

@jreback jreback added the Bug label Feb 12, 2015
@alan-wong
Author

OK, I see the semantic difference, but just to point out that you don't observe this behaviour if the index is int64:

rows = 3
df = pd.DataFrame(np.random.randn(rows,2), columns=list('AB'), index=np.arange(rows))
df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = df.loc[df.A > 0.5].index

This works as expected, which is why I think this is something specific to DatetimeIndex.

@jreback
Contributor

jreback commented Feb 13, 2015

@alan-wong ok, fixed up in master.
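
One way to check whether a given pandas installation behaves as described after the fix is to rerun the original assignment on fixed data and confirm that `A` and `B` are left alone. This is only a sketch with made-up values; it is not the test from the fixing pull request.

```python
import pandas as pd

idx = pd.to_datetime(["2000-01-01 00:00", "2000-01-01 01:00", "2000-01-01 02:00"])
df = pd.DataFrame({"A": [-0.76, 0.05, 0.66], "B": [0.97, -1.35, -0.27]}, index=idx)
before = df[["A", "B"]].copy()

mask = df["A"] > 0.5

# The assignment from the original report: a DatetimeIndex on the right-hand side.
df.loc[mask, "LAST_TIME_A_ABOVE_X"] = df.loc[mask].index

# With the fix in place, only the new column should change.
untouched = df[["A", "B"]].equals(before)
print("A and B untouched:", untouched)
print(df)
```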

@alan-wong
Author

Cheers Jeff, sterling work as always
