view into modified dataframe of ints causes subsequent set_value to not work properly #10264

Closed
jbezaire opened this issue Jun 3, 2015 · 15 comments · Fixed by #10272
Labels: Bug, Indexing (Related to indexing on series/frames, not to indexes themselves)

Comments

@jbezaire

jbezaire commented Jun 3, 2015

The following code snippet produces unexpected output:

import pandas as pd
import numpy as np
x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])
x['f'] = 0
x.f.values[3]=1
y = x.iloc[np.arange(2,len(x))]
x.f.values[3]=2
print x
print x.f.values
Out[130]: 
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  1
4  0  0  0  0  0  0
[0 0 0 2 0]

The expected output is:

Out[131]: 
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  2
4  0  0  0  0  0  0
[0 0 0 2 0]

The behavior only manifests if the dataframe has had a column added to it (f in this case), if the columns are ints, and if the view is constructed with certain kinds of indexing (y=x.iloc[2:] works fine for instance).
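Part of why the indexer kind matters: at the NumPy level, basic slicing returns a view that shares memory with the original array, while fancy (array) indexing returns an independent copy. A minimal illustration, independent of pandas:

```python
import numpy as np

a = np.zeros(5, dtype=np.int64)
sliced = a[2:]              # basic slice: a view sharing a's memory
fancy = a[np.arange(2, 5)]  # fancy indexing: an independent copy

a[3] = 7
print(sliced)  # [0 7 0] -- the view sees the change
print(fancy)   # [0 0 0] -- the copy does not
```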

Here is my version info:

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.4.2.el7.jump.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2.dev
nose: 1.3.4
Cython: 0.20.2
numpy: 1.9.0
scipy: 0.14.1rc1.dev-Unknown
statsmodels: None
IPython: 3.1.0
sphinx: 1.3b1
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.4.3
sqlalchemy: None
pymysql: 0.6.2.None
psycopg2: 2.5.4 (dt dec pq3 ext)

@shoyer added the Bug and Indexing (Related to indexing on series/frames, not to indexes themselves) labels Jun 3, 2015
@shoyer
Member

shoyer commented Jun 3, 2015

I can reproduce this on master. Definitely a bug -- thanks for the report!

@jreback
Contributor

jreback commented Jun 3, 2015

I know this looks like a bug, and it might be. However, you are violating the guarantees: the .values object does not guarantee that you will get a referenceable consolidated view of a DataFrame.

For a Series this is always true. For a DataFrame it is generally true when you are looking at a single dtype (a single-dtype frame, or a single-dtype view of a multi-dtype frame), but it is NEVER true for a multi-dtype selection, where you will always get a copy.
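The multi-dtype case is easy to demonstrate: with mixed dtypes, `.values` has to upcast to a common dtype and is therefore necessarily a copy, so writing into it never touches the frame. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': np.arange(3), 'f': np.arange(3.0)})

v = df.values  # mixed int64/float64 -> upcast to float64, always a copy
v[0, 0] = 99.0
print(df.iloc[0, 0])  # still 0: the write went to the copy
```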

This implementation is lazy about consolidation (e.g. it won't actually consolidate on a single insert like this, but will later on).

So we could call it a bug if it should always consolidate on inserts (but that can severely impact performance, which is why it works this way). Or we could say that what you are doing (ASSIGNING to .values directly) is technically not allowed (and in any event it is never good practice).

In [34]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])

In [35]: x._data
Out[35]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 5, 1), 5 x 5, dtype: int64

In [36]: x['f'] = 0

In [37]: x._data
Out[37]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e', u'f'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 5, 1), 5 x 5, dtype: int64
IntBlock: slice(5, 6, 1), 1 x 5, dtype: int64

In [38]: x.f.values.base
Out[38]: array([0, 0, 0, 0, 0])

In [39]: x.e.values.base
Out[39]: 
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
In [41]: x._consolidate_inplace()

In [42]: x._data
Out[42]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e', u'f'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 6, 1), 6 x 5, dtype: int64

In [43]: x.values.base
Out[43]: 
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

@shoyer
Member

shoyer commented Jun 3, 2015

@jreback the bug (IMO) is that x and x.f are inconsistent. Assigning to .values is a bit of a side point.

@jreback
Contributor

jreback commented Jun 3, 2015

@shoyer you'd have to prove it. How are they inconsistent?

@shoyer
Member

shoyer commented Jun 3, 2015

@jreback compare In [8] and In [9] below:


In [2]: import numpy as np

In [3]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])

In [4]: x['f'] = 0

In [5]: x.f.values[3]=1

In [6]: y = x.iloc[np.arange(2,len(x))]

In [7]: x.f.values[3]=2

In [8]: x
Out[8]:
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  1
4  0  0  0  0  0  0

In [9]: x.f
Out[9]:
0    0
1    0
2    0
3    2
4    0
Name: f, dtype: int64

@jreback
Contributor

jreback commented Jun 3, 2015

you are using .values, which violates guarantees

@jreback
Contributor

jreback commented Jun 3, 2015

In [27]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])

In [28]: x['f'] = 0

In [29]: x.ix[3,'f'] = 2

In [30]: x
Out[30]: 
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  2
4  0  0  0  0  0  0

In [31]: x['f']
Out[31]: 
0    0
1    0
2    0
3    2
4    0
Name: f, dtype: int64

@shoyer
Member

shoyer commented Jun 3, 2015

you are using .values, which violates guarantees

It should not be possible to put a dataframe into an inconsistent state, even if you're mis-using the pandas API.

@jreback
Contributor

jreback commented Jun 3, 2015

The state IS consistent. The issue is that the user is holding onto an incorrect reference. This is exactly SettingWithCopyError.

@TomAugspurger
Contributor

From the docs

If a DataFrame or Panel contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure.

So we'll need to update that or call this a bug. Perhaps we should remove that section either way and encourage users to use .loc for setting?
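The behavior the docs describe can be sketched as below. Note that whether the write is actually reflected depends on the pandas version: under copy-on-write, newer pandas returns a read-only array here, so the assignment raises instead.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((3, 3), dtype=np.int64), columns=list('abc'))
v = x.values  # single homogeneous block: may be a view into the frame

try:
    v[0, 0] = 9
    print(x.iloc[0, 0])  # 9 if .values exposed the block directly
except ValueError:
    # newer pandas with copy-on-write marks the returned array read-only
    print("read-only under copy-on-write")
```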

@shoyer
Member

shoyer commented Jun 4, 2015

@jreback what is the incorrect reference? Do we cache x.f instead of computing it on demand?

@jreback
Contributor

jreback commented Jun 4, 2015

ok, I guess it was an odd bug, see #10272. The code was basically wrong in 2 places, which meant it worked almost all of the time.

jreback added a commit to jreback/pandas that referenced this issue Jun 4, 2015
jreback added a commit to jreback/pandas that referenced this issue Jun 4, 2015
@jbezaire
Author

jbezaire commented Jun 4, 2015

Thank you for your prompt attention to this issue. Following your comments I'm dismayed to discover that assigning directly into x.f.values is considered "not good practice ever". I transitioned to using x.f.values exclusively after upgrading from 0.12 to 0.15 and encountering significant performance issues using loc (I am processing very large dataframes) and becoming very confused by issues surrounding chained indexing warnings. I thought that by using values directly I was reaching into the dataframe object and reading or inserting into the underlying numpy arrays directly (which is of course MUCH faster).

Inserting values into dataframes using loc can have very unintended consequences, and I have been bitten by it often enough that I've just stopped doing it. For example:

In [101]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'], index=['one','two','three','four','five'])

In [102]: print x
       a  b  c  d  e
one    0  0  0  0  0
two    0  0  0  0  0
three  0  0  0  0  0
four   0  0  0  0  0
five   0  0  0  0  0

In [103]: print x.dtypes
a    int64
b    int64
c    int64
d    int64
e    int64
dtype: object

In [104]: x.loc['three','e']=3.1

In [105]: print x
       a  b  c  d    e
one    0  0  0  0  0.0
two    0  0  0  0  0.0
three  0  0  0  0  3.1
four   0  0  0  0  0.0
five   0  0  0  0  0.0

In [106]: print x.dtypes
a    float64
b    float64
c    float64
d    float64
e    float64
dtype: object

Assigning a float value into column e here not only converts column e to a float64 column, but also converts all the other columns to floats, even though they were not touched by the assignment statement. The data I am processing often uses int64 values for enumerations (financial market order IDs), so casting to float64 doesn't just lose precision; it refers to the wrong order, which makes it dangerous and unacceptable.
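A defensive pattern for this (a sketch): convert only the target column to float explicitly before assigning, so that any dtype change is deliberate and confined to that column.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((5, 5), dtype=np.int64),
                 columns=list('abcde'),
                 index=['one', 'two', 'three', 'four', 'five'])

# Make the dtype change explicit and local to column 'e'.
x['e'] = x['e'].astype(np.float64)
x.loc['three', 'e'] = 3.1

print(x.dtypes['a'], x.dtypes['e'])  # int64 float64
```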

The other reason I have gravitated away from using loc, iloc, and ix is that I often want to index rows by position but columns by name. I don't believe any of the dataframe access APIs allow that combination (though I would love to be corrected). What would be the recommended way of setting a particular positional slice of column e to some value? I.e., what would be the good-practice version of:

x.e.values[2:5] = np.arange(2,5)

And again, I am drawn to this version because it is extremely fast, but now I am worried that it will not always produce the expected results.
Thank you for taking the time to discuss this.
Regards,
Jeff
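For what it's worth, one way to mix positional rows with a named column is to translate the column label into a position with `Index.get_loc` and then stay within `.iloc` (a sketch):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((5, 5), dtype=np.int64), columns=list('abcde'))

col = x.columns.get_loc('e')        # label -> positional index
x.iloc[2:5, col] = np.arange(2, 5)  # positional rows, resolved column
print(x['e'].tolist())  # [0, 0, 2, 3, 4]
```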

@jreback
Contributor

jreback commented Jun 4, 2015

this last is a bug; I haven't gotten around to fixing it. However, you should know that if you are doing performance-sensitive things, you shouldn't be doing ANY ASSIGNMENT AT ALL.

Simply construct a new column like you want and assign it all at once. Profile your code.
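The suggestion above, sketched: build the full column as a NumPy array first and assign it in one operation, rather than writing elements one at a time.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((5, 5), dtype=np.int64), columns=list('abcde'))

# Construct the new column in one go, then assign it once.
new_e = x['e'].to_numpy().copy()
new_e[2:5] = np.arange(2, 5)
x['e'] = new_e
print(x['e'].tolist())  # [0, 0, 2, 3, 4]
```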

@jreback
Contributor

jreback commented Jun 4, 2015

@jbezaire see #10280 for that other issue (there might be another one, but I can't find it right now). It's actually a bit non-trivial to fix.

jreback added a commit that referenced this issue Jun 5, 2015
BUG: bug in cache updating when consolidating #10264