view into modified dataframe of ints causes subsequent set_value to not work properly #10264

Closed
jbezaire opened this issue Jun 3, 2015 · 15 comments · Fixed by #10272
Labels: Bug, Indexing (Related to indexing on series/frames, not to indexes themselves)

Comments

@jbezaire

jbezaire commented Jun 3, 2015

The following code snippet produces unexpected output:

import pandas as pd
import numpy as np
x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])
x['f'] = 0
x.f.values[3]=1
y = x.iloc[np.arange(2,len(x))]
x.f.values[3]=2
print x
print x.f.values
Out[130]: 
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  1
4  0  0  0  0  0  0
[0 0 0 2 0]

The expected output is:

Out[131]: 
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  2
4  0  0  0  0  0  0
[0 0 0 2 0]

The behavior only manifests if the dataframe has had a column added to it (f in this case), if the columns are ints, and if the view is constructed with certain kinds of indexing (y=x.iloc[2:] works fine for instance).
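Part of why the indexer kind matters: at the NumPy level, basic slicing returns a view that shares memory with the original array, while fancy (array) indexing returns an independent copy. A minimal illustration, independent of pandas:

```python
import numpy as np

a = np.zeros(5, dtype=np.int64)
sliced = a[2:]              # basic slice: a view sharing a's memory
fancy = a[np.arange(2, 5)]  # fancy indexing: an independent copy

a[3] = 7
print(sliced)  # [0 7 0] -- the view sees the change
print(fancy)   # [0 0 0] -- the copy does not
```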

Here is my version info:

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.4.2.el7.jump.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2.dev
nose: 1.3.4
Cython: 0.20.2
numpy: 1.9.0
scipy: 0.14.1rc1.dev-Unknown
statsmodels: None
IPython: 3.1.0
sphinx: 1.3b1
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.4.3
sqlalchemy: None
pymysql: 0.6.2.None
psycopg2: 2.5.4 (dt dec pq3 ext)

@shoyer added the Bug and Indexing (Related to indexing on series/frames, not to indexes themselves) labels Jun 3, 2015
@shoyer
Member

shoyer commented Jun 3, 2015

I can reproduce this on master. Definitely a bug -- thanks for the report!

@jreback
Contributor

jreback commented Jun 3, 2015

I know this looks like a bug, and it might be. However, you are violating the guarantees: the .values object does not guarantee that you will get a referenceable consolidated view of a DataFrame.

For a Series this is always true. For a DataFrame it is generally true when you are looking at a single dtype (a single-dtype frame, or a single-dtype view of a multi-dtype frame), but it is NEVER true for a multi-dtype selection, where you will always get a copy.
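The multi-dtype case is easy to demonstrate: with mixed dtypes, `.values` has to upcast to a common dtype and is therefore necessarily a copy, so writing into it never touches the frame. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': np.arange(3), 'f': np.arange(3.0)})

v = df.values  # mixed int64/float64 -> upcast to float64, always a copy
v[0, 0] = 99.0
print(df.iloc[0, 0])  # still 0: the write went to the copy
```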

This implementation is lazy about consolidation (e.g. it won't actually consolidate on a single insert like this, but will later on).

So we could call it a bug if it should always consolidate on inserts (but that can severely impact performance, which is why it works this way). Or we could say that what you are doing (ASSIGNING to .values directly) is technically not allowed (and in any event it is never good practice).

In [34]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])

In [35]: x._data
Out[35]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 5, 1), 5 x 5, dtype: int64

In [36]: x['f'] = 0

In [37]: x._data
Out[37]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e', u'f'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 5, 1), 5 x 5, dtype: int64
IntBlock: slice(5, 6, 1), 1 x 5, dtype: int64

In [38]: x.f.values.base
Out[38]: array([0, 0, 0, 0, 0])

In [39]: x.e.values.base
Out[39]: 
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
In [41]: x._consolidate_inplace()

In [42]: x._data
Out[42]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e', u'f'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 6, 1), 6 x 5, dtype: int64

In [43]: x.values.base
Out[43]: 
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

@shoyer
Member

shoyer commented Jun 3, 2015

@jreback the bug (IMO) is that x and x.f are inconsistent. Assigning to .values is a bit of a side point.

@jreback
Contributor

jreback commented Jun 3, 2015

@shoyer you'd have to prove it. How are they inconsistent?

@shoyer
Member

shoyer commented Jun 3, 2015

@jreback compare In [8] and In [9] below:


In [2]: import numpy as np

In [3]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])

In [4]: x['f'] = 0

In [5]: x.f.values[3]=1

In [6]: y = x.iloc[np.arange(2,len(x))]

In [7]: x.f.values[3]=2

In [8]: x
Out[8]:
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  1
4  0  0  0  0  0  0

In [9]: x.f
Out[9]:
0    0
1    0
2    0
3    2
4    0
Name: f, dtype: int64

@jreback
Contributor

jreback commented Jun 3, 2015

you are using .values, which violates guarantees

@jreback
Contributor

jreback commented Jun 3, 2015

In [27]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'])

In [28]: x['f'] = 0

In [29]: x.ix[3,'f'] = 2

In [30]: x
Out[30]: 
   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  2
4  0  0  0  0  0  0

In [31]: x['f']
Out[31]: 
0    0
1    0
2    0
3    2
4    0
Name: f, dtype: int64

@shoyer
Member

shoyer commented Jun 3, 2015

you are using .values, which violates guarantees

It should not be possible to put a dataframe into an inconsistent state, even if you're mis-using the pandas API.

@jreback
Contributor

jreback commented Jun 3, 2015

The state IS consistent. The issue is that the user is holding onto an incorrect reference. This is exactly SettingWithCopyError.

@TomAugspurger
Contributor

From the docs

If a DataFrame or Panel contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure.

So we'll need to update that or call this a bug. Perhaps we should remove that section either way and encourage users to use .loc for setting?
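The behavior the docs describe can be sketched as below. Note that whether the write is actually reflected depends on the pandas version: under copy-on-write, newer pandas returns a read-only array here, so the assignment raises instead.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((3, 3), dtype=np.int64), columns=list('abc'))
v = x.values  # single homogeneous block: may be a view into the frame

try:
    v[0, 0] = 9
    print(x.iloc[0, 0])  # 9 if .values exposed the block directly
except ValueError:
    # newer pandas with copy-on-write marks the returned array read-only
    print("read-only under copy-on-write")
```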

@shoyer
Member

shoyer commented Jun 4, 2015

@jreback what is the incorrect reference? Do we cache x.f instead of computing it on demand?

@jreback
Contributor

jreback commented Jun 4, 2015

ok, I guess it was an odd bug, see #10272. The code was basically wrong in 2 places, which meant it worked almost all of the time.

jreback added a commit to jreback/pandas that referenced this issue Jun 4, 2015
jreback added a commit to jreback/pandas that referenced this issue Jun 4, 2015
@jbezaire
Author

jbezaire commented Jun 4, 2015

Thank you for your prompt attention to this issue. Following your comments I'm dismayed to discover that assigning directly into x.f.values is considered "not good practice ever". I transitioned to using x.f.values exclusively after upgrading from 0.12 to 0.15 and encountering significant performance issues using loc (I am processing very large dataframes) and becoming very confused by issues surrounding chained indexing warnings. I thought that by using values directly I was reaching into the dataframe object and reading or inserting into the underlying numpy arrays directly (which is of course MUCH faster).

Inserting values into dataframes using loc can have very unintended consequences, and I have been bitten by it often enough that I've just stopped doing it. For example:

In [101]: x = pd.DataFrame(data=np.zeros((5,5),dtype=np.int),columns=['a','b','c','d','e'], index=['one','two','three','four','five'])

In [102]: print x
       a  b  c  d  e
one    0  0  0  0  0
two    0  0  0  0  0
three  0  0  0  0  0
four   0  0  0  0  0
five   0  0  0  0  0

In [103]: print x.dtypes
a    int64
b    int64
c    int64
d    int64
e    int64
dtype: object

In [104]: x.loc['three','e']=3.1

In [105]: print x
       a  b  c  d    e
one    0  0  0  0  0.0
two    0  0  0  0  0.0
three  0  0  0  0  3.1
four   0  0  0  0  0.0
five   0  0  0  0  0.0

In [106]: print x.dtypes
a    float64
b    float64
c    float64
d    float64
e    float64
dtype: object

Assigning a float value into column e here not only converts column e to a float64 column, but also converts all the other columns to floats, even though they were not touched by the assignment statement. The data I am processing often uses int64 values for enumerations (financial market order IDs), so casting to float64 doesn't just lose precision; it refers to the wrong order, which makes it dangerous and unacceptable.
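A defensive pattern for this (a sketch): convert only the target column to float explicitly before assigning, so that any dtype change is deliberate and confined to that column.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((5, 5), dtype=np.int64),
                 columns=list('abcde'),
                 index=['one', 'two', 'three', 'four', 'five'])

# Make the dtype change explicit and local to column 'e'.
x['e'] = x['e'].astype(np.float64)
x.loc['three', 'e'] = 3.1

print(x.dtypes['a'], x.dtypes['e'])  # int64 float64
```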

The other reason I have gravitated away from using loc, iloc, and ix is that I often want to index rows by position but columns by name. I don't believe any of the dataframe access APIs allow that combination (though I would love to be corrected). What would be the recommended way of setting a particular positional slice of column e to some value? I.e., what would be the good-practice version of:

x.e.values[2:5] = np.arange(2,5)

And again, I am drawn to this version because it is extremely fast, but now I am worried that it will not always produce the expected results.
Thank you for taking the time to discuss this.
Regards,
Jeff
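For what it's worth, one way to mix positional rows with a named column is to translate the column label into a position with `Index.get_loc` and then stay within `.iloc` (a sketch):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((5, 5), dtype=np.int64), columns=list('abcde'))

col = x.columns.get_loc('e')        # label -> positional index
x.iloc[2:5, col] = np.arange(2, 5)  # positional rows, resolved column
print(x['e'].tolist())  # [0, 0, 2, 3, 4]
```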

@jreback
Contributor

jreback commented Jun 4, 2015

this last is a bug; I haven't gotten around to fixing it. However, you should know that if you are doing performance-sensitive things, you shouldn't be doing ANY ASSIGNMENT AT ALL.

Simply construct a new column like you want and assign it all at once. Profile your code.
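The suggestion above, sketched: build the full column as a NumPy array first and assign it in one operation, rather than writing elements one at a time.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.zeros((5, 5), dtype=np.int64), columns=list('abcde'))

# Construct the new column in one go, then assign it once.
new_e = x['e'].to_numpy().copy()
new_e[2:5] = np.arange(2, 5)
x['e'] = new_e
print(x['e'].tolist())  # [0, 0, 2, 3, 4]
```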

@jreback
Contributor

jreback commented Jun 4, 2015

@jbezaire see #10280 for that other issue (there might be another one, but I can't find it right now). It's actually a bit non-trivial to fix.

jreback added a commit that referenced this issue Jun 5, 2015
BUG: bug in cache updating when consolidating #10264