view into modified dataframe of ints causes subsequent set_value to not work properly #10264
Comments
I can reproduce this on master. Definitely a bug -- thanks for the report! |
I know this looks like a bug, and might be. However, you are violating the guarantees. The For a This implementation is lazy about consolidation (e.g. it won't actually consolidate on a single insert like this, but will later on). So we could call it a bug if it should always consolidate on inserts (but this can severely impact performance, that's why it's like this). Or say that what you are doing (e.g. by ASSIGNING to
|
@jreback the bug (IMO) is that |
@shoyer you'd have to prove it. How are they inconsistent? |
@jreback compare lines 8 and 9 below:
|
you are using |
|
It should not be possible to put a dataframe into an inconsistent state, even if you're mis-using the pandas API. |
The state IS consistent. The issue is that the user is holding onto an incorrect reference. This is exactly |
From the docs
So we'll need to update that or call this a bug. Perhaps we should remove that section either way and encourage users to use |
@jreback what is the incorrect reference? Do we cache |
OK, I guess this was an odd bug; see #10272. The code was basically wrong in 2 places, which means it worked almost all of the time. |
Thank you for your prompt attention to this issue. Following your comments I'm dismayed to discover that assigning directly into x.f.values is considered "not good practice ever". I transitioned to using x.f.values exclusively after upgrading from 0.12 to 0.15 and encountering significant performance issues using loc (I am processing very large dataframes) and becoming very confused by issues surrounding chained indexing warnings. I thought that by using values directly I was reaching into the dataframe object and reading or inserting into the underlying numpy arrays directly (which is of course MUCH faster). Inserting values into dataframes using loc can have very unintended consequences, and I have been bitten by it often enough that I've just stopped doing it. For example:
Assigning a float value into column f here not only converts column f to a float64 column, but also converts all the other columns to floats as well, even though they were not touched by the assignment statement. The data I am processing often uses int64 values for enumerations (financial market order IDs), so casting to float64 doesn't just lose precision: it refers to the wrong order, and so is dangerous and unacceptable. The other reason I have gravitated away from using loc, iloc, and ix is that I often want to index into rows by position, but columns by name. I don't believe any of the dataframe access APIs allow that combination (though I would love to be corrected). What would be the recommended way of setting a particular positional slice of column e to some value? I.e., what would be the good-practice version of: x.e.values[2:5] = np.arange(2,5)? Again, I am drawn to this version because it is extremely fast, but now I am worried that it will not always produce the expected results. |
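One way to get "rows by position, column by name" is to translate the column label into its integer position with `Index.get_loc` and then index both axes with `iloc`. The sketch below is not taken from the thread: the frame, its column names, and its sizes are made up for illustration, and it assumes the column labels are unique so that `get_loc` returns a single integer.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; the columns and sizes are assumptions for this sketch.
x = pd.DataFrame({
    "e": np.zeros(6, dtype="int64"),
    "f": np.arange(6, dtype="int64"),
})

# Translate the column label to its integer position (assumes unique labels),
# then use iloc so that both axes are indexed by position.
col = x.columns.get_loc("e")
x.iloc[2:5, col] = np.arange(2, 5)

print(x["e"].tolist())
print(x.dtypes)
```

Because the assigned values are integers, both columns should keep their int64 dtype. Whether this is fast enough for very large frames is something to profile rather than assume.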
This last is a bug; I haven't gotten around to fixing it. However, you should know that if you are doing performance-sensitive things, you shouldn't be doing ANY ASSIGNMENT AT ALL. Simply construct a new column like you want and assign it all at once. Profile your code. |
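A minimal sketch of the "construct a new column and assign it all at once" advice, under assumptions not stated in the thread: the frame and the name `new_e` are hypothetical, and `to_numpy()` is only available in pandas releases newer than the 0.15 version discussed here. The idea is to do all element-level writes on a plain NumPy array that you own, then hand the finished array back to the frame in one assignment.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; not the data from the original report.
x = pd.DataFrame({"e": np.zeros(6, dtype="int64")})

# Work on an independent copy of the column's data, never on a view
# into the frame's internal storage.
new_e = x["e"].to_numpy().copy()
new_e[2:5] = np.arange(2, 5)

# Replace the whole column in a single assignment.
x["e"] = new_e

print(x["e"].tolist())
```

This avoids depending on whether `.values` returns a view or a copy, at the cost of one extra array copy per column.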
BUG: bug in cache updating when consolidating #10264
The following code snippet produces unexpected output:
The expected output is:
The behavior only manifests if the dataframe has had a column added to it (f in this case), if the columns are ints, and if the view is constructed with certain kinds of indexing (y=x.iloc[2:] works fine for instance).
Here is my version info:
INSTALLED VERSIONS
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.4.2.el7.jump.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.15.2.dev
nose: 1.3.4
Cython: 0.20.2
numpy: 1.9.0
scipy: 0.14.1rc1.dev-Unknown
statsmodels: None
IPython: 3.1.0
sphinx: 1.3b1
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.4.3
sqlalchemy: None
pymysql: 0.6.2.None
psycopg2: 2.5.4 (dt dec pq3 ext)