BUG?: .sample() sometimes returning a view not a copy. #10736

nickeubank · 2015-08-03T18:06:24Z

I've realized the .sample() function is sometimes returning a view rather than a copy. This will cause the "setting on a view/copy" error:

df = pd.DataFrame({'col1':[5,6,7], 'col2':['a','b','c']})
a_sample = df.sample(1)
a_sample['col1'] = 'z'

I believe the behavior is the result of it using the .take() function, which is what the .sample() command uses to pull the sample the passed frame.

I'd like to suggest either:

replace .take() with .iloc()
change .take() to .take().copy() if that's faster (@shoyer suggested .take() would be a faster -- is .take().copy()?)

Thoughts?

The text was updated successfully, but these errors were encountered:

jreback · 2015-08-03T18:57:00Z

both .iloc and .take can be views on the underlying data, depending on how selections happen and if there are multi-dtypes involved. This is the point of the SettingWithCopyError.
Internally, you could always just do:

if result.is_view:
    result = result.copy()
return result

to guarantee that it will always be a copy

Of course the reason to do this is that people will simply try to mutate this and assume it should work.

nickeubank · 2015-08-03T18:57:34Z

Ok, patch coming!

shoyer · 2015-08-03T19:07:37Z

both .iloc and .take can be views on the underlying data, depending on how selections happen and if there are multi-dtypes involved.

@jreback This behavior is very surprising to me. I assumed that pandas follows the NumPy rules for copies/views with iloc/take. That is, take always returns a copy and iloc returns views only if all indexers are slices or integers.

If we have places where we are trying to convert array indexers into slices so we can do a view in iloc/take, those should be eliminated.

nickeubank · 2015-08-03T19:19:08Z

This may be a naive question, but I've been struggling a little with copy/view issues lately, so seems worth asking: is that kind of data-structure-dependent variation in behavior worth having? For example, it doesn't seem like the following commands should be different, but one (the later) gives the setting with copy error:

df = pd.DataFrame({'col1':[2,3,4], 'col2':[5,6,7]})
i = df.loc[2]
i['col1']=3


df = pd.DataFrame({'col1':[2,3,4], 'col2':['a','b','c']})
i = df.loc[2]
i['col1']=3

nickeubank · 2015-08-03T19:22:18Z

(after discovering this behavior, I've literally spent the last two days going back through a major project i'm working on, setting pd.set_option('mode.chained_assignment','raise'), and re-running days worth of code to make sure it's safe).

nickeubank · 2015-08-03T19:28:06Z

@jreback @shoyer This worth bringing up in separate issue?

jreback · 2015-08-03T19:30:59Z

@shoyer not sure what you are talking about. The semantics are the same as numpy.

If you have a single dtype, then take will give you a view if you have a slice (or sometimes when selecting by integer). Same for iloc. Multi-dtypes will always give you a copy.

This has always been true.

@nickeubank this is the entire point of the SettingWithCopyError. You can end up with what you think is a view. But then you decide to add a copy which is a different dtype. Then it becomes a copy.

jreback · 2015-08-03T19:40:23Z

@shoyer I see what you are talking about now. So .take does seem to give a copy always, while .iloc will give a view if possible. Very odd, I always though .take will give a view.

nickeubank · 2015-08-03T19:45:53Z

@jreback I guess my question is more: is this really desirable as a property from a user-design perspective? The SettingWithCopyError does indeed flag when this happens, and missing this is my own error.

But from a general design perspective, is it ideal to expect users to always be mindful of the dtypes of columns they aren't directly manipulating in their DataFrames to know what pandas is doing / are we comfortable with a non-exception warning being the only safeguard? Surely we've all lost track of warning outputs in a stream of output, no?

(I'm thinking about my colleagues in the social sciences -- this just seems the kind of subtle and potentially unnecessarily complicated behavior that will really throw them for a loop)

jreback · 2015-08-03T19:53:14Z

@nickeubank this is the fundamental design of numpy. Its less than ideal but generally not a problem. The warning never used to exists and we would get questions, like: why does this not set?

e.g.

df['foo']['bar'] = 10.

When it would work if it was a view, but the second someone added another dtype it would then not work.

The thing is you HAVE/WANT to always use views as they are extremely cheap. Otherwise you end up copying everywhere which is bad.

jreback added API Design Numeric Operations Arithmetic, Comparison, and Logical operations labels Aug 3, 2015

nickeubank mentioned this issue Aug 3, 2015

amend sample to return copy and align weight axis #10738

Merged

jreback added this to the 0.17.0 milestone Aug 3, 2015

jreback closed this as completed in #10738 Aug 19, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG?: .sample() sometimes returning a view not a copy. #10736

BUG?: .sample() sometimes returning a view not a copy. #10736

nickeubank commented Aug 3, 2015

jreback commented Aug 3, 2015

nickeubank commented Aug 3, 2015

shoyer commented Aug 3, 2015

nickeubank commented Aug 3, 2015

nickeubank commented Aug 3, 2015

nickeubank commented Aug 3, 2015

jreback commented Aug 3, 2015

jreback commented Aug 3, 2015

nickeubank commented Aug 3, 2015

jreback commented Aug 3, 2015

BUG?: .sample() sometimes returning a view not a copy. #10736

BUG?: .sample() sometimes returning a view not a copy. #10736

Comments

nickeubank commented Aug 3, 2015

jreback commented Aug 3, 2015

nickeubank commented Aug 3, 2015

shoyer commented Aug 3, 2015

nickeubank commented Aug 3, 2015

nickeubank commented Aug 3, 2015

nickeubank commented Aug 3, 2015

jreback commented Aug 3, 2015

jreback commented Aug 3, 2015

nickeubank commented Aug 3, 2015

jreback commented Aug 3, 2015