Skip to content

BUG?: .sample() sometimes returning a view not a copy. #10736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nickeubank opened this issue Aug 3, 2015 · 10 comments · Fixed by #10738
Closed

BUG?: .sample() sometimes returning a view not a copy. #10736

nickeubank opened this issue Aug 3, 2015 · 10 comments · Fixed by #10738
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@nickeubank
Copy link
Contributor

I've realized the .sample() function is sometimes returning a view rather than a copy. This will cause the "setting on a view/copy" error:

df = pd.DataFrame({'col1':[5,6,7], 'col2':['a','b','c']})
a_sample = df.sample(1)
a_sample['col1'] = 'z'

I believe the behavior is the result of it using the .take() function, which is what the .sample() command uses to pull the sample the passed frame.

I'd like to suggest either:

  • replace .take() with .iloc()
  • change .take() to .take().copy() if that's faster (@shoyer suggested .take() would be a faster -- is .take().copy()?)

Thoughts?

@jreback
Copy link
Contributor

jreback commented Aug 3, 2015

both .iloc and .take can be views on the underlying data, depending on how selections happen and if there are multi-dtypes involved. This is the point of the SettingWithCopyError.
Internally, you could always just do:

if result.is_view:
    result = result.copy()
return result

to guarantee that it will always be a copy

Of course the reason to do this is that people will simply try to mutate this and assume it should work.

@jreback jreback added API Design Numeric Operations Arithmetic, Comparison, and Logical operations labels Aug 3, 2015
@nickeubank
Copy link
Contributor Author

Ok, patch coming!

@shoyer
Copy link
Member

shoyer commented Aug 3, 2015

both .iloc and .take can be views on the underlying data, depending on how selections happen and if there are multi-dtypes involved.

@jreback This behavior is very surprising to me. I assumed that pandas follows the NumPy rules for copies/views with iloc/take. That is, take always returns a copy and iloc returns views only if all indexers are slices or integers.

If we have places where we are trying to convert array indexers into slices so we can do a view in iloc/take, those should be eliminated.

@nickeubank
Copy link
Contributor Author

This may be a naive question, but I've been struggling a little with copy/view issues lately, so seems worth asking: is that kind of data-structure-dependent variation in behavior worth having? For example, it doesn't seem like the following commands should be different, but one (the later) gives the setting with copy error:

df = pd.DataFrame({'col1':[2,3,4], 'col2':[5,6,7]})
i = df.loc[2]
i['col1']=3


df = pd.DataFrame({'col1':[2,3,4], 'col2':['a','b','c']})
i = df.loc[2]
i['col1']=3

@nickeubank
Copy link
Contributor Author

(after discovering this behavior, I've literally spent the last two days going back through a major project i'm working on, setting pd.set_option('mode.chained_assignment','raise'), and re-running days worth of code to make sure it's safe).

@nickeubank
Copy link
Contributor Author

@jreback @shoyer This worth bringing up in separate issue?

@jreback
Copy link
Contributor

jreback commented Aug 3, 2015

@shoyer not sure what you are talking about. The semantics are the same as numpy.

If you have a single dtype, then take will give you a view if you have a slice (or sometimes when selecting by integer). Same for iloc. Multi-dtypes will always give you a copy.

This has always been true.

@nickeubank this is the entire point of the SettingWithCopyError. You can end up with what you think is a view. But then you decide to add a copy which is a different dtype. Then it becomes a copy.

@jreback
Copy link
Contributor

jreback commented Aug 3, 2015

@shoyer I see what you are talking about now. So .take does seem to give a copy always, while .iloc will give a view if possible. Very odd, I always though .take will give a view.

@nickeubank
Copy link
Contributor Author

@jreback I guess my question is more: is this really desirable as a property from a user-design perspective? The SettingWithCopyError does indeed flag when this happens, and missing this is my own error.

But from a general design perspective, is it ideal to expect users to always be mindful of the dtypes of columns they aren't directly manipulating in their DataFrames to know what pandas is doing / are we comfortable with a non-exception warning being the only safeguard? Surely we've all lost track of warning outputs in a stream of output, no?

(I'm thinking about my colleagues in the social sciences -- this just seems the kind of subtle and potentially unnecessarily complicated behavior that will really throw them for a loop)

@jreback
Copy link
Contributor

jreback commented Aug 3, 2015

@nickeubank this is the fundamental design of numpy. Its less than ideal but generally not a problem. The warning never used to exists and we would get questions, like: why does this not set?

e.g.

df['foo']['bar'] = 10.

When it would work if it was a view, but the second someone added another dtype it would then not work.

The thing is you HAVE/WANT to always use views as they are extremely cheap. Otherwise you end up copying everywhere which is bad.

@jreback jreback added this to the 0.17.0 milestone Aug 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants