Consistency between indexing by labels and "exporting" data #2900
Comments
You do realize that ``as_matrix()
The issue is that the index isn't explicit. I expect that any changes to address this wouldn't save typing or add convenience, but instead would make it harder to write incorrect code regarding the index. For your example and my example below, there is nothing explicit about the importance of the index. It just doesn't come up in the code for many situations. Again from the documentation: "Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community." Aside from being a debate over which approach is better, I think it also reflects a big source of confusion for users. DataFrames work mostly by label, but then as_matrix and similar methods give access to things that work by location. Making the index aspect of pandas more visible would hopefully make it less confusing.
Another idea is to require an explicit handling of the index in the DataFrame constructor. Even if users can still call with index=None, at least they've explicitly acknowledged the index exists. That's the same logic behind my earlier suggestion where as_matrix would return a tuple. Rather than implicitly requiring that the user go get the index themselves, explicitly provide it and let them discard it if they actually don't need it for whatever reason.
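A minimal sketch of the explicit-index idea, written as a wrapper around the existing constructor rather than a change to pandas itself. The name `make_frame` is hypothetical; the only assumption is that a keyword-only parameter with no default forces the caller to mention the index.

```python
import pandas as pd


def make_frame(data, *, index):
    """Hypothetical wrapper: `index` is keyword-only and has no default,
    so every call site has to say what it wants for the index."""
    return pd.DataFrame(data, index=index)


# The caller explicitly asks for the automatic integer index.
auto = make_frame({"a": [1, 2, 3]}, index=None)

# The caller supplies its own labels.
labelled = make_frame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# Leaving the index out is an error instead of a silent default.
try:
    make_frame({"a": [1, 2, 3]})
except TypeError as exc:
    missing_index_error = exc
```

Passing `index=None` keeps today's behavior, so the cost is a few extra characters per call in exchange for the index being visible in the code.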
I am still puzzled why you are doing you can just pass the frame directly to whatever you are doing
that already doesn't work by just passing a frame.
you need to use try defining this function like this
As indicated in my example originally, those functions don't and shouldn't have any awareness of Pandas. I have plenty of code that has nothing to do with Pandas, and I'm not going to break everything else that uses it. Anyways, this is missing the point. Explicit vs implicit.
This is quite explicit, it is not clear what you mean when you do are you referencing a column by location (I think that is what you are doing), but if you want these function to behave like numpy arrays then you basically have 2 choices:
If you pass around a pandas object, then it acts like a pandas object, not like something else. How exactly do you find this not explicit?
naive_AB_checker expects a 2-dimensional matrix with 2 columns. The only reason I ever passed it a frame was because you asked about when just passing a frame wouldn't work... .values makes more sense than .as_matrix. So another way to look at this is that as_matrix is a poor name. And the broader problem is that "index" is rarely explicit in any of the code. If you look at my first example, and try to imagine that you didn't already understand how pandas works... isn't the result unexpected? And isn't it dangerous, because it often doesn't produce an error (and your data is just messed up silently)? Note that index doesn't appear once in the code. And .ix might actually be assumed to be some abbreviation for index, even though that isn't how it actually works.
Ah, to address your question more directly, the non-explicit part is how Pandas acts. Is it location indexed or not? Without reading the manual, I say the logical assumption is that it is location indexed except in special situations (which is wrong, and causes silent bugs in user code).
this is a complicated answer - partial discussion below. pandas tries to be smart about this: it is mainly label based, but does support integer indexing/slicing. indexing is inherently built into all objects; if you don't need it, don't use it, and just deal with numpy.
pandas offers a lot of power, so it behooves the user to read up on how to do things. with great power comes great responsibility
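To make the label-based versus integer-based distinction concrete, here is a small sketch. It uses `.loc` and `.iloc` from current pandas, which is an assumption on my part since the thread itself talks about `.ix`; the Series and its labels are made up for illustration.

```python
import pandas as pd

# Integer labels that deliberately do not match the row positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the element at position 0
from_numpy = s.values[0] # the plain array has positions only

print(by_label, by_position, from_numpy)  # 20 10 10
```

With the default index the three lookups agree, which is why the difference is easy to miss until rows are dropped or reordered.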
The documentation is really long and detailed, has good organization and plenty of examples. The downside is you can't expect any user to read the whole thing and get every nuance. Most will at best skim through it and pick out the parts that seem relevant to their current efforts. Important things like indexing behavior need to be made forefront, either by adjusting the documentation or even better making it more explicit in the code itself. I often have no use for an index, so it was easy to get started and skim through the documentation while entirely missing the fact that the index matters and is being used even if I don't touch it. More zen... "Readability counts." I think my idea for the constructor requiring an explicit index gets at the issue better. Given that the indexing scheme is a confusing aspect for new users, just make the index a more prominent feature wherever possible.
and how exactly would you do that? require passing of the index to the DataFrame constructor? that seems cumbersome, much better to have the constructor figure it out from the passed (or not passed). not sure how any of the zen points you mentioned actually apply here...except: "There should be one-- and preferably only one --obvious way to do it." this is true, but if you force users to be explicit ALL of the time then it's worth discarding
Currently, the default for index is None. I'm saying that you could just not give it a default value. Let the user specify None explicitly to get the current behavior of an automatic index. Or give it a default value of [] which errors in the constructor.
this is not the philosophy of pandas; if you want to do that, you are free to use numpy
?? The philosophy of pandas is to implicitly deal with the index behavior even when it confuses users? I think we must have misunderstood each other at some point. I'm fine with pandas' behavior. I like pandas (and that's the only reason I'm spending my own time giving feedback that will hopefully improve pandas). I think that the index behavior should be made more explicit. What's wrong with the small change to the constructor? And how did it imply that I wanted to change the philosophy of pandas in any way?
we are having an academic discussion :) I think forcing users to explicitly pass an index is a burden (even if None). Often your data has enough info to construct an index if you don't pass one. This prevents errors. If you then want to explicitly change the index, do so. the following yield exactly the same, but I think second is clearly undesirable,
Fair enough then. I think that 3's small burden is well worth the benefit to readability, and how clear things are to new users.
@wesm care to chime in?
My view is that how you interact with raw data inside a DataFrame (such as that obtained from
@kijoshua You can be explicit about passing I certainly don't relish the thought of going back through my code to add
@kijoshua is this closable?
I think this falls under the Zen of Python, "explicit is better than implicit".
Documentation says "In pandas, our general viewpoint is that labels matter more than integer locations". But I've been caught a few times when trying to carry over this viewpoint to things that access a DataFrame via .as_matrix(). as_matrix returns the data but without the index labels, which requires that the integer locations be taken into account if you want to relate future results based on that matrix form back to the original DataFrame. The all-important index labels are treated implicitly by as_matrix.
The "as_" prefix suggests to me that I should get a complete view of my data, just in a different format. Since labels matter more than locations, it feels out of place for labels to be left out. This makes it easy for beginners to forget about the labels and accidentally try to apply location-based slices via .ix[] based on location indices calculated from the matrix form. This is quite easy to do if you never cared about your index labels in the first place. Since the index is created by default if not specified, DataFrames act just like they are location sliced until you remove a row or shuffle things around.
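A minimal sketch of the pitfall described above. It is written against current pandas, so it uses `.values`, `.loc` and `.iloc` rather than `.as_matrix()` and `.ix` (both were removed in later pandas releases); the data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10.0, 20.0, 30.0, 40.0],
                   "b": [1.0, 2.0, 3.0, 4.0]})

# Remove the first row. The automatic labels are now 1, 2, 3,
# so labels and positions no longer coincide.
df = df.drop(0)

# Export to a plain array; the index labels are not carried along.
matrix = df.values

# Compute something on the array and get back a *position*.
pos = int(np.argmax(matrix[:, 0]))  # position of the largest "a"

by_position = df.iloc[pos]  # correct row: a == 40.0
by_label = df.loc[pos]      # treats pos as a label: a == 30.0, and no error

print(pos, by_position["a"], by_label["a"])  # 2 40.0 30.0
```

Before the `drop`, both lookups return the same row, so the mistake only shows up once the index stops matching the row positions.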
It might be worth changing as_matrix to return the index labels somehow (perhaps with a flag to not do so), either as a tuple (df.index, data) or by pretending that the indices are the first column. This way, users would be required to consider the index labels as an important aspect of their data. It isn't just something Pandas does when printing out rows of the DataFrame, but an actual part of your data.
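A rough sketch of the tuple variant, written as a standalone helper rather than a change to pandas. The name `as_matrix_with_index` is hypothetical, and `.values` and `.loc` stand in for the `as_matrix` and `.ix` calls discussed in this issue.

```python
import pandas as pd


def as_matrix_with_index(df):
    """Hypothetical helper: return (index labels, 2-D array) together."""
    return df.index, df.values


df = pd.DataFrame({"a": [10.0, 20.0, 30.0]}, index=["x", "y", "z"])
labels, data = as_matrix_with_index(df)

# Work on the plain array, then map the resulting position back to a label.
pos = int(data[:, 0].argmax())
label = labels[pos]
row = df.loc[label]

print(label, row["a"])  # z 30.0
```

Callers that really do not need the labels can discard the first element, but they have to do so on purpose.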