Consistency between indexing by labels and "exporting" data #2900

Closed
kijoshua opened this issue Feb 19, 2013 · 21 comments

Comments

@kijoshua

I think this falls under the Zen of Python, "explicit is better than implicit".

Documentation says "In pandas, our general viewpoint is that labels matter more than integer locations". But I've been caught a few times when trying to carry over this viewpoint to things that access a DataFrame via .as_matrix(). as_matrix returns the data but without the index labels, which requires that the integer locations be taken into account if you want to relate future results based on that matrix form back to the original DataFrame. The all-important index labels are treated implicitly by as_matrix.

The "as_" prefix suggests to me that I should get a complete view of my data, just in a different format. Since labels matter more than locations, it feels out of place for labels to be left out. This makes it easy for beginners to forget about the labels and accidentally try to apply location-based slices via .ix[] based on location indices calculated from the matrix form. This is quite easy to do if you never cared about your index labels in the first place. Since the index is created by default if not specified, DataFrames act just like they are location sliced until you remove a row or shuffle things around.

It might be worth changing as_matrix to return the index labels somehow (perhaps with a flag to not do so), either as a tuple (df.index, data) or by pretending that the indices are the first column. This way, users would be required to consider the index labels as an important aspect of their data. The index isn't just something Pandas shows when printing out rows of the DataFrame, but an actual part of your data.
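The point about the default index can be made concrete with a small sketch. It assumes a pandas release that provides `.loc`, `.iloc` and `DataFrame.to_numpy()` (these postdate this thread, which uses `.ix` and `as_matrix`), and the data values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.9, 0.1, 0.5, 0.2], "B": [0.3, 0.4, 0.2, 0.8]})

# With the automatically created index, labels and positions coincide.
labels_before = list(df.index)

# Boolean filtering keeps the original labels, so they stop matching positions.
filtered = df[df["A"] > 0.4]
labels_after = list(filtered.index)

# The raw array carries no labels: its second row is the row labelled 2.
raw = filtered.to_numpy()
same_row = (raw[1] == filtered.loc[2].to_numpy()).all()

# Position 1 and label 2 now refer to the same row; label 1 no longer exists.
by_position = filtered.iloc[1]
label_one_present = 1 in filtered.index
```

Until the filtering step, nothing in the code would reveal whether a lookup is by label or by position, which is the situation described above.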

@jreback
Contributor

jreback commented Feb 19, 2013

You do realize that `as_matrix()` and the syntactic sugar `.values`
just return a numpy 2d array of the data? Is there a situation where you are trying to use
the ndarray directly rather than a DataFrame?

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.rand(10,2),columns=['A','B'])

In [4]: df
Out[4]: 
          A         B
0  0.898572  0.320656
1  0.078431  0.340781
2  0.451752  0.282544
3  0.545498  0.888366
4  0.471818  0.007790
5  0.548473  0.855393
6  0.718170  0.386948
7  0.371798  0.635438
8  0.357832  0.194048
9  0.345692  0.546124

In [5]: df.values
Out[5]: 
array([[ 0.8985716 ,  0.32065559],
       [ 0.07843058,  0.3407808 ],
       [ 0.45175201,  0.28254406],
       [ 0.54549838,  0.88836635],
       [ 0.47181816,  0.00779049],
       [ 0.54847328,  0.85539265],
       [ 0.71816954,  0.3869482 ],
       [ 0.37179806,  0.63543765],
       [ 0.35783155,  0.19404793],
       [ 0.34569207,  0.54612435]])

In [6]: df.as_matrix()
Out[6]: 
array([[ 0.8985716 ,  0.32065559],
       [ 0.07843058,  0.3407808 ],
       [ 0.45175201,  0.28254406],
       [ 0.54549838,  0.88836635],
       [ 0.47181816,  0.00779049],
       [ 0.54847328,  0.85539265],
       [ 0.71816954,  0.3869482 ],
       [ 0.37179806,  0.63543765],
       [ 0.35783155,  0.19404793],
       [ 0.34569207,  0.54612435]])

@kijoshua
Author

The issue is that the index isn't explicit. I expect that any changes to address this wouldn't save typing or add convenience, but instead would make it harder to write incorrect code regarding the index.

For your example and my example below, there is nothing explicit about the importance of the index. It just doesn't come up in the code for many situations. Again from the documentation: "Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community." Aside from being a debate over which approach is better, I think it also reflects a big source of confusion for users. DataFrames work mostly by label, but then as_matrix and the like give access to things that work by location. Making the index aspect of pandas more visible would hopefully make it less confusing.

import numpy as np
import pandas as pd
from matplotlib.mlab import find

# some complicated external method that has more general uses
# is (and should be) oblivious to Pandas
def naive_A_checker(As):
    return find(As>0.4)

# some complicated external method that has more general uses
# is (and should be) oblivious to Pandas
def naive_AB_checker(ABs):
    return find(ABs[:,0]>ABs[:,1])


df = pd.DataFrame(np.random.rand(10,3),columns=['A','B','C'])

good_A = naive_A_checker(df.A)
df = df.ix[good_A,:]
# this first filtering will work correctly, but only because Pandas supplies an index
# automatically that happens to match location-based indexing.
assert((df.A>0.4).all())

good_AB = naive_AB_checker(df.ix[:,['A','B']].as_matrix())
df = df.ix[good_AB,:]
# this one will filter out unintended rows, but it'll happen silently.  Even the assertion
# below might not be triggered.
assert((df.A>df.B).all()) # assertion error possible
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
'''

Another idea is to require an explicit handling of the index in the DataFrame constructor. Even if users can still call with index=None, at least they've explicitly acknowledged the index exists. That's the same logic behind my earlier suggestion where as_matrix would return a tuple. Rather than implicitly requiring that the user go get the index themselves, explicitly provide it and let them discard it if they actually don't need it for whatever reason.

# were the index actually involved, the mistake would be more likely to be discovered.
# a tuple is rather distinct from just an array of data.
alternative_as_matrix = (df.index,df.ix[:,['A','B']].as_matrix())
good_AB = naive_AB_checker(alternative_as_matrix)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in naive_AB_checker
TypeError: tuple indices must be integers, not tuple
'''

# and a required index parameter would help make the index clearly relevant to users.
# df = pd.DataFrame(np.random.rand(10,3),columns=['A','B','C'],index=None)
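The silent failure can be reproduced deterministically. The sketch below is not the original code: it uses fixed data instead of random numbers, `np.flatnonzero` as an assumed stand-in for `matplotlib.mlab.find`, and `.loc`/`.iloc`/`to_numpy()` from later pandas releases in place of `.ix` and `as_matrix`.

```python
import numpy as np
import pandas as pd

def naive_AB_checker(ABs):
    # Pandas-agnostic: returns row positions where column 0 exceeds column 1.
    return np.flatnonzero(ABs[:, 0] > ABs[:, 1])

df = pd.DataFrame(
    {"A": [0.9, 0.1, 0.5, 0.7, 0.6], "B": [0.1, 0.5, 0.8, 0.2, 0.9]}
)

# The first filter drops the row labelled 1, leaving labels [0, 2, 3, 4].
df = df[df["A"] > 0.4]

# The checker sees a bare array and reports positions 0 and 2.
positions = naive_AB_checker(df[["A", "B"]].to_numpy())

# Mistake: treating positions as labels selects the rows labelled 0 and 2.
wrong = df.loc[positions]

# Fix: select by position, or translate positions into labels first.
right = df.iloc[positions]
right_labels = df.index[positions]
```

Here `wrong` contains the row labelled 2, where A is smaller than B, and no error is raised when it is selected. `right` contains the rows labelled 0 and 3, which is what the checker actually reported.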

@jreback
Contributor

jreback commented Feb 19, 2013

I am still puzzled why you are doing as_matrix()?

you can just pass the frame directly to whatever you are doing
(and if you can't for some reason, that might be a bug)

@kijoshua
Author

naive_AB_checker(df[['A','B']])

that already doesn't work by just passing a frame.

@jreback
Contributor

jreback commented Feb 20, 2013

you need to use ix here

try defining this function like this

def naive_AB_checker(ABs):
    return find(ABs.ix[:,0]>ABs.ix[:,1])

In [35]: naive_AB_checker(df[['A','B']])
Out[35]: array([0, 2, 5, 7, 9])

@kijoshua
Author

as indicated in my example originally, those functions don't and shouldn't have any awareness of Pandas. I have plenty of code that has nothing to do with Pandas, and I'm not going to break everything else that uses it.

Anyways, this is missing the point. Explicit vs implicit.

@jreback
Contributor

jreback commented Feb 20, 2013

This is quite explicit. What is not clear is what you mean when you do ABs[:,0]

Are you referencing a column by location? (I think that is what you are doing.) On a DataFrame
this will raise an error, because 0 is not a column (only 'A' and 'B' are)

if you want these function to behave like numpy arrays then you basically have 2 choices:

  1. df.values gives you the numpy array
  2. use integer indices on both axes

If you pass around a pandas object, then it acts like a pandas object, not like something else.

How exactly do you find this not explicit?
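One way to keep the external functions unaware of pandas while still getting labels back is a small adapter at the boundary. This is only a sketch: `labels_where` is a hypothetical helper, not a pandas API, and `np.flatnonzero` and `to_numpy()` are assumed substitutes for the `find` and `.values`/`as_matrix()` calls used in the thread.

```python
import numpy as np
import pandas as pd

def naive_AB_checker(ABs):
    # Knows nothing about pandas; works on a plain 2-column array.
    return np.flatnonzero(ABs[:, 0] > ABs[:, 1])

def labels_where(frame, positional_func):
    """Hypothetical helper: run a NumPy-only function, return index labels."""
    positions = positional_func(frame.to_numpy())
    return frame.index[positions]

df = pd.DataFrame(
    {"A": [0.9, 0.5, 0.7], "B": [0.1, 0.8, 0.2]},
    index=["x", "y", "z"],
)

keep = labels_where(df[["A", "B"]], naive_AB_checker)
selected = df.loc[keep]
```

The conversion from positions to labels happens in exactly one place, so the rest of the code can stay label-based.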

@kijoshua
Author

naive_AB_checker expects a 2 dimensional matrix with 2 columns. The only reason I ever passed it a frame was because you asked about when just passing a frame wouldn't work...

.values makes more sense than .as_matrix. So another way to look at this is that as_matrix is a poor name. And the broader problem is that "index" is rarely explicit in any of the code. If you look at my first example, and try to imagine that you didn't already understand how pandas works... isn't the result unexpected? And isn't it dangerous, because it often doesn't produce an error (and your data is just messed up silently)? Note that index doesn't appear once in the code. And .ix might actually be assumed to be some abbreviation for index, even though that isn't how it actually works.

@kijoshua
Author

Ah, to address your question more directly, the non-explicit part is how Pandas acts. Is it location indexed or not? Without reading the manual, I'd say the logical assumption is that it is location indexed except in special situations (which is wrong, and causes silent bugs in user code).

@jreback
Contributor

jreback commented Feb 20, 2013

this is a complicated answer - partial discussion below

#1052

pandas tries to be smart about this, it is mainly label based, but does support integer indexing/slicing
it is not a pure matrix

indexing is inherently built into all objects; if you don't need it, don't use it, and just deal with numpy
and pure location-based indexing.

.values IS as_matrix(); values just calls it.
I guess it could be called as_numpy_array to be more explicit

pandas offers a lot of power, so it behooves the user to read up on how to do things

with great power comes great responsibility
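A quick check of what the raw accessor returns, as a sketch. It uses `.values` together with `DataFrame.to_numpy()`, the accessor offered by later pandas releases; `as_matrix()` is left out on the assumption that the reader's version may no longer ship it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["r1", "r2"])

via_values = df.values
via_to_numpy = df.to_numpy()

# Both accessors hand back the same data as a plain ndarray.
same_data = np.array_equal(via_values, via_to_numpy)
is_plain_array = type(via_values) is np.ndarray

# The row labels "r1" and "r2" are not carried along.
has_index = hasattr(via_values, "index")
```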

@kijoshua
Author

The documentation is really long and detailed, has good organization and plenty of examples. The downside is you can't expect any user to read the whole thing and get every nuance. Most will at best skim through it and pick out the parts that seem relevant to their current efforts. Important things like indexing behavior need to be made forefront, either by adjusting the documentation or even better making it more explicit in the code itself.

I often have no use for an index, so it was easy to get started and skim through the documentation while entirely missing the fact that the index matters and is being used even if I don't touch it.

More zen...

Readability counts
Errors should never pass silently.
Unless explicitly silenced.
There should be one-- and preferably only one --obvious way to do it.
If the implementation is hard to explain, it's a bad idea.

I think my idea for the constructor requiring an explicit index gets at the issue better. Given that the indexing scheme is a confusing aspect for new users, just make the index a more prominent feature wherever possible.

@jreback
Contributor

jreback commented Feb 20, 2013

and how exactly would you do that? require passing of the index to the DataFrame constructor?

that seems cumbersome, much better to have the constructor figure it out from the passed (or not passed)
data if it can

not sure how any of the zen points you mentioned actually apply here...except:

There should be one-- and preferably only one --obvious way to do it.

this is true, but if you force users to be explicit ALL of the time then it's worth discarding

@kijoshua
Author

currently, the default for index is None. I'm saying that you could just not give it a default value. Let the user specify None explicitly to get the current behavior of an automatic index. Or give it a default value of [] which errors in the constructor.

@jreback
Contributor

jreback commented Feb 20, 2013

this is not the philosophy of pandas; if you want to do that, you are free to use numpy

@kijoshua
Author

?? the philosophy of pandas is to implicitly deal with the index behavior even when it confuses users? I think we must have misunderstood each other at some point. I'm fine with pandas' behavior. I like pandas (and that's the only reason I'm spending my own time giving feedback that will hopefully improve pandas). I think that the index behavior should be made more explicit.

What's wrong with the small change to the constructor? And how did it imply that I wanted to change the philosophy of pandas in any way?

@jreback
Contributor

jreback commented Feb 20, 2013

we are having an academic discussion :)

I think forcing users to explicitly pass an index is a burden (even if None)

Often your data has enough info to construct an index if you don't pass one. This prevents errors.

If you then want to explicitly change the index, do so.

the first and third of the following yield exactly the same frame; I think the second is clearly undesirable,
and the third is what you are proposing, but is really unnecessary and a burden

In [2]: df = pd.DataFrame(dict( A = [1,2,3], B = 1 ))

In [3]: df
Out[3]: 
   A  B
0  1  1
1  2  1
2  3  1

In [4]: df = pd.DataFrame(dict( A = [1,2,3], B = 1 ), index = [1,2,3])

In [5]: df
Out[5]: 
   A  B
1  1  1
2  2  1
3  3  1

In [6]: df = pd.DataFrame(dict( A = [1,2,3], B = 1 ), index = None)

In [7]: df
Out[7]: 
   A  B
0  1  1
1  2  1
2  3  1

@kijoshua
Author

Fair enough then. I think that 3's small burden is well worth the benefit to readability, and how clear things are to new users.

@jreback
Contributor

jreback commented Feb 20, 2013

@wesm care to chime in?

@wesm
Member

wesm commented Feb 20, 2013

My view is that how you interact with raw data inside a DataFrame (such as that obtained from df.values or df.as_matrix()) is "not our problem" as typically that data is being passed off to a library like SciPy or statsmodels or sklearn. pandas made a very early decision to depart from NumPy indexing semantics in the interest of being practical and consistent across the library in its treatment of axis labelling; to change that or add extra burden in the index specification would be extremely disruptive to the user base. I would personally find it very annoying to have to be explicit about the index. It's important to recognize that pandas is a "relational data tool" and not an "array library". I'd be interested in having a very rigid index-less table object that is more array-like in its semantics, but DataFrame in its current incarnation is too far down that other path.
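When raw data is handed off to an array library, the labels can be reattached to any per-row result afterwards. A minimal sketch, using `np.linalg.norm` as a stand-in for whatever SciPy, statsmodels or sklearn routine is actually called, and assuming a pandas version with `to_numpy()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [3.0, 0.0, 6.0], "B": [4.0, 1.0, 8.0]},
    index=["x", "y", "z"],
)

# The array library only ever sees positions.
row_norms = np.linalg.norm(df.to_numpy(), axis=1)

# One value per row, in row order, so the original index lines up again.
labelled = pd.Series(row_norms, index=df.index, name="norm")
```

This only works while the result keeps one entry per input row in the same order. If the external routine reorders or drops rows, the positions it reports have to be mapped back through `df.index` instead.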

@hayd
Contributor

hayd commented Feb 21, 2013

@kijoshua You can be explicit about passing index=None each time you construct a DataFrame; if you think this will make your code more readable, then you can force yourself and your team to do it. As @jreback points out, this is the current behaviour.

I certainly don't relish the thought of going back through my code to add index=None to every DataFrame construction, and there are many others out there who've already written a lot of production code this would break. To be honest, the current behaviour is clear after creating your first DataFrame; having to pass index=None each time would be just another thing to remember and would confuse new users (as well as being semantically dubious).
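For a team that does want the explicit style, it can be enforced locally without changing pandas. The wrapper below is a hypothetical sketch (`make_frame` is not a pandas function): it makes `index` a keyword-only argument with no default, so leaving it out fails immediately.

```python
import pandas as pd

def make_frame(data, *, index, **kwargs):
    """Hypothetical team-level wrapper: `index` must always be spelled out."""
    return pd.DataFrame(data, index=index, **kwargs)

# Explicitly asking for the automatic index.
explicit_default = make_frame({"A": [1, 2, 3], "B": 1}, index=None)

# Explicitly supplying labels.
explicit_labels = make_frame({"A": [1, 2, 3], "B": 1}, index=[1, 2, 3])

# Omitting the argument is rejected by Python itself.
try:
    make_frame({"A": [1, 2, 3], "B": 1})
    omitted_raises = False
except TypeError:
    omitted_raises = True
```

Existing code that calls `pd.DataFrame` directly keeps working, which avoids the breakage described above.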

@jreback
Contributor

jreback commented Mar 10, 2013

@kijoshua is this closable?

@jreback closed this as completed Mar 19, 2013