Consistency between indexing by labels and "exporting" data #2900

kijoshua · 2013-02-19T16:45:21Z

I think this falls under the Zen of Python, "explicit is better than implicit".

Documentation says "In pandas, our general viewpoint is that labels matter more than integer locations". But I've been caught a few times when trying to carry over this viewpoint to things that access a DataFrame via .as_matrix(). as_matrix returns the data but without the index labels, which requires that the integer locations be taken into account if you want to relate future results based on that matrix form back to the original DataFrame. The all-important index labels are treated implicitly by as_matrix.

The "as_" prefix suggests to me that I should get a complete view of my data, just in a different format. Since labels matter more than locations, it feels out of place for labels to be left out. This makes it easy for beginners to forget about the labels and accidentally try to apply location-based slices via .ix[] based on location indices calculated from the matrix form. This is quite easy to do if you never cared about your index labels in the first place. Since the index is created by default if not specified, DataFrames act just like they are location sliced until you remove a row or shuffle things around.
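A minimal sketch of the point above, assuming a pandas version that provides the `.loc` (label) and `.iloc` (position) accessors, which the thread itself does not use. With the default index the two lookups agree; once a row is removed they diverge without any error:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40]})

# With the default index, label 2 and position 2 refer to the same row.
assert df.loc[2, "A"] == df["A"].iloc[2] == 30

# Drop the row labelled 1: the labels become 0, 2, 3 while positions stay 0, 1, 2.
smaller = df.drop(1)
label_lookup = smaller.loc[2, "A"]       # the row whose label is 2
position_lookup = smaller["A"].iloc[2]   # the third remaining row

print(smaller.index.tolist(), label_lookup, position_lookup)
```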

It might be worth changing as_matrix to return the index labels somehow (perhaps with a flag to turn this off), either as a tuple (df.index, data) or by treating the index as the first column. This way, users would be required to consider the index labels as an important aspect of their data. The index isn't just something Pandas shows when printing out rows of the DataFrame, but an actual part of your data.
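The tuple idea can be approximated in user code today. The sketch below is only an illustration: `values_with_index` is a hypothetical helper name, not a pandas function, and `np.flatnonzero` stands in for any pandas-oblivious routine that returns integer positions.

```python
import numpy as np
import pandas as pd

def values_with_index(frame):
    """Hypothetical helper: return the index labels together with the 2-D array."""
    return frame.index, frame.values

df = pd.DataFrame({"A": [0.9, 0.1, 0.7], "B": [0.3, 0.5, 0.2]},
                  index=["x", "y", "z"])

labels, data = values_with_index(df)

# A pandas-oblivious computation returns integer positions...
positions = np.flatnonzero(data[:, 0] > data[:, 1])

# ...and the labels that travelled with the data translate them back.
selected = df.loc[labels[positions]]
```

Because the labels are returned next to the array, the caller has to decide what to do with them instead of forgetting they exist.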

jreback · 2013-02-19T16:53:47Z

You do realize that `as_matrix()` and the syntactic sugar `.values`
just return a numpy 2d array of the data? Is there a situation where you are trying to use
the ndarray directly rather than a DataFrame?

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.rand(10,2),columns=['A','B'])

In [4]: df
Out[4]: 
          A         B
0  0.898572  0.320656
1  0.078431  0.340781
2  0.451752  0.282544
3  0.545498  0.888366
4  0.471818  0.007790
5  0.548473  0.855393
6  0.718170  0.386948
7  0.371798  0.635438
8  0.357832  0.194048
9  0.345692  0.546124

In [5]: df.values
Out[5]: 
array([[ 0.8985716 ,  0.32065559],
       [ 0.07843058,  0.3407808 ],
       [ 0.45175201,  0.28254406],
       [ 0.54549838,  0.88836635],
       [ 0.47181816,  0.00779049],
       [ 0.54847328,  0.85539265],
       [ 0.71816954,  0.3869482 ],
       [ 0.37179806,  0.63543765],
       [ 0.35783155,  0.19404793],
       [ 0.34569207,  0.54612435]])

In [6]: df.as_matrix()
Out[6]: 
array([[ 0.8985716 ,  0.32065559],
       [ 0.07843058,  0.3407808 ],
       [ 0.45175201,  0.28254406],
       [ 0.54549838,  0.88836635],
       [ 0.47181816,  0.00779049],
       [ 0.54847328,  0.85539265],
       [ 0.71816954,  0.3869482 ],
       [ 0.37179806,  0.63543765],
       [ 0.35783155,  0.19404793],
       [ 0.34569207,  0.54612435]])

kijoshua · 2013-02-19T18:19:43Z

The issue is that the index isn't explicit. I expect that any changes to address this wouldn't save typing or add convenience, but instead would make it harder to write incorrect code regarding the index.

For your example and my example below, there is nothing explicit about the importance of the index. It just doesn't come up in the code for many situations. Again from the documentation: "Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community." Aside from it just being a debate over which approach is better, I think it also reflects a big source of confusion for users. DataFrames work mostly by label, but then as_matrix and the like give access to things that work by location. Making the index aspect of pandas more visible would hopefully make it less confusing.

import numpy as np
import pandas as pd
from matplotlib.mlab import find

# some complicated external method that has more general uses
# is (and should be) oblivious to Pandas
def naive_A_checker(As):
    return find(As>0.4)

# some complicated external method that has more general uses
# is (and should be) oblivious to Pandas
def naive_AB_checker(ABs):
    return find(ABs[:,0]>ABs[:,1])


df = pd.DataFrame(np.random.rand(10,3),columns=['A','B','C'])

good_A = naive_A_checker(df.A)
df = df.ix[good_A,:]
# this first filtering will work correctly, but only because Pandas supplies an index
# automatically that happens to match location-based indexing.
assert((df.A>0.4).all())

good_AB = naive_AB_checker(df.ix[:,['A','B']].as_matrix())
df = df.ix[good_AB,:]
# this one will filter out unintended rows, but it'll happen silently.  Even the assertion
# might not be triggered.
assert((df.A>df.B).all()) # assertion error possible
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
'''
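Because the example above uses random data, the failure does not show up on every run. The following is a deterministic sketch of the same mistake, written under some assumptions: `np.flatnonzero` stands in for `matplotlib.mlab.find`, `.values` for `as_matrix()`, and `.loc` / `.iloc` (from later pandas releases) for the label and position halves of `.ix`.

```python
import numpy as np
import pandas as pd

def naive_AB_checker(ABs):
    # Pandas-oblivious stand-in for the thread's function: positions where A > B.
    return np.flatnonzero(ABs[:, 0] > ABs[:, 1])

df = pd.DataFrame({"A": [0.9, 0.1, 0.8, 0.5],
                   "B": [0.1, 0.9, 0.9, 0.2]})

# First filter: keep rows with A > 0.4. The surviving labels are 0, 2, 3.
df = df[df["A"] > 0.4]

# Positions (within the filtered data) where A > B.
good_AB = naive_AB_checker(df[["A", "B"]].values)

# Mistake: treating positions as labels. Label 2 is a row where A < B.
by_label = df.loc[good_AB]

# Correct: treating positions as positions.
by_position = df.iloc[good_AB]

print((by_label["A"] > by_label["B"]).all())        # the mistaken selection
print((by_position["A"] > by_position["B"]).all())  # the intended selection
```

No exception is raised on the mistaken path, since every position returned happens to exist as a label as well.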

Another idea is to require an explicit handling of the index in the DataFrame constructor. Even if users can still call with index=None, at least they've explicitly acknowledged the index exists. That's the same logic behind my earlier suggestion where as_matrix would return a tuple. Rather than implicitly requiring that the user go get the index themselves, explicitly provide it and let them discard it if they actually don't need it for whatever reason.

# were the index actually involved, more likely to discover the mistake.
# a tuple is rather distinct from just an array of data.
alternative_as_matrix = (df.index,df.ix[:,['A','B']].as_matrix())
good_AB = naive_AB_checker(alternative_as_matrix)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in naive_AB_checker
TypeError: tuple indices must be integers, not tuple
'''

# and a required index parameter would help make the index clearly relevant to users.
# df = pd.DataFrame(np.random.rand(10,3),columns=['A','B','C'],index=None)

jreback · 2013-02-19T19:35:54Z

I am still puzzled why you are doing as_matrix()?

you can just pass the frame directly to whatever you are doing
(and if you can't for some reason, that might be a bug)

kijoshua · 2013-02-20T14:47:49Z

naive_AB_checker(df[['A','B']])

that already doesn't work by just passing a frame.

jreback · 2013-02-20T15:05:42Z

you need to use ix here

try defining this function like this

def naive_AB_checker(ABs):
    return find(ABs.ix[:,0]>ABs.ix[:,1])

In [35]: naive_AB_checker(df[['A','B']])
Out[35]: array([0, 2, 5, 7, 9])

kijoshua · 2013-02-20T15:09:37Z

as indicated in my example originally, those functions don't and shouldn't have any awareness of Pandas. I have plenty of code that has nothing to do with Pandas, and I'm not going to break everything else that uses it.

Anyways, this is missing the point. Explicit vs implicit.

jreback · 2013-02-20T15:16:26Z

This is quite explicit; it is not clear what you mean when you do ABs[:,0]

are you referencing a column by location? (I think that is what you are doing.) But
this will raise an error because '0' is not a column (only 'A' and 'B' are)

if you want these functions to behave like numpy arrays then you basically have 2 choices:

df.values gives you the numpy array
use integer indices on both axes

If you pass around a pandas object, then it acts like a pandas object, not like something else.

How exactly do you find this not explicit?
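For reference, the two choices listed above might look like this. It is a sketch that assumes a pandas version with the `.iloc` accessor for purely positional indexing; the thread itself uses `.ix`.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 5], "B": [2, 4, 0]}, index=[10, 20, 30])

# Choice 1: hand over a plain ndarray, where arr[:, 0] means "first column".
arr = df.values
first_col_from_array = arr[:, 0]

# Choice 2: stay in pandas but index by integer position on both axes.
first_col_from_frame = df.iloc[:, 0]

# The array has no labels at all; the Series still carries 10, 20, 30.
print(first_col_from_array)
print(first_col_from_frame)
```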

kijoshua · 2013-02-20T15:28:13Z

naive_AB_checker expects a 2 dimensional matrix with 2 columns. The only reason I ever passed it a frame was because you asked about when just passing a frame wouldn't work...

.values makes more sense than .as_matrix. So another way to look at this is that as_matrix is a poor name. And the broader problem is that "index" is rarely explicit in any of the code. If you look at my first example, and try to imagine that you didn't already understand how pandas works... isn't the result unexpected? And isn't it dangerous, because it often doesn't produce an error (and your data is just messed up silently)? Note that index doesn't appear once in the code. And .ix might actually be assumed to be some abbreviation for index, even though that isn't how it actually works.

kijoshua · 2013-02-20T15:31:09Z

Ah, to address your question more directly, the non-explicit part is how Pandas acts. Is it location indexed or not? Without reading the manual, I'd say the logical assumption is that it is location indexed except in special situations (which is wrong, and causes silent bugs in user code).

jreback · 2013-02-20T15:43:55Z

this is a complicated answer - partial discussion below

#1052

pandas tries to be smart about this, it is mainly label based, but does support integer indexing/slicing
it is not a pure matrix

indexing is inherently built into all objects; if you don't need it, don't use it, and just deal with numpy
and pure location based indexing.

.values IS as_matrix(); values just calls it.
I guess it could be called as_numpy_array to be more explicit

pandas offers a lot of power, so it behooves the user to read up on how to do things

with great power comes great responsibility

kijoshua · 2013-02-20T16:03:59Z

The documentation is really long and detailed, has good organization and plenty of examples. The downside is you can't expect any user to read the whole thing and get every nuance. Most will at best skim through it and pick out the parts that seem relevant to their current efforts. Important things like indexing behavior need to be made forefront, either by adjusting the documentation or even better making it more explicit in the code itself.

I often have no use for an index, so it was easy to get started and skim through the documentation while entirely missing the fact that the index matters and is being used even if I don't touch it.
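For code that genuinely has no use for the labels, one possible workaround (a sketch, not something proposed in this thread) is to renumber the index after every filtering step with `reset_index(drop=True)`, so that labels and positions coincide again:

```python
import pandas as pd

df = pd.DataFrame({"A": [0.9, 0.1, 0.8, 0.5]})

# Filtering keeps the original labels, so they are now 0, 2, 3.
filtered = df[df["A"] > 0.4]

# Discard the old labels so that label i is once more the i-th row.
renumbered = filtered.reset_index(drop=True)

print(filtered.index.tolist(), renumbered.index.tolist())
```

The cost is that the renumbered frame can no longer be related back to rows of the original `df` by label.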

More zen...

Readability counts
Errors should never pass silently.
Unless explicitly silenced.
There should be one-- and preferably only one --obvious way to do it.
If the implementation is hard to explain, it's a bad idea.

I think my idea for the constructor requiring an explicit index gets at the issue better. Given that the indexing scheme is a confusing aspect for new users, just make the index a more prominent feature wherever possible.

jreback · 2013-02-20T16:10:31Z

and how exactly would you do that? require passing of the index to the DataFrame constructor?

that seems cumbersome, much better to have the constructor figure it out from the passed (or not passed)
data if it can

not sure how any of the zen points you mentioned actually apply here...except:

There should be one-- and preferably only one --obvious way to do it.

this is true, but if you force users to be explicit ALL of the time then it's worth discarding

kijoshua · 2013-02-20T16:14:39Z

currently, the default for index is None. I'm saying that you could just not give it a default value. Let the user specify None explicitly to get the current behavior of an automatic index. Or give it a default value of [] which errors in the constructor.
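The proposal can be tried out without touching pandas itself. The wrapper below is a hypothetical sketch: `explicit_frame` and the `_MISSING` sentinel are illustrative names, and a sentinel object is used instead of `[]` so that "not passed" can be told apart from any real value.

```python
import pandas as pd

_MISSING = object()  # sentinel: distinguishes "not passed" from index=None

def explicit_frame(data, index=_MISSING, **kwargs):
    """Hypothetical wrapper: refuse to build a DataFrame unless the caller
    says something about the index, even if that something is None."""
    if index is _MISSING:
        raise TypeError("pass index=... (use index=None for the automatic index)")
    return pd.DataFrame(data, index=index, **kwargs)

# Explicitly asking for the automatic index works as usual.
df = explicit_frame({"A": [1, 2, 3]}, index=None)

# Leaving the index out is rejected.
try:
    explicit_frame({"A": [1, 2, 3]})
    raised = False
except TypeError:
    raised = True
```

A team that prefers the explicit style could adopt such a wrapper locally, which leaves the `DataFrame` constructor unchanged for everyone else.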

jreback · 2013-02-20T16:19:03Z

this is not the philosophy of pandas; if you want to do that, you are free to use numpy

kijoshua · 2013-02-20T16:23:00Z

?? the philosophy of pandas is to implicitly deal with the index behavior even when it confuses users? I think we must have misunderstood each other at some point. I'm fine with pandas' behavior. I like pandas (and that's the only reason I'm spending my own time giving feedback that will hopefully improve pandas). I think that the index behavior should be made more explicit.

What's wrong with the small change to the constructor? And how did it imply that I wanted to change the philosophy of pandas in any way?

jreback · 2013-02-20T16:32:23Z

we are having an academic discussion :)

I think forcing users to explicitly pass an index is a burden (even if None)

Often your data has enough info in order to construct an index if you don't pass one. This prevents errors.

If you then want to explicitly change the index, do so.

the following yield exactly the same, but I think second is clearly undesirable,
third is what you are proposing, but is really unnecessary and a burden

In [2]: df = pd.DataFrame(dict( A = [1,2,3], B = 1 ))

In [3]: df
Out[3]: 
   A  B
0  1  1
1  2  1
2  3  1

In [4]: df = pd.DataFrame(dict( A = [1,2,3], B = 1 ), index = [1,2,3])

In [5]: df
Out[5]: 
   A  B
1  1  1
2  2  1
3  3  1

In [6]: df = pd.DataFrame(dict( A = [1,2,3], B = 1 ), index = None)

In [7]: df
Out[7]: 
   A  B
0  1  1
1  2  1
2  3  1

kijoshua · 2013-02-20T16:40:43Z

Fair enough then. I think that 3's small burden is well worth the benefit to readability, and how clear things are to new users.

jreback · 2013-02-20T16:58:50Z

@wesm care to chime in?

wesm · 2013-02-20T17:05:45Z

My view is that how you interact with raw data inside a DataFrame (such as that obtained from df.values or df.as_matrix()) is "not our problem" as typically that data is being passed off to a library like SciPy or statsmodels or sklearn. pandas made a very early decision to depart from NumPy indexing semantics in the interest of being practical and consistent across the library in its treatment of axis labelling; to change that or add extra burden in the index specification would be extremely disruptive to the user base. I would personally find it very annoying to have to be explicit about the index. It's important to recognize that pandas is a "relational data tool" and not an "array library". I'd be interested in having a very rigid index-less table object that is more array-like in its semantics, but DataFrame in its current incarnation is too far down that other path.

hayd · 2013-02-21T13:34:31Z

@kijoshua You can be explicit about passing index=None each time you construct a DataFrame; if you think this will make your code more readable then you can force yourself and your team to do it. As @jreback points out, this is the current behaviour.

I certainly don't relish the thought of going back through my code to add index=None to every DataFrame construction, and there are many others out there who've already written a lot of production code this would break. To be honest, the current behaviour is clear after creating your first DataFrame; having to pass index=None each time would be just another thing to remember and confuse new users (as well as being semantically dubious).

jreback · 2013-03-10T20:10:09Z

@kijoshua is this closable?

jreback closed this as completed Mar 19, 2013
