Higher-dimensional "columns" #59
Comments
Interesting question, thanks @jni. My first thought is that
Unless I'm missing something, a
If there'd be multiple libraries with a
I agree that a top level
I interpreted the question as wanting to store n-dimensional data in a column of a DataFrame, where presumably the first dimension is equal to the number of rows in the DataFrame. This sounds very reasonable and a worthwhile extension to support in the future. This could be supported via something similar to the Arrow FixedSizeListArray: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html
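As a rough sketch of what a fixed-size-list-backed column could look like (this is not part of any agreed-upon API; the variable names and the list size of 4 are invented for illustration), pyarrow's FixedSizeListArray stores a fixed-length chunk of values per row:

>>> import numpy as np
>>> import pyarrow as pa
>>> # 3 logical rows, each holding a fixed-size chunk of 4 float64 values
>>> values = pa.array(np.arange(12, dtype=np.float64))
>>> vectors = pa.FixedSizeListArray.from_arrays(values, 4)
>>> len(vectors)
3
>>> vectors[0].as_py()
[0.0, 1.0, 2.0, 3.0]

Such a column is logically a (3, 4) block of data while still being a single Arrow array, which is roughly the kind of extension being discussed here.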
Should the two be considered conceptually similar? For example, would operations on n-dimensional
Ah I can see that as being feasible. In that case it hasn't got much to do with
This seems similar to numpy structured arrays, where a field can have its own (fixed) shape:

>>> recarray = np.empty(10, dtype=[('x', np.int64), ('y', np.float64, (3, 4)), ('z', str)])
>>> recarray[0]
(0, [[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]], '')
>>> recarray[0][0]
0
>>> recarray[0][1]
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
>>> recarray['x'].shape
(10,)
>>> recarray['y'].shape
(10, 3, 4)

IIUC this isn't an often-used feature, but it can be very powerful/expressive, so I think it would be worthwhile to support if it doesn't add too much complexity.
In this world, the DataFrame would instead be an ordered/labelled set of arrays, each with a homogeneous dtype - i.e. you'd just drop the 1D requirement.
@rgommers @kkraus14 note that I proposed that DataArrays would be conceptually equivalent to columns. An xarray Dataset would be equivalent to a DataFrame, i.e., as @dhirschfeld notes, we are merely dropping the 1D requirement of a column; everything else remains the same. However, I'm not familiar enough with xarray indexing semantics to understand further implications, e.g. do indices now have to have as many dimensions as the highest-dimensional DataArray in the Dataset?
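To make that proposed equivalence concrete, here is a minimal sketch using xarray (the dimension and variable names below are invented for the example, not taken from any proposal): a Dataset holding a 1D and a 3D DataArray that share their leading "row" dimension, mirroring the recarray example above.

>>> import numpy as np
>>> import xarray as xr
>>> # a Dataset as a "dataframe" whose variables need not be 1D;
>>> # both share the leading "row" dimension of length 10
>>> ds = xr.Dataset({
...     "x": ("row", np.arange(10)),
...     "y": (("row", "i", "j"), np.zeros((10, 3, 4))),
... })
>>> ds["x"].shape
(10,)
>>> ds["y"].shape
(10, 3, 4)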
I would argue no, with the caveat that we should enable going to array libraries zero-copy if possible in these situations. I think the fact that we can have nulls at any level of the array makes them different enough that we shouldn't implement broadcasting. Additionally, primitive-typed columns without nulls are functionally equivalent to a 1D array, and broadcasting isn't supported on them. Sounds like the request is for n-dimensional columns, which sounds reasonable and in scope for the project once we start tackling nested types more generally.
@jni I would also be interested in this. scipp relies heavily on a
Without items 2.) and 3.) there is a big gap to bridge between 1.) and 4.), which is probably OK for the pandas-style DataFrame, but might limit the usefulness of a standard for non-1D applications. Have items 2.) and 3.) been discussed anywhere? I am a bit late to the party and am trying to catch up with some reading...
This is not even half-baked, but I wanted to gauge interest/feasibility for the spec to encapsulate n-dimensional "columns" of data, equivalent to xarray's DataArrays. In that case, the currently-envisioned columns would be the 1D special case of a higher-D general case. We've found that in some use cases we need these in napari (napari/napari#2592, napari/napari#2917), and it would be awesome to conform to the dataframe API and be compatible with both xarray and pandas.
Of course, the other way around this is to ignore the higher-D libraries and have them conform to the API once it's settled. That might be more reasonable, in which case I'm perfectly happy for this to be closed. 😊