Meta data for DataFrame and Column #40

maartenbreddels · 2021-03-04T19:49:13Z

In order to not lose information that is encoded in DataFrames and Columns that is not covered by our API, we may want to provide extra metadata slots for these.

One may argue that this information should be covered by the API, and that an escape hatch defeats the purpose of a standard. Still, I think it's a very pragmatic approach: it guarantees lossless round-tripping for information outside of this standard, and it helps adoption (because there is an escape hatch).

Example metadata for a dataframe:

path: for when it's backed by a file or remote
description: metadata describing the dataframe
license: CC0, MIT
history: log of how the data was produced

Example metadata for a column:

unit: string that describes the unit ('km/s', 'parsec', 'furlong')
description: metadata describing the column
expression: in vaex, this is the expression in string form
is_index: an indicator that this column is the index in Pandas.

This could also help to round trip Arrow extension types: https://arrow.apache.org/docs/python/extending_types.html and I guess the same holds for Pandas.

An implementation could be a method def get_metadata(self) -> dict[str, Any], where we recommend prefixing keys with implementation-specific names, like 'arrow.extension_type', 'vaex.unit', 'pandas.extension_type_name', etc.

Commonly used keys could later be promoted to non-prefixed keys that we formalize and document as part of the API.
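To make the proposal concrete, here is a minimal sketch of what prefixed metadata could look like. Everything in it is an assumption for illustration: the Column and DataFrame classes, the get_column method, and the keys_for helper are hypothetical and are not the API of vaex, pandas, Arrow, or of this standard. Only the get_metadata signature and the prefixing convention come from the text above.

```python
from typing import Any, Dict, Iterable


class Column:
    """Hypothetical column wrapper, used only to illustrate the proposal."""

    def __init__(self, values: Iterable[Any], metadata: Dict[str, Any] = None):
        self._values = list(values)
        # Copy so the caller's dict and ours cannot alias each other.
        self._metadata = dict(metadata or {})

    def get_metadata(self) -> Dict[str, Any]:
        # Return a copy so consumers cannot mutate the producer's metadata.
        return dict(self._metadata)


class DataFrame:
    """Hypothetical dataframe wrapper holding named columns."""

    def __init__(self, columns: Dict[str, Column], metadata: Dict[str, Any] = None):
        self._columns = dict(columns)
        self._metadata = dict(metadata or {})

    def get_metadata(self) -> Dict[str, Any]:
        return dict(self._metadata)

    def get_column(self, name: str) -> Column:
        return self._columns[name]


def keys_for(metadata: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Keep only the entries whose key starts with '<prefix>.'."""
    return {k: v for k, v in metadata.items() if k.startswith(prefix + ".")}


# A producer attaches implementation-specific, prefixed keys.
speed = Column(
    [1.0, 2.5],
    metadata={
        "vaex.unit": "km/s",
        "vaex.expression": "distance / time",
        "pandas.is_index": False,
    },
)
df = DataFrame({"speed": speed}, metadata={"vaex.description": "example table"})

# A consumer picks out the keys it understands and ignores the rest.
vaex_keys = keys_for(df.get_column("speed").get_metadata(), "vaex")
pandas_keys = keys_for(df.get_column("speed").get_metadata(), "pandas")
```

A consumer that knows nothing about a prefix simply gets an empty dict from keys_for, which is the escape-hatch behaviour described above: unknown metadata is carried along but never required.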

FYI: metadata is a first-class citizen in the Clojure language https://clojure.org/reference/metadata


rgommers · 2021-03-20T15:46:13Z

I like this idea, adding metadata at both the dataframe and column level makes sense to me.

FYI: metadata is a first-class citizen in the Clojure language https://clojure.org/reference/metadata

From this link: "An important thing to understand about metadata is that it is not considered to be part of the value of an object. As such, metadata does not impact equality (or hash codes). Two objects that differ only in metadata are equal."

While we don't support comparing for equality or hashing directly, this is probably still a relevant point. It emphasizes that metadata really must be optional, and ignoring it should always be safe to do.
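As a rough illustration of the Clojure semantics quoted above (not something this API provides, since it does not define equality or hashing), the sketch below uses a hypothetical Column dataclass whose metadata field is excluded from comparison and hashing. The class and the total helper are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Column:
    """Hypothetical column: values define identity, metadata does not."""

    values: Tuple[float, ...]
    # compare=False keeps metadata out of the generated __eq__ and __hash__.
    metadata: Dict[str, Any] = field(default_factory=dict, compare=False)


def total(col: Column) -> float:
    # A consumer that ignores metadata entirely still computes a valid result.
    return sum(col.values)


with_unit = Column((1.0, 2.5), metadata={"vaex.unit": "km/s"})
without_unit = Column((1.0, 2.5))

same_value = with_unit == without_unit
same_hash = hash(with_unit) == hash(without_unit)
same_total = total(with_unit) == total(without_unit)
```

Here the two columns compare equal and hash identically even though only one carries a unit, which is the sense in which ignoring metadata is always safe.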

kkraus14 · 2021-03-22T17:16:25Z

+1 to metadata at both the dataframe and column level

rgommers · 2021-08-24T13:21:49Z

This was implemented in gh-43, so closing this issue.

rgommers added the enhancement (New feature or request) label Mar 20, 2021

rgommers added the interchange-protocol label Jun 25, 2021

rgommers closed this as completed Aug 24, 2021

rgommers mentioned this issue Jul 20, 2023

Add standard unit of measure support #202

Open
