API: ExtensionArray "serialize" and "from_serialized" methods for defining roundtripping behavior #25347

Closed
mroeschke opened this issue Feb 16, 2019 · 7 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@mroeschke
Member

xref #25308 (comment)

When EAs go through cython routines, they need to be converted to numpy arrays (and dtypes) first and then recast to the correct EA. This recast may require specific logic. For example, DatetimeTZDtypes need to be localized to UTC first before applying the dtype.
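The datetime-tz case can be sketched with public pandas calls. This is only an illustration, not the internal code path: the helper names `tz_to_ndarray` and `tz_from_ndarray` are hypothetical, and the example assumes a `DatetimeIndex` rather than a bare array.

```python
import pandas as pd


def tz_to_ndarray(idx):
    """Hypothetical helper: return naive UTC datetime64 values plus the dtype."""
    values = idx.tz_convert("UTC").tz_localize(None).to_numpy()
    return values, idx.dtype


def tz_from_ndarray(values, dtype):
    """Hypothetical helper: localize to UTC first, then convert to the original tz."""
    return pd.DatetimeIndex(values).tz_localize("UTC").tz_convert(dtype.tz)


idx = pd.DatetimeIndex(["2019-02-16 09:00", "2019-02-16 17:30"], tz="US/Eastern")
values, dtype = tz_to_ndarray(idx)
restored = tz_from_ndarray(values, dtype)

# Applying the timezone directly treats the UTC values as local wall times,
# which shifts every timestamp by the UTC offset.
wrong = pd.DatetimeIndex(values).tz_localize(dtype.tz)

print(restored.equals(idx))
print(wrong.equals(idx))
```

The second print shows why the recast needs dtype-specific logic: the plain ndarray carries no record that its values are UTC.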

The proposal is for (optional) internal _serialize and _from_serialized methods that define the specific recasting logic for the EA. @jreback proposed something like:

serialized, dtype = EA._serialize()
original = EA._from_serialized(serialized, dtype)

Thoughts @jorisvandenbossche @TomAugspurger @jbrockmendel

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 16, 2019
@TomAugspurger
Contributor

I don't quite follow the proposal. At least, I don't think the proposed behavior matches the names _from_serialized and _serialize. I think serialization is typically used for a protocol / byte-level representation of an object.

IIUC, the desire is for a pair of methods that will be called before and after a (NumPy?) numeric operation. Do we have examples other than datetime-tz?

  • Categorical: something like .codes and .from_codes, but Categorical also needs to specially handle how the code -1 is handled in the op, so that doesn't quite fit.
  • Period: ordinals. Should be similar to datetime
  • Sparse: Probably doesn't want to implement ._serialize if an ndarray is the required type
  • Interval: No idea
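
For the Categorical bullet, a small sketch of the codes round trip and of why the -1 sentinel gets in the way. It uses only the public `Categorical.codes` and `Categorical.from_codes`; the reversal and the `min` call are stand-ins for whatever numeric routine would actually run on the codes.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", None, "a"], categories=["a", "b", "c"])
codes = cat.codes.copy()  # integer ndarray; the missing value is stored as -1
dtype = cat.dtype

# A position-only operation round-trips cleanly through the codes.
restored = pd.Categorical.from_codes(codes[::-1], dtype=dtype)

# A value-based reduction does not: -1 is a sentinel, not a category,
# so it has to be masked out before the operation.
naive_min = int(codes.min())
mask = codes == -1
valid_min = int(codes[~mask].min())

print(list(restored))
print(naive_min, dtype.categories[valid_min])
```

The mask step is the part that a generic pair of "to ndarray" and "from ndarray" methods would not capture on its own.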

How does _serialize differ from _ndarray_values (whose behavior isn't really defined yet)?

@mroeschke
Member Author

"serialize" is a bit of a misnomer (just a placeholder name). Your follow-up examples are closer to the spirit of the idea.

I think _serialize would be essentially the same as _ndarray_values, but would additionally return the extension dtype so the pair could be passed to _from_serialized.

@jreback
Contributor

jreback commented Feb 21, 2019

This doesn't necessarily need a method on EA. Maybe all we need is a pair of functions for internal use, something like the following. We can use them whenever we need to go to a performant representation and back (e.g. pretty much anytime we use cython).

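import numpy as np
from pandas.api.types import is_extension_array_dtype

def to_internal_transform(arr):
    # Extension arrays: hand back the ndarray form plus what is needed to rebuild.
    if is_extension_array_dtype(arr):
        return arr._ndarray_values, arr.dtype, arr.dtype.na_value
    return arr, arr.dtype, np.nan

def from_internal_transform(arr, dtype, na_value):
    # Rebuild the extension array from the ndarray form, then cast to the dtype.
    if is_extension_array_dtype(dtype):
        arr = dtype.construct_array_type()._from_sequence(arr)
        arr = arr.astype(dtype, copy=False)
    return arr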
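
A runnable variant of the same idea, restricted to public API and to a single dtype, might look like the sketch below. The names `to_internal` and `from_internal` are hypothetical, the float64/NaN representation is an assumption that suits nullable integers only, and `values * 2` stands in for a cython routine.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def to_internal(arr):
    """Hypothetical helper: return an ndarray plus the dtype needed to rebuild."""
    if is_extension_array_dtype(arr.dtype):
        # Assumption: float64 with NaN is an acceptable stand-in for this dtype.
        return arr.to_numpy(dtype="float64", na_value=np.nan), arr.dtype
    return np.asarray(arr), arr.dtype


def from_internal(values, dtype):
    """Hypothetical helper: recast an ndarray result to the original dtype."""
    if is_extension_array_dtype(dtype):
        return pd.array(values, dtype=dtype)
    return values.astype(dtype, copy=False)


arr = pd.array([1, 2, None], dtype="Int64")
values, dtype = to_internal(arr)
restored = from_internal(values * 2, dtype)
print(restored)
```

This works here because NaN maps back to the dtype's missing value; other extension dtypes would need their own representation and recast logic.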

@jbrockmendel
Member

Since _ndarray_values is gone, I think this is closeable.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Sep 22, 2020
@jreback
Contributor

jreback commented Sep 22, 2020

we still need this general functionality though
we currently have a lot of scattered logic which converts arrays to cython types and then reconstructs the original

groupby, window, algos and nanops all do this

@jbrockmendel
Member

We do it for datetimelike (though even that not really in window), but that's about it. There isn't a realistic way to do this for the general case.

@mroeschke
Member Author

It sounds like this use case isn't practical so closing.

@jorisvandenbossche jorisvandenbossche removed the Closing Candidate May be closeable, needs more eyeballs label Mar 23, 2022
5 participants