Public Data Followups #23995


Closed
2 of 3 tasks
TomAugspurger opened this issue Nov 29, 2018 · 10 comments
Labels: API Design, Dtype Conversions (Unexpected or buggy dtype conversions), ExtensionArray (Extending pandas with custom dtypes or arrays)
Milestone: 0.24.0

Comments

@TomAugspurger
Contributor

TomAugspurger commented Nov 29, 2018

leftover from #23623

  1. Signature for .to_numpy(): @jorisvandenbossche proposed copy=True, which I think is good. Beyond that, we may want to control the "fidelity" of the conversion. Should Series[datetime64[ns, tz]].to_numpy() be an ndarray of Timestamp objects, or an ndarray of datetime64[ns] normalized to UTC by default (and should that be controllable)? Can we hope for a set of keywords appropriate for all subtypes, or do we need to allow kwargs? Perhaps to_numpy(copy=True, dtype=None) will suffice?

  2. Make .array always an ExtensionArray (via @shoyer). This gives pandas a bit more freedom going forward, since the type of .array will stay stable if / when we flip over to Arrow arrays by default; we'll just swap out the data backing the ExtensionArray. A generic "NumpyBackedExtensionArray" is pretty easy to write (I had one in cyberpandas); a minimal sketch appears at the end of this comment. My main concern is that it makes the statement ".array is the actual data stored in the Series / Index" falseish, but that's OK.

  3. Revert the breaking changes to Series.values for period and interval dtype data (cc @jschendel)? I think we should do this.

In [3]: sper = pd.Series(pd.period_range('2000', periods=4))

In [4]: sper.values  # on master this is the PeriodArray
Out[4]:
array([Period('2000-01-01', 'D'), Period('2000-01-02', 'D'),
       Period('2000-01-03', 'D'), Period('2000-01-04', 'D')], dtype=object)

In [5]: sper.array
Out[5]:
<PeriodArray>
['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04']
Length: 4, dtype: period[D]

In terms of LOC, it's a simple change:

@@ -1984,6 +1984,16 @@ class ExtensionBlock(NonConsolidatableMixIn, Block):
         return blocks, mask


+class ObjectValuesExtensionBlock(ExtensionBlock):
+    """Block for Interval / Period data.
+
+    Only needed for backwards compatibility to ensure that
+    Series[T].values is an ndarray of objects.
+    """
+    def external_values(self, dtype=None):
+        return self.values.astype(object)
+
+
 class NumericBlock(Block):
     __slots__ = ()
     is_numeric = True
@@ -3004,6 +3014,8 @@ def get_block_type(values, dtype=None):

     if is_categorical(values):
         cls = CategoricalBlock
+    elif is_interval_dtype(dtype) or is_period_dtype(dtype):
+        cls = ObjectValuesExtensionBlock

There are a couple of other places (like Series._ndarray_values) that assume "extension dtype means .values is an ExtensionArray"; I've surfaced those on my DatetimeArray branch. We'll need to update them to use .array anyway.
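
For point 2, here is a minimal sketch of what a NumPy-backed ExtensionArray could look like. The class names are illustrative only (this is not the implementation that eventually shipped), and several required interface methods (take, _concat_same_type, _from_factorized, ...) are omitted for brevity.

import numpy as np
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class NumpyDtype(ExtensionDtype):
    """Thin ExtensionDtype wrapper around a plain NumPy dtype."""

    def __init__(self, numpy_dtype):
        self._dtype = np.dtype(numpy_dtype)

    @property
    def name(self):
        return self._dtype.name

    @property
    def type(self):
        return self._dtype.type

    @classmethod
    def construct_array_type(cls):
        return NumpyBackedExtensionArray


class NumpyBackedExtensionArray(ExtensionArray):
    """ExtensionArray that simply wraps a 1-D ndarray."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        # dtype/copy handling elided in this sketch
        return cls(np.asarray(scalars))

    @property
    def dtype(self):
        return NumpyDtype(self._ndarray.dtype)

    @property
    def nbytes(self):
        return self._ndarray.nbytes

    def __len__(self):
        return len(self._ndarray)

    def __getitem__(self, item):
        result = self._ndarray[item]
        if np.ndim(result) == 0:
            return result          # scalar access
        return type(self)(result)

    def __array__(self, dtype=None):
        # This is what would make np.asarray(series.array) no-copy.
        return np.asarray(self._ndarray, dtype=dtype)

    def isna(self):
        # Assumes a float dtype for simplicity.
        return np.isnan(self._ndarray)

    def copy(self):
        return type(self)(self._ndarray.copy())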


@TomAugspurger TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 29, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 29, 2018
@shoyer
Member

shoyer commented Nov 29, 2018

Perhaps to_numpy(copy=True, dtype=None) will suffice?

This covers the use-cases that come to mind for me.

For the copy argument: presumably copy=True means always copy, copy=False means raise an error instead of copying (so you can modify the NumPy array in place and also modify the pandas object), and copy=None (the default) means copy only if necessary?

For the dtype argument: we would need to have extension array specific casting rules, e.g., so dtype='datetime64' properly casts a time-zone aware type.
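
To make those semantics concrete, here is a small sketch of the three copy modes described above. It is a hypothetical helper, not the signature the proposal further down ends up with (that one defaults to copy=False with "copy only if needed" semantics).

import numpy as np

def to_numpy_sketch(values, dtype=None, copy=None):
    # Hypothetical helper illustrating the three proposed copy modes.
    # Assumes ``values`` is already an ndarray; a real Series.to_numpy
    # would first go through ``.array`` / ``__array__``.
    result = np.asarray(values, dtype=dtype)   # no copy when the dtype matches
    if copy:                                   # copy=True: always copy
        return result.copy()
    if copy is False and not np.shares_memory(result, values):
        # copy=False: the caller wants a view; refuse to copy silently
        raise ValueError("cannot satisfy the requested dtype without a copy")
    return result                              # copy=None: copy only if needed

a = np.arange(3)
to_numpy_sketch(a)                            # no copy: returns ``a`` itself
to_numpy_sketch(a, dtype=float)               # copies, since the dtype changes
to_numpy_sketch(a, dtype=float, copy=False)   # raises ValueError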

@TomAugspurger
Contributor Author

copy=False means raise an error instead of copying

I was thinking copy=False means copy only if needed, similar to NumPy. But raising when no-copy isn't possible is also presumably useful, and supporting that via copy=False and leaving copy=None for "infer" seems fine.

@h-vetinari
Contributor

Reading through https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#accessing-the-values-in-a-series-or-index, I really think the .values -> .array shift should be done for DataFrame as well, even if there is no semantic collision like there is for the EAs.

Basically, Series and DataFrame should have the same fundamental API for interacting with index/data, so one might also consider a .to_numpy that (currently) only returns .array.
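
A DataFrame-level to_numpy did end up shipping alongside the Series one in 0.24; a quick illustration of the symmetry being asked for here (outputs in the comments are indicative only):

import pandas as pd

ser = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

ser.to_numpy()   # array([1, 2, 3])
df.to_numpy()    # 2-D ndarray, upcast to a common dtype (float64 here)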

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 5, 2018 via email

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 8, 2018
User-facing change: `Series[period].values` and `Series[interval].values`
continue to be ndarrays of objects. Recommend ``.array`` instead.

There are a handful of related places in pandas where we assumed that
``Series[EA].values`` was an EA.

Part of pandas-dev#23995
TomAugspurger added a commit that referenced this issue Dec 9, 2018
* API: Revert breaking `.values` changes

User-facing change: `Series[period].values` and `Series[interval].values`
continue to be ndarrays of objects. Recommend ``.array`` instead.

There are a handful of related places in pandas where we assumed that
``Series[EA].values`` was an EA.

Part of #23995
@TomAugspurger
Contributor Author

I think we should make it so that {Series,Index}.array is always an ExtensionArray. It's not much work (I almost have a PR ready) and gives us a lot more flexibility going forward, since we aren't tightly coupled to ndarrays.

If a user wants a NumPy array, then they can use to_numpy(), or rely on np.asarray(Series[numpy].array) being no-copy, or we can provide access to the ndarray backing the NumPyBackedExtensionArray.

One potential downside: people start sticking this inside a Series / DataFrame instead of passing the actual ndarray. That would put them into the EA interface, which we may not want, especially for wide dataframes since they wouldn't be consolidated. We'll need to carefully document around this.

Does anyone have any objections to that? cc @pandas-dev/pandas-core since this is a kind of fundamental (non-breaking) change to our 1.0 data model.
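
A quick way to sanity-check the "np.asarray(Series[numpy].array) is no-copy" claim (the exact wrapper class name varies by pandas version, so only the memory check matters here):

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5))

arr = ser.array            # a thin NumPy-backed ExtensionArray wrapper
as_np = np.asarray(arr)    # hands back the underlying ndarray

# Expected to be True: the wrapper shares its buffer with the Series data.
np.shares_memory(as_np, ser.values)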

@shoyer
Member

shoyer commented Dec 10, 2018

One potential downside: people start sticking this inside a Series / DataFrame instead of passing the actual ndarray. That would put them into the EA interface, which we may not want, especially for wide dataframes since they wouldn't be consolidated. We'll need to carefully document around this.

We could also say that as a special case NumpyExtensionArray gets unwrapped in Series/DataFrame.

Being able to actually use a NumpyExtensionArray offers different advantages/disadvantages:

  • Some people would like the "non-consolidation" guarantee offered by a true NumpyExtensionArray
  • But performance would suffer for many common operations because they need to go through the extension array interface.

@TomAugspurger
Contributor Author

But performance would suffer for many common operations because they need to go through the extension array interface.

This will make for a good test of just how much overhead we have :) Aside from non-consolidation, I suspect it won't be much, but best to verify.
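
A rough way to measure that overhead is sketched below. Since the plain NumPy-backed wrapper is typically unwrapped by the Series constructor, the nullable Float64 extension dtype (which arrived later, in pandas 1.2) is used here purely as a stand-in for "a column that dispatches through the EA interface"; the numbers are machine-dependent and only the relative comparison matters.

import timeit

import numpy as np
import pandas as pd

values = np.random.randn(1_000_000)
plain = pd.Series(values)                 # ordinary float64 ndarray block
ea = pd.Series(values, dtype="Float64")   # ExtensionArray-backed column

for name, s in [("ndarray-backed", plain), ("EA-backed", ea)]:
    per_op = timeit.timeit(lambda s=s: s + 1, number=20) / 20
    print(f"{name:>15}: {per_op:.4f}s per (s + 1)")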

@TomAugspurger
Contributor Author

My WIP branch is at master...TomAugspurger:numpy-ea, FYI, in case anyone wants to work on it. I'm moving back to fighting with serialization issues in #24024 right now.

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 16, 2018

The final piece that hasn't been addressed is the signature of Series.to_numpy().

I propose the following:

def to_numpy(self, dtype=None, copy=False):
    """
    Convert the Series to a :class:`numpy.ndarray`.

    By default, this requires no coercion or copying of data
    for Series backed by a NumPy array. For Series backed by
    an ExtensionArray, coercion or copying may be required if
    NumPy cannot natively hold the values of the array.

    Parameters
    ----------
    dtype : numpy.dtype
        The NumPy dtype to pass to :func:`numpy.array`.
    copy : bool, default False
        Whether to copy the underlying data.

    Returns
    -------
    ndarray
    """
    result = np.array(self.array, dtype=dtype, copy=copy)
    return result

I think that'll cover most of the use cases. In particular, it:

  1. Delegates the array -> ndarray conversion to the ExtensionArray, which should handle it via the __array__ method if it wants a say in how the conversion occurs.
  2. Allows control of the fidelity, e.g. converting Series[datetime64[ns, tz]] to an ndarray of Timestamp objects with dtype=object or an ndarray of datetimes with dtype='M8[ns]' (see the sketch after this list).
  3. Avoids **kwargs.
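
To illustrate the fidelity control in point 2, a small sketch. The exact casting rules for tz-aware data continued to evolve after this proposal, so the explicit UTC conversion below stands in for what a dtype='datetime64[ns]' cast would target.

import pandas as pd

ser = pd.Series(pd.date_range("2000", periods=3, tz="US/Eastern"))

ser.to_numpy(dtype=object)
# object ndarray of tz-aware Timestamp objects: no information loss

ser.dt.tz_convert("UTC").dt.tz_localize(None).to_numpy()
# datetime64[ns] ndarray normalized to UTC: the result a
# to_numpy(dtype='M8[ns]') cast would be expected to produce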

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 18, 2018
This is part 1 of pandas-dev#23995

We make the signature of

`to_numpy(dtype : Union[str, np.dtype], copy : bool) -> ndarray`
@TomAugspurger
Contributor Author

I think that everything here has been addressed.

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* API: Revert breaking `.values` changes

User-facing change: `Series[period].values` and `Series[interval].values`
continue to be ndarrays of objects. Recommend ``.array`` instead.

There are a handful of related places in pandas where we assumed that
``Series[EA].values`` was an EA.

Part of pandas-dev#23995