Skip to content

ENH: add NDArrayBackedExtensionArray to public API #56755

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 47 commits into from

Conversation

andrewgsavage
Copy link

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Reviving #45544 as I'd like to use it for an UncertaintyArray.

tswast and others added 30 commits January 21, 2022 17:16
@mroeschke mroeschke requested a review from jbrockmendel January 8, 2024 21:18
@jbrockmendel
Copy link
Member

Can we make this "public" with some kind of documentation that we may adjust it in non-user-facing ways more aggressively than we do with regular APIs?

@andrewgsavage
Copy link
Author

Can we make this "public" with some kind of documentation that we may adjust it in non-user-facing ways more aggressively than we do with regular APIs?

added a note to the class docstring

@andrewgsavage
Copy link
Author

I'm unsure what to do about this error as to_numpy is an inherited method:

Error: /home/runner/work/pandas/pandas/pandas/core/arrays/base.py:542:EX01:pandas.api.extensions.NDArrayBackedExtensionArray.to_numpy:No examples section found

Also not sure what to do about

Error: /home/runner/work/pandas/pandas/pandas/core/arrays/datetimes.py:179:EX03:pandas.arrays.DatetimeArray:flake8 error: line 2, col 4: E121 continuation line under-indented for hanging indent

as I haven't changed that file.

@MichaelTiemannOSC
Copy link
Contributor

Given that the Pandas team was finalizing the 2.2 release, you may have hit the docs system at a bad time. I've had that happen to me. Might make sense to try merging with latest and resubmitting.

@MichaelTiemannOSC
Copy link
Contributor

I tried jumping in to see if I could fix the missing toctree problem, but two things defeated me:

  1. In the normal pandas build process (running doc/make.py --warnings-are-errors) a child process segfaulted on Mac OS 14.1.2 in a Python 3.11.4 enviroment
  2. When I tried running doc/make.py --warnings-are-errors --single reference/api/pandas.api.extensions.NDArrayBackedExtensionArray.argmax.rst (a command that failed with the toctree warning), that single file built successfully.

I'm trying to see if I can isolate a reproducer for (1) above.


Examples
--------
>>> arr = pd.array([4, 5])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't give a NDArrayBackedEA

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstrings for the other methods are the EA docstrings, so I've used an example similar to those here. Is there a better way of going about this? I can't see how to write an example using a NDArrayBackedEA without several lines initialising an ExtensionDtype and NDArrayBackedEA

def min(self, *, axis: Optional[int] = None, skipna: bool = True, **kwargs):
pandas.compat.numpy.function.validate_minnumpy_validate_min((), kwargs)
result = pandas.core.nanops.nanmin(
values=self._ndarray, axis=axis, mask=self.isna(), skipna=skipna
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is implicitly assuming that the ordering of self matches the ordering of self._ndarray

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that an issue? If someone wants control over that they could use ExtensionArray instead of NDBackedExtensionArray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a conversation with @jorisvandenbossche two ideas were clarified for me:

  1. In the case of ordering, examples include IP addresses (https://cyberpandas.readthedocs.io/en/latest/usage.html#pandas-integration), which can be ordered in a number of ways.
  2. On the topic of composing multiple Extension Arrays (e.g. PintArray and UncertaintyArray), we might look at how the composition of PintArray and Arrow Arrays (https://arrow.apache.org/docs/python/pandas.html) might work, and then argue either how we can follow those patterns or why new patterns/assumptions are needed.

If I've misunderstood either the problem or proposed next steps, happy to edit the above to correct.

Copy link
Contributor

@MichaelTiemannOSC MichaelTiemannOSC Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the ordering question, both NumPy and Pandas select min and max (and also sort) based only on the real component of complex numbers:

import pandas as pd
xx = pd.Series([4+3j, 3+10j, 2+4j])
xx.min()
# (2+4j)
xx.max()
# (4+3j)
xx.map(lambda x: abs(x))
# 0     5.000000
# 1    10.440307
# 2     4.472136
xx.sort_values()
# 2    2.0+ 4.0j
# 1    3.0+10.0j
# 0    4.0+ 3.0j
# dtype: complex128

Thus it is up to us to decide what sorting behavior applies to uncertainties (I propose ordering based on magnitude only, not error terms, except when error is NaN, in which case the value is treated as NaN).

When we look at the composition question (2, above), we will look at having the subclasses deal with this entirely (meaning we can implement an EA of units which might also have uncertain values). And of course update the documentation...

@simonjayhawkins
Copy link
Member

cc @jreback @jorisvandenbossche from original PR

@simonjayhawkins simonjayhawkins added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 4, 2024
Copy link
Contributor

github-actions bot commented Mar 9, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 9, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants