API: added array #23581

Merged
merged 57 commits into from
Dec 28, 2018
Changes from 43 commits
Commits
bfefc96
added array
TomAugspurger Nov 8, 2018
51480a3
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 9, 2018
dcb7931
update registry test
TomAugspurger Nov 9, 2018
a635649
update doc examples
TomAugspurger Nov 9, 2018
fb0d8bc
wip
TomAugspurger Nov 9, 2018
d58a320
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 9, 2018
fe06de4
inference
TomAugspurger Nov 9, 2018
72f7f06
ia updates
TomAugspurger Nov 9, 2018
c02e183
test fixup
TomAugspurger Nov 10, 2018
a2d3146
isort
TomAugspurger Nov 10, 2018
37901b0
fixups
TomAugspurger Nov 10, 2018
4403010
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 12, 2018
9401dd3
wip
TomAugspurger Nov 12, 2018
838ce5e
dtype from ea
TomAugspurger Nov 12, 2018
5260b99
series, index tests
TomAugspurger Nov 12, 2018
248e9e0
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 12, 2018
cf07c80
added ndarray case
TomAugspurger Nov 12, 2018
22490a8
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 15, 2018
5e0dc62
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 17, 2018
fe40189
added test for a 2d array
TomAugspurger Nov 17, 2018
7eb9d08
TST: test for Series[EA]
TomAugspurger Nov 17, 2018
fa7b200
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 20, 2018
1ca14fe
Added test for period -> category
TomAugspurger Nov 20, 2018
4473899
copy
TomAugspurger Nov 20, 2018
382f57d
prefix for arrays
TomAugspurger Nov 20, 2018
dd76a2b
Added arrays
TomAugspurger Nov 20, 2018
159d3a2
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 21, 2018
5366950
update docstring
TomAugspurger Nov 21, 2018
c818a8f
docstring order
TomAugspurger Nov 21, 2018
ba8b807
Revert "docstring order"
TomAugspurger Nov 21, 2018
77cd782
Updates
TomAugspurger Nov 21, 2018
dfada7b
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 27, 2018
5eff701
Add docs for the types we infer
TomAugspurger Nov 27, 2018
9406400
API: disallow string alias for NumPy
TomAugspurger Nov 27, 2018
8eb07c3
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 28, 2018
ea3a118
Wrap long error message
TomAugspurger Nov 28, 2018
ecae340
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 29, 2018
fb814fc
updates
TomAugspurger Nov 29, 2018
a6f6d29
removed old test
TomAugspurger Nov 29, 2018
6c243f3
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Nov 29, 2018
86b81b5
formatting
TomAugspurger Nov 29, 2018
2c6cf97
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 8, 2018
50d4206
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 8, 2018
9e1b4e6
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 10, 2018
000967d
Raise on scalars
TomAugspurger Dec 10, 2018
bf829c3
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 11, 2018
faf114d
docs on raising
TomAugspurger Dec 11, 2018
3186ded
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 12, 2018
1c4da0e
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 28, 2018
36c6f00
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 28, 2018
932e119
Updates for PandasArray
TomAugspurger Dec 28, 2018
45d07eb
update docstring
TomAugspurger Dec 28, 2018
d1aba73
Updates
TomAugspurger Dec 28, 2018
981f735
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 28, 2018
1f3bb50
fixed test expected
TomAugspurger Dec 28, 2018
c8d3960
doc lint
TomAugspurger Dec 28, 2018
1b9e251
Merge remote-tracking branch 'upstream/master' into pd.array
TomAugspurger Dec 28, 2018
75 changes: 75 additions & 0 deletions doc/source/api.rst
@@ -702,6 +702,19 @@ strings and apply several methods to it. These can be accessed like
Series.dt
Index.str


.. _api.arrays:

Arrays
------

Pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).

.. autosummary::
:toctree: generated/

array

.. _api.categorical:

Categorical
@@ -790,6 +803,65 @@ following usable methods and properties:
Series.cat.as_ordered
Series.cat.as_unordered

.. _api.arrays.integerna:

Integer-NA
~~~~~~~~~~

:class:`arrays.IntegerArray` can hold integer data, potentially with missing
values.

.. autosummary::
:toctree: generated/

arrays.IntegerArray

.. _api.arrays.interval:

Interval
~~~~~~~~

:class:`IntervalArray` is an array for storing data representing intervals.
The scalar type is an :class:`Interval`. These may be stored in a :class:`Series`
or as an :class:`IntervalIndex`. :class:`IntervalArray` can be closed on the
``'left'``, ``'right'``, ``'both'``, or ``'neither'`` sides.
See :ref:`indexing.intervallindex` for more.

.. currentmodule:: pandas

.. autosummary::
:toctree: generated/

IntervalArray

.. _api.arrays.period:

Period
~~~~~~

Periods represent a span of time (e.g. the year 2000, or the hour from 11:00 to 12:00
on January 1st, 2000). A collection of :class:`Period` objects with a common frequency
can be stored in a :class:`PeriodArray`. See :ref:`timeseries.periods` for more.

.. autosummary::
:toctree: generated/

arrays.PeriodArray

Sparse
~~~~~~

Sparse data may be stored and operated on more efficiently when there is a single value
that's often repeated. :class:`SparseArray` is a container for this type of data.
See :ref:`sparse` for more.

.. _api.arrays.sparse:

.. autosummary::
:toctree: generated/

SparseArray

Plotting
~~~~~~~~

@@ -1676,6 +1748,7 @@ IntervalIndex Components
IntervalIndex.get_indexer
IntervalIndex.set_closed
IntervalIndex.overlaps
IntervalArray.to_tuples


.. _api.multiindex:
@@ -1907,6 +1980,8 @@ Methods
PeriodIndex.strftime
PeriodIndex.to_timestamp

.. _api.scalars:

Scalars
-------

22 changes: 22 additions & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -152,6 +152,28 @@ Reduction and groupby operations such as 'sum' work.

The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.

.. _whatsnew_0240.enhancements.array:

A new top-level method :func:`array` has been added for creating arrays (:issue:`22860`).
This can be used to create any :ref:`extension array <extending.extension-types>`, including
extension arrays registered by :ref:`3rd party libraries <ecosystem.extensions>`, or to
create NumPy arrays.

Contributor

specify this as 1D here?


.. ipython:: python

pd.array([1, 2, np.nan], dtype='Int64')
pd.array(['a', 'b', 'c'], dtype='category')
pd.array([1, 2])

Notice that if no ``dtype`` is specified, the type of array is inferred from
the data. In particular, note that the first example of ``[1, 2, np.nan]``
would have returned a floating-point NumPy array, since ``NaN`` is a float.

.. ipython:: python

pd.array([1, 2, np.nan])

.. _whatsnew_0240.enhancements.read_html:

``read_html`` Enhancements
1 change: 1 addition & 0 deletions pandas/__init__.py
@@ -49,6 +49,7 @@
from pandas.io.api import *
from pandas.util._tester import test
import pandas.testing
import pandas.arrays

# use the closest tagged version if possible
from ._version import get_versions
17 changes: 17 additions & 0 deletions pandas/arrays/__init__.py
@@ -0,0 +1,17 @@
"""
All of pandas' ExtensionArrays and ExtensionDtypes.

See :ref:`extending.extension-types` for more.
"""
from pandas.core.arrays import (
IntervalArray, PeriodArray, Categorical, SparseArray, IntegerArray,
)


__all__ = [
'Categorical',
'IntegerArray',
'IntervalArray',
'PeriodArray',
'SparseArray',
]
19 changes: 18 additions & 1 deletion pandas/core/api.py
@@ -4,9 +4,26 @@

import numpy as np

from pandas.core.arrays import IntervalArray
from pandas.core.arrays.integer import (
Int8Dtype,
Int16Dtype,
Int32Dtype,
Int64Dtype,
UInt8Dtype,
UInt16Dtype,
UInt32Dtype,
UInt64Dtype,
)
from pandas.core.algorithms import factorize, unique, value_counts
from pandas.core.dtypes.missing import isna, isnull, notna, notnull
from pandas.core.arrays import Categorical
from pandas.core.dtypes.dtypes import (
CategoricalDtype,
PeriodDtype,
IntervalDtype,
DatetimeTZDtype,
)
from pandas.core.arrays import Categorical, array
from pandas.core.groupby import Grouper
from pandas.io.formats.format import set_eng_float_format
from pandas.core.index import (Index, CategoricalIndex, Int64Index,
1 change: 1 addition & 0 deletions pandas/core/arrays/__init__.py
@@ -1,3 +1,4 @@
from .array_ import array # noqa
from .base import (ExtensionArray, # noqa
ExtensionOpsMixin,
ExtensionScalarOpsMixin)
184 changes: 184 additions & 0 deletions pandas/core/arrays/array_.py
@@ -0,0 +1,184 @@
import numpy as np

from pandas._libs import lib, tslibs

from pandas.core.dtypes.common import is_extension_array_dtype
from pandas.core.dtypes.dtypes import registry
from pandas.core.dtypes.generic import ABCIndexClass, ABCSeries

from pandas import compat


def array(data,         # type: Sequence[object]
          dtype=None,   # type: Optional[Union[str, np.dtype, ExtensionDtype]]
          copy=True,    # type: bool
          ):
    # type: (...) -> Union[np.ndarray, ExtensionArray]
"""
Create an array.

.. versionadded:: 0.24.0

Parameters
----------
data : Sequence of objects
The scalars inside `data` should be instances of the
scalar type for `dtype`.

When `data` is an Index or Series, the underlying array
will be extracted from `data`.

dtype : str, np.dtype, or ExtensionDtype, optional
The dtype to use for the array. This may be a NumPy
dtype or an extension type registered with pandas using
:meth:`pandas.api.extensions.register_extension_dtype`.

If not specified, there are two possibilities:

1. When `data` is a :class:`Series`, :class:`Index`, or
:class:`ExtensionArray`, the `dtype` will be taken
from the data.
2. Otherwise, pandas will attempt to infer the `dtype`
from the data.

Member

I find this statement a bit misleading, as we actually don't infer from scalars (at least not in the sense of eg how Series does it), we only use numpy's inference?

If we say that we infer, I would expect those to do the same:

In [3]: pd.array([pd.Timestamp("2012-01-01")])                                                                                                                                                                      
Out[3]: array([Timestamp('2012-01-01 00:00:00')], dtype=object)

In [5]: pd.array([pd.Timestamp("2012-01-01")], dtype='datetime64[ns]')                                                                                                                                              
Out[5]: array(['2012-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

Member

Ah, looking now further down in the implementation :-) I see you do handle this case for Period, so here it indeed infers:

In [6]: pd.array([pd.Period('2012-01-01', freq='D')])                                                                                                                                                               
Out[6]: 
<PeriodArray>
['2012-01-01']
Length: 1, dtype: period[D]

Contributor Author

I haven't stated it explicitly, but it would be nice if 3rd parties could eventually hook into this as well. Right now I think it's just Period (and maybe interval?) that get inferred. Maybe timestamps with timezones once DatetimeArray is done.

Member

And not Timestamps without timezones?

Contributor Author
TomAugspurger Nov 21, 2018

It seems that lib.infer_dtype doesn't distinguish the two

In [5]: lib.infer_dtype([pd.Timestamp('2017', tz='utc')])
Out[5]: 'datetime'

In [6]: lib.infer_dtype([pd.Timestamp('2017', tz='US/Central')])
Out[6]: 'datetime'

I've added interval to what we'll infer. Perhaps we should be explicit in the docs for that? Though that kinda closes the door to inference for 3rd party arrays.

Member

> Perhaps we should be explicit in the docs for that?

Yes, I would do that.
In the longer term, would we want the inference happening here to be the same as the inference happening in Series? Or would that add too many corner cases from there that we don't want to carry over?

> Though that kinda closes the door to inference for 3rd party arrays.

How would you envision third-party arrays participating in inference? That seems a bit difficult in any case (trying out all registered ones, ...?), and IMO more error-prone for users (if you forget to import the 3rd party library, you silently get different results).

Contributor Author

> In the longer term, would we want the inference happening here to be the same as the inference happening in Series? Or would that add too many corner cases from there that we don't want to carry over?

That's what I want (long term).

> How would you envision third-party arrays participating in inference?

Haven't thought about it beyond "check for 3rd party scalar types". I'm not familiar with how infer_dtype works.
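
For readers following this thread, here is a minimal sketch of the labels that scalar inference produces. It assumes the public `pandas.api.types.infer_dtype` behaves like the private `lib.infer_dtype` used in the diff; the variable names are illustrative only and not part of the PR.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Illustrative inputs (names are hypothetical, not from the PR).
naive = [pd.Timestamp("2017")]
aware = [pd.Timestamp("2017", tz="UTC")]
periods = [pd.Period("2012-01-01", freq="D")]
intervals = [pd.Interval(0, 1)]

# infer_dtype returns a string label describing the scalars it sees.
labels = {
    "naive": infer_dtype(naive),
    "aware": infer_dtype(aware),
    "periods": infer_dtype(periods),
    "intervals": infer_dtype(intervals),
}

for name, label in labels.items():
    print(name, "->", label)
```

The point of the sketch is that naive and tz-aware timestamps share one label, as reported above, while periods and intervals each get a label of their own, which is what lets `array` dispatch on them.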


Note that when `data` is a NumPy array, ``data.dtype`` is
*not* used for inferring the array type. This is because
NumPy cannot represent all the types of data that can be
held in extension arrays.

Currently, pandas will infer an extension dtype for sequences of

========================== ==================================
Scalar Type                Array Type
========================== ==================================
* :class:`pandas.Interval` :class:`pandas.IntervalArray`
* :class:`pandas.Period`   :class:`pandas.arrays.PeriodArray`
========================== ==================================

For all other cases, NumPy's usual inference rules will be used.

To avoid *future* breaking changes, pandas recommends using actual
dtypes, and not string aliases, for `dtype`. In other words, use

>>> pd.array([1, 2, 3], dtype=np.dtype("int32"))
array([1, 2, 3], dtype=int32)

rather than

>>> pd.array([1, 2, 3], dtype="int32")
array([1, 2, 3], dtype=int32)

If and when pandas switches to a different backend for storing arrays,
the meaning of the string aliases will change, while the actual
dtypes will be unambiguous.

copy : bool, default True
Whether to copy the data, even if not necessary. Depending
on the type of `data`, creating the new array may require
copying data, even if ``copy=False``.

Returns
-------
array : Union[numpy.ndarray, ExtensionArray]

See Also
--------
numpy.array : Construct a NumPy array.
Series : Construct a pandas Series.

Notes
-----
Omitting the `dtype` argument means pandas will attempt to infer the
best array type from the values in the data. As new array types are
added by pandas and 3rd party libraries, the "best" array type may
change. We recommend specifying `dtype` to ensure that

1. the correct array type for the data is returned
2. the returned array type doesn't change as new extension types
are added by pandas and third-party libraries

Examples
--------
If a dtype is not specified, `data` is passed through to
:meth:`numpy.array`, and an ``ndarray`` is returned.

>>> pd.array([1, 2])
array([1, 2])

Or the NumPy dtype can be specified

>>> pd.array([1, 2], dtype=np.dtype("int32"))
array([1, 2], dtype=int32)

You can use the string alias for `dtype`

>>> pd.array(['a', 'b', 'a'], dtype='category')
[a, b, a]
Categories (2, object): [a, b]

Or specify the actual dtype

>>> pd.array(['a', 'b', 'a'],
... dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
[a, b, a]
Categories (3, object): [a < b < c]

Because omitting the `dtype` passes the data through to NumPy,
a mixture of valid integers and NA will return a floating-point
NumPy array.

>>> pd.array([1, 2, np.nan])
array([ 1., 2., nan])

To use pandas' nullable :class:`pandas.arrays.IntegerArray`, specify
the dtype:

>>> pd.array([1, 2, np.nan], dtype='Int64')
IntegerArray([1, 2, nan], dtype='Int64')

Pandas will infer an ExtensionArray for some types of data:

>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
<PeriodArray>
Contributor

can you add an example which raises for scalars, 2d?

['2000-01-01', '2000-01-01']
Length: 2, dtype: period[D]
"""
    from pandas.core.arrays import (
        period_array, ExtensionArray, IntervalArray
    )

    if isinstance(data, (ABCSeries, ABCIndexClass)):
        data = data._values

    if dtype is None and isinstance(data, ExtensionArray):
        dtype = data.dtype

    # this returns None for not-found dtypes.
    if isinstance(dtype, compat.string_types):
        dtype = registry.find(dtype) or dtype

Member

@TomAugspurger do you remember if there was any particular reason for using this pattern instead of dtype = pandas_dtype(dtype)?

Contributor Author
TomAugspurger Apr 5, 2023

I don't recall. I wonder if this predates pandas_dtype handling extension dtypes.
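
As a rough illustration of the alternative raised in this thread, the sketch below resolves string aliases with the public `pandas.api.types.pandas_dtype`. It assumes a pandas release in which `pandas_dtype` handles extension dtypes, which may not have been true when this PR was written; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

# Extension aliases resolve to ExtensionDtype instances.
int64_na = pandas_dtype("Int64")
category = pandas_dtype("category")

# NumPy aliases resolve to plain np.dtype instances.
int32 = pandas_dtype("int32")

print(type(int64_na).__name__, type(category).__name__, repr(int32))
```

Unlike the `registry.find(dtype) or dtype` pattern, which leaves an unrecognised string untouched for NumPy to interpret later, this resolves both kinds of alias in one call.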


    if is_extension_array_dtype(dtype):
        cls = dtype.construct_array_type()
        return cls._from_sequence(data, dtype=dtype, copy=copy)

    if dtype is None:
        inferred_dtype = lib.infer_dtype(data)
        if inferred_dtype == 'period':
            try:
                return period_array(data, copy=copy)
            except tslibs.IncompatibleFrequency:
                # We may have a mixture of frequencies.
                # We choose to return an ndarray, rather than raising.
                pass
        elif inferred_dtype == 'interval':
            try:
                return IntervalArray(data, copy=copy)
            except ValueError:
                # We may have a mixture of `closed` here.
                # We choose to return an ndarray, rather than raising.
                pass

        # TODO(DatetimeArray): handle this type
        # TODO(BooleanArray): handle this type

    return np.array(data, dtype=dtype, copy=copy)
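
The dispatch order above can be exercised from the public API. The following is a usage sketch, not part of the PR: it assumes a released pandas (0.24 or later) and sticks to cases whose result type should not depend on the fallback branch. The scalar check comes from the later "Raise on scalars" commit (000967d), which is not among the 43 commits shown in this diff.

```python
import numpy as np
import pandas as pd

# Explicit extension dtype (given by its string alias): the data is handed
# to that dtype's array type.
ints = pd.array([1, 2, np.nan], dtype="Int64")

# No dtype, all scalars are Periods with one frequency: a PeriodArray
# is inferred.
periods = pd.array([pd.Period("2000", freq="D"), pd.Period("2000", freq="D")])

# Series input: the underlying array is extracted and its dtype reused.
cats = pd.array(pd.Series(["a", "b", "a"], dtype="category"))

# Scalars are rejected in released pandas (assumption: behaviour added by
# the later "Raise on scalars" commit, not by the code shown above).
try:
    pd.array(1)
    scalar_raised = False
except ValueError:
    scalar_raised = True

print(type(ints).__name__, type(periods).__name__, type(cats).__name__,
      scalar_raised)
```

The sketch leaves out calls without a `dtype` on plain numbers, because the array type returned for those has changed between pandas releases.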
2 changes: 2 additions & 0 deletions pandas/core/arrays/interval.py
@@ -81,7 +81,9 @@
from_arrays
from_tuples
from_breaks
overlaps
set_closed
to_tuples
%(extra_methods)s\

See Also